Comments (8)
Hello @amarouni -- can you please elaborate a bit on the exact timing and order in which your nodes perform:
- join
- start
- stop
And what do you mean by "node is started" -- does it mean the node has joined the cluster?
from constructr.
@aleksandr-vin Here's a rough timeline of what we're seeing:
- node1 is started (docker container started)
- node1 adds itself as a seed node and then joins itself
- Akka cluster started with this single node, and everything works as expected
- handover starts: node2 is started (docker container started) and node1 is shut down (docker container killed)
That's where we noticed this race condition: sometimes node2 starts before node1 has completely shut down, and it picks up node1's address (as seed node) from ZooKeeper.
- node2 tries to join the non-existent node1
- node2 times out
- Akka cluster is shut down
Hope this helps
from constructr.
I can also confirm the issue. My team is considering adopting ConstructR as an akka-cluster discovery solution, and to this end we have examined the implementation and have come up with a couple of suggestions for improvement (will create corresponding issues soon), which of course we are more than happy to implement and contribute back to the project.
Back to the issue at hand: the problem here is that when a node gets the list of nodes from the coordination service, it performs a joinSeedNodes on that list and waits until it joins the cluster. If the operation times out (specifically, if the Joining FSM state times out), then the ConstructrMachine stops itself, which leads to the termination of the actor-system. This will be the case in a scenario such as the one described by the OP, where a first node joins, then fails, and a second node tries to join before the TTL of the first node's DB entry expires. The second node will receive a list of size 1, containing the node that is "dead", and will try to join it unsuccessfully until the FSM state timeout ends up killing it as well.
According to @hseeberger, this choice is made in order to fail fast; however, for some users (us included), it may not be acceptable to terminate the actor-system unless the error is truly catastrophic. This case, however, is about something that can easily occur, and can be worked around by transitioning the ConstructrMachine back to the GettingNodes state. This will allow it to either eventually receive an updated list of nodes that are actually online, or obtain a lock to add itself as the first seed node and join itself to form the cluster.
Of course, given that some users may actually want the original behavior (i.e. terminate the actor-system on timeout), the choice of how to handle the joining "timeout" should probably be configuration driven. Also, there could be some extra configuration that imposes an upper limit on how many times to perform this loop in the presence of consecutive failures (i.e. get nodes -> join seed nodes -> join timeout -> get nodes again etc).
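To make the proposal concrete, here is a minimal sketch of such a configuration-driven decision. Everything in it is hypothetical: `JoinTimeoutPolicy`, `maxRetries` and `onJoinTimeout` are illustrative names, not existing ConstructR settings or API, and `Stopped` merely stands in for the machine stopping itself.

```scala
// Hypothetical sketch: none of these names exist in ConstructR.
object JoinTimeoutSketch {
  sealed trait State
  case object GettingNodes extends State
  case object Stopped extends State // stand-in for "ConstructrMachine stops itself"

  // Assumed setting: how many join timeouts to tolerate before giving up.
  // maxRetries = 0 reproduces the current fail-fast behaviour.
  final case class JoinTimeoutPolicy(maxRetries: Int)

  // Decide where to go when the Joining state times out.
  // timeoutsSoFar counts the join timeouts that happened before this one.
  def onJoinTimeout(policy: JoinTimeoutPolicy, timeoutsSoFar: Int): State =
    if (timeoutsSoFar < policy.maxRetries) GettingNodes else Stopped
}
```

With `maxRetries = 0` the first timeout stops the machine, as it does today; with a positive value the machine would loop back to GettingNodes until the cap is reached.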
Your thoughts on the above?
from constructr.
The problem here is that joinSeedNodes repeatedly continues trying to join in the background (by default every 5 seconds or so). So transitioning back into GettingNodes leads to an inconsistent state. This would only work if we used some other (non-repeating) way to join.
from constructr.
joinSeedNodes restarts the repeated attempts with its given input every time it is called; i.e. executing joinSeedNodes(A) and then joinSeedNodes(B) will have the effect that the actor-system stops trying to connect to A and starts trying to connect to B. I have confirmed this by going through the akka source code, and I also made a proof of concept earlier today that demonstrates the desired scenario. By transitioning back to GettingNodes, the FSM will receive the node list from the backend once again and eventually switch to Joining, where it will re-execute joinSeedNodes. If the list is not yet refreshed (i.e. the downed node's key has not yet expired), the joining process will time out again, thus re-triggering the process from scratch. This loop will stop as soon as the downed node's key expires and the second node registers itself as the new seed.
Obviously, I will add an integration test that showcases this scenario.
from constructr.
What will happen when calling joinSeedNodes for a second time? (Sorry, I have no access to the source code right now.) And what if the first attempt finally succeeds concurrently, while transitioning?
from constructr.
No problem :) This is a glimpse of the internal joinSeedNodes implementation:
    def joinSeedNodes(newSeedNodes: immutable.IndexedSeq[Address]): Unit = {
      if (newSeedNodes.nonEmpty) {
        stopSeedNodeProcess()
        seedNodes = newSeedNodes // keep them for retry
        seedNodeProcess = // ...

Calling the method multiple times has the effect of stopping the previous attempt and starting a new one with the new seed sequence.
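The replace-on-call behaviour can be illustrated with a toy model. This is not Akka code and starts no real join process; `SeedJoiner`, `activeTargets` and `stoppedAttempts` are hypothetical names, and plain strings stand in for `Address`.

```scala
// Toy model only: mimics the "stop previous, start new" shape of the excerpt.
final class SeedJoiner {
  private var current: Option[Vector[String]] = None
  private var stopped: Vector[Vector[String]] = Vector.empty

  def joinSeedNodes(newSeedNodes: Vector[String]): Unit =
    if (newSeedNodes.nonEmpty) {
      // Stand-in for stopSeedNodeProcess(): record the attempt being abandoned.
      current.foreach(old => stopped = stopped :+ old)
      // Only the most recent seed list is kept for retries.
      current = Some(newSeedNodes)
    }

  def activeTargets: Option[Vector[String]] = current
  def stoppedAttempts: Vector[Vector[String]] = stopped
}
```

In this model, calling `joinSeedNodes(Vector("A"))` followed by `joinSeedNodes(Vector("B"))` leaves only `B` as the active target, and an empty list is ignored.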
Now in the case of a successful connection while transitioning, what must have happened is that the first node was down for a while, but has successfully restarted and has also joined the cluster (i.e. has joined itself and formed a cluster). If that is the case, then there is no point in "joining again" (I also don't know what effect it would have, but I will try it out since we mentioned it :P ), so we will make use of the MemberUp event reception by the FSM to guard against this corner case (i.e. if a node has become up but has also previously transitioned to GettingNodes, then upon MemberUp reception it will transition immediately to AddingSelf, as it would normally do after Joining).
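The transitions described in this thread can be sketched as a pure function. The state names come from the discussion; the event names `NodesReceived` and `JoinTimedOut` are assumptions made for illustration, and the real ConstructrMachine is an Akka FSM with more states than shown here.

```scala
// Hypothetical sketch of the proposed transitions, not the ConstructrMachine itself.
object MemberUpGuardSketch {
  sealed trait State
  case object GettingNodes extends State
  case object Joining extends State
  case object AddingSelf extends State

  sealed trait Event
  case object NodesReceived extends Event // assumed name
  case object JoinTimedOut extends Event  // assumed name
  case object MemberUp extends Event

  def next(state: State, event: Event): State = (state, event) match {
    case (GettingNodes, NodesReceived) => Joining
    case (Joining, JoinTimedOut)       => GettingNodes // proposed: retry instead of stopping
    case (Joining, MemberUp)           => AddingSelf   // normal path
    case (GettingNodes, MemberUp)      => AddingSelf   // guard: an earlier join attempt succeeded late
    case (other, _)                    => other        // ignore anything else
  }
}
```

The fourth case is the guard: a node that has already fallen back to GettingNodes still moves on to AddingSelf when a late MemberUp arrives.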
from constructr.
I was working on a solution similar to @nick-nachos's but didn't have enough time to test all corner cases. So if you can link this issue to your future PR, I'll be happy to take a look.
from constructr.