We're experimenting with akka cluster & constructr (with zookeeper) with a single

Join timing out in handover scenario about constructr HOT 8 CLOSED

amarouni commented on June 27, 2024

Join timing out in handover scenario

from constructr.

Comments (8)

aleksandr-vin commented on June 27, 2024

Hello @amarouni -- can you please elaborate a bit on the exact order and timing in which your nodes perform:

join
start
stop

And what you define by "node is started" -- does it mean the node joined the cluster?

from constructr.

amarouni commented on June 27, 2024

@aleksandr-vin Here's a rough timeline of what we're seeing:

node1 is started (docker container started)
node1 adds itself as a seed node and then joins itself
Akka cluster is started with this single node, and everything works as expected
handover starts: node2 is started (docker container started) & node1 is shut down (docker container killed)

That's where we noticed this race condition: sometimes node2 starts before node1 is completely shut down, and it picks up node1's address (as seed node) from zookeeper.

node2 tries to join the non existing node1
node2 times out
Akka cluster is shut down
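
The race in the timeline above can be illustrated with a small model. This is only a sketch: `Registry`, `Entry`, and the TTL values are hypothetical stand-ins for the zookeeper-backed seed list, not ConstructR or zookeeper APIs. It assumes entries stay visible until their TTL elapses, which is why node2 can still read node1's address after node1 has been killed.

```scala
// Hypothetical stand-in for the coordination service; times are in seconds.
final case class Entry(address: String, registeredAt: Long, ttl: Long)

final class Registry {
  private var entries = Vector.empty[Entry]

  // Adding (or refreshing) a node replaces any older entry for the same address.
  def register(address: String, now: Long, ttl: Long): Unit =
    entries = entries.filterNot(_.address == address) :+ Entry(address, now, ttl)

  // An entry remains visible until its TTL elapses, even if the node is already dead.
  def getNodes(now: Long): Vector[String] =
    entries.filter(e => now < e.registeredAt + e.ttl).map(_.address)
}

object HandoverRace {
  def demo(): (Vector[String], Vector[String]) = {
    val registry = new Registry
    registry.register("node1", now = 0L, ttl = 30L) // node1 registers itself, then is killed
    val seenDuringHandover = registry.getNodes(now = 10L) // node2 starts before the TTL expires
    val seenAfterExpiry    = registry.getNodes(now = 30L) // node2 starts after the TTL expires
    (seenDuringHandover, seenAfterExpiry)
  }
}
```

In this model a node2 that starts at second 10 is handed the dead node1 as its only seed, while a node2 that starts at second 30 gets an empty list and would be free to register itself.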

Hope this helps

nick-nachos commented on June 27, 2024

I can also confirm the issue. My team is considering adopting constructR as an akka-cluster discovery solution, and to this end we have examined the implementation and have come up with a couple of suggestions for improvement (will create corresponding issues soon), which of course we are more than happy to implement and contribute back to the project.

Back to the issue at hand: the problem here is that when a node gets the list of nodes from the coordination service, it performs a joinSeedNodes on that list and waits until it joins the cluster. If the operation times out (specifically, if the Joining FSM state times out) then the ConstructrMachine stops itself, which leads to the termination of the actor-system. This will be the case in a scenario such as the one described by the OP, where a first node joins, then fails, and a second node tries to join before the TTL of the first node's DB entry expires. The second node will receive a list of size 1, containing the node that is "dead", and will try to join it unsuccessfully until the FSM state timeout ends up killing it as well.

According to @hseeberger, this choice is made in order to fail fast. However, for some users (us included), it may not be acceptable to terminate the actor-system unless the error is indeed catastrophic. This case is about something that can easily occur, and it can be worked around by transitioning the ConstructrMachine back to the GettingNodes state. This will allow it to either eventually receive an updated list of nodes that are actually online, or obtain a lock to add itself as the first seed node and join itself to form the cluster.

Of course, given that some users may actually want the original behavior (i.e. terminate the actor-system on timeout), the choice of how to handle the joining "timeout" should probably be configuration driven. Also, there could be some extra configuration which would impose an upper limit on how many times to perform this loop in the presence of consecutive failures (i.e. get nodes -> join seed nodes -> join timeout -> get nodes again etc).
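
To make the proposal concrete, here is a minimal sketch of that loop as a pure state machine. It is not ConstructR code: the `Machine` class, the event names, and the `maxJoinRetries` setting are all hypothetical, and only the state names GettingNodes, Joining, and AddingSelf come from the discussion. Setting `maxJoinRetries` to 0 is assumed to reproduce the current fail-fast behavior.

```scala
// Hypothetical model of the proposed behavior; none of this is ConstructR's actual API.
sealed trait State
case object GettingNodes extends State
case object Joining extends State
case object AddingSelf extends State
case object Terminated extends State // stands in for stopping the actor-system

sealed trait Event
case object NodesReceived extends Event
case object JoinTimeout extends Event
case object MemberUp extends Event

// maxJoinRetries is an assumed setting: 0 means terminate on the first join timeout.
final case class Machine(state: State, joinTimeouts: Int, maxJoinRetries: Int) {
  def on(event: Event): Machine = (state, event) match {
    case (GettingNodes, NodesReceived) => copy(state = Joining)
    case (Joining, MemberUp)           => copy(state = AddingSelf, joinTimeouts = 0)
    case (Joining, JoinTimeout) if joinTimeouts < maxJoinRetries =>
      // Retry budget left: go back and fetch a (hopefully refreshed) node list.
      copy(state = GettingNodes, joinTimeouts = joinTimeouts + 1)
    case (Joining, JoinTimeout)        => copy(state = Terminated)
    case _                             => this // other combinations are ignored in this sketch
  }
}

object JoinRetryDemo {
  // Feeds the events to a fresh machine and returns the state it ends in.
  def run(maxJoinRetries: Int, events: List[Event]): State =
    events.foldLeft(Machine(GettingNodes, 0, maxJoinRetries))((m, e) => m.on(e)).state
}
```

For example, `JoinRetryDemo.run(1, List(NodesReceived, JoinTimeout, NodesReceived, MemberUp))` ends in `AddingSelf`, whereas the same run with a second `JoinTimeout` in place of `MemberUp` ends in `Terminated`.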

Your thoughts on the above?

hseeberger commented on June 27, 2024

The problem here is that joinSeedNodes keeps retrying the join in the background (by default every 5 seconds or so). So transitioning back into GettingNodes leads to an inconsistent state. This would only work if we used some other (non-repeating) way to join.

nick-nachos commented on June 27, 2024

joinSeedNodes restarts the repeated attempts with its given input every time it is called; i.e. executing joinSeedNodes(A) and then joinSeedNodes(B) will have the effect that the actor-system stops trying to connect to A and starts trying to connect to B. I have confirmed this by going through the akka source code, and I also made a proof of concept earlier today that demonstrates the desired scenario. By transitioning back to GettingNodes, the FSM will receive the node list from the backend once again, and eventually switch to Joining, where it will re-execute joinSeedNodes. If the list is not yet refreshed (i.e. the downed node's key has not yet expired), the joining process will time out again, thus re-triggering the process from scratch. This loop will stop as soon as the downed node's key expires and the second node registers itself as the new seed.

Obviously, I will add an integration test that showcases this scenario.

hseeberger commented on June 27, 2024

What will happen when calling joinSeedNodes a second time? (Sorry, I have no access to the source code right now.) And what if the first attempt finally succeeds concurrently, while we are transitioning?

nick-nachos commented on June 27, 2024

No problem :) This is a glimpse of the internal joinSeedNodes implementation:

def joinSeedNodes(newSeedNodes: immutable.IndexedSeq[Address]): Unit = {
    if (newSeedNodes.nonEmpty) {
      stopSeedNodeProcess()
      seedNodes = newSeedNodes // keep them for retry
      seedNodeProcess = // ...

Calling the method multiple times has the effect of stopping the previous attempt and starting a new one with the new seed sequence.
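
The stop-then-restart semantics can be modelled without akka. The following is a toy sketch, not akka code: `SeedJoiner` and its fields are hypothetical, addresses are plain strings, and recording the replaced seed list merely stands in for the `stopSeedNodeProcess()` call shown in the excerpt.

```scala
// Toy stand-in for the cluster daemon's seed handling; this is not akka code.
final class SeedJoiner {
  private var seedNodes: Vector[String] = Vector.empty
  private var stopped: Vector[Vector[String]] = Vector.empty

  def joinSeedNodes(newSeedNodes: Vector[String]): Unit =
    if (newSeedNodes.nonEmpty) {
      // Stands in for stopSeedNodeProcess(): the previous attempt is abandoned.
      if (seedNodes.nonEmpty) stopped = stopped :+ seedNodes
      seedNodes = newSeedNodes // only the latest list is retried from now on
    }

  def currentTargets: Vector[String] = seedNodes
  def abandonedAttempts: Vector[Vector[String]] = stopped
}

object SeedJoinerDemo {
  def demo(): SeedJoiner = {
    val joiner = new SeedJoiner
    joiner.joinSeedNodes(Vector("A"))
    joiner.joinSeedNodes(Vector("B")) // replaces the attempt against A
    joiner
  }
}
```

After `SeedJoinerDemo.demo()`, `currentTargets` contains only `"B"` and `abandonedAttempts` contains the single replaced list `Vector("A")`. As in the excerpt, an empty list is ignored and leaves the current attempt untouched.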

Now, in the case of a successful connection while transitioning, what must have happened is that the first node was down for a while, but has successfully restarted and has also joined the cluster (i.e. has joined itself and formed a cluster). If that is the case, then there is no point in "joining again" (I also don't know what effect it would have, but I will try it out since we mentioned it :P ), so we will make use of the MemberUp event received by the FSM to guard against this corner case (i.e. if a node has become up but has also previously transitioned to GettingNodes, then upon receiving MemberUp it will transition immediately to AddingSelf, as it would normally do after Joining).

amarouni commented on June 27, 2024

I was working on a solution similar to @nick-nachos's but didn't have enough time to test all corner cases. So if you can link this issue to your future PR, I'll be happy to take a look.
