Comments (8)

aleksandr-vin commented on June 27, 2024

Hello @amarouni -- can you please elaborate a bit on the exact order and timing in which your nodes perform the following:

  • join
  • start
  • stop

And what do you mean by "node is started" -- does it mean the node has joined the cluster?

from constructr.

amarouni commented on June 27, 2024

@aleksandr-vin Here's a rough timeline of what we're seeing:

  1. node1 is started (docker container started)
  2. node1 adds itself as a seed node and then joins itself
  3. Akka cluster is started with this single node, and everything works as expected
  4. handover starts: node2 is started (docker container started) and node1 is shut down (docker container killed)

That's where we noticed this race condition: sometimes node2 starts before node1 has completely shut down, and it picks up node1's address (as seed node) from ZooKeeper.

  5. node2 tries to join the non-existent node1
  6. node2 times out
  7. Akka cluster is shut down

Hope this helps

nick-nachos commented on June 27, 2024

I can also confirm the issue. My team is considering adopting ConstructR as an akka-cluster discovery solution, and to this end we have examined the implementation and come up with a couple of suggestions for improvement (will create corresponding issues soon), which we are of course more than happy to implement and contribute back to the project.

Back to the issue at hand: the problem here is that when a node gets the list of nodes from the coordination service, it calls joinSeedNodes on that list and waits until it joins the cluster. If the operation times out (specifically, if the Joining FSM state times out), then the ConstructrMachine stops itself, which leads to the termination of the actor-system. This is what happens in a scenario such as the one described by the OP, where a first node joins, then fails, and a second node tries to join before the TTL of the first node's DB entry expires. The second node receives a list of size 1, containing the node that is "dead", and tries to join it unsuccessfully until the FSM state timeout ends up killing it as well.

According to @hseeberger, this choice is made in order to fail fast; however, for some users (us included), it may not be acceptable to terminate the actor-system unless the error is truly catastrophic. This case is about something that can easily occur, and it can be worked around by transitioning the ConstructrMachine back to the GettingNodes state. This will allow it to either eventually receive an updated list of nodes that are actually online, or obtain a lock to add itself as the first seed node and join itself to form the cluster.

Of course, given that some users may actually want the original behavior (i.e. terminate the actor-system on timeout), the choice of how to handle the joining "timeout" should probably be configuration driven. Also, there could be some extra configuration that imposes an upper limit on how many times to perform this loop in the presence of consecutive failures (i.e. get nodes -> join seed nodes -> join timeout -> get nodes again, etc.).
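
A minimal sketch of how such a configuration-driven choice could be modeled, written as plain Scala with no Akka dependency. Everything in it (JoinTimeoutPolicy, FailFast, Retry, maxAttempts, failedAttempts) is a hypothetical name made up for illustration and is not part of ConstructR; only the state names GettingNodes and Joining come from the discussion above.

```scala
// Toy model only: hypothetical names, not ConstructR's actual API or configuration.
object JoinTimeoutPolicy {
  sealed trait State
  case object GettingNodes extends State
  case object Joining extends State
  case object Stopped extends State

  sealed trait OnJoinTimeout
  // Original behavior: stop the machine as soon as Joining times out.
  case object FailFast extends OnJoinTimeout
  // Proposed behavior: go back to GettingNodes, but allow at most
  // `maxAttempts` consecutive join attempts before giving up.
  final case class Retry(maxAttempts: Int) extends OnJoinTimeout

  // Decide the next state when the Joining state times out.
  // `failedAttempts` counts consecutive join timeouts, including the current one.
  def next(policy: OnJoinTimeout, failedAttempts: Int): State = policy match {
    case FailFast                           => Stopped
    case Retry(max) if failedAttempts < max => GettingNodes
    case Retry(_)                           => Stopped
  }
}
```

With Retry(3), the first and second timeouts send the machine back to GettingNodes and the third one stops it, so FailFast behaves like Retry(1).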

Your thoughts on the above?

hseeberger commented on June 27, 2024

The problem here is that joinSeedNodes keeps retrying the join in the background (by default every 5 seconds or so). So transitioning back into GettingNodes leads to an inconsistent state. This would only work if we used some other (non-repeating) way to join.

nick-nachos commented on June 27, 2024

joinSeedNodes restarts the repeated attempts with its given input every time it is called; i.e. executing joinSeedNodes(A) and then joinSeedNodes(B) has the effect that the actor-system stops trying to connect to A and starts trying to connect to B. I have confirmed this by going through the akka source code, and I also made a proof of concept earlier today that demonstrates the desired scenario. By transitioning back to GettingNodes, the FSM will receive the node list from the backend once again and eventually switch to Joining, where it will re-execute joinSeedNodes. If the list is not yet refreshed (i.e. the downed node's key has not yet expired), the joining process will time out again, thus re-triggering the process from scratch. This loop will stop as soon as the downed node's key expires and the second node registers itself as the new seed.
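
To make the loop concrete, here is a rough simulation in plain Scala. It is not ConstructR or Akka code: StaleSeedLoop, timeoutsUntilSeed, staleTtl and cycleLength are hypothetical names, and time is counted in abstract units. It assumes each "get nodes -> join -> timeout" cycle takes a fixed amount of time and that the downed node's key disappears exactly when its TTL runs out.

```scala
// Toy simulation only: hypothetical names and abstract time units.
object StaleSeedLoop {
  // Number of join timeouts the second node goes through before the stale
  // key of the downed node has expired and it can register itself as seed.
  //   staleTtl    - remaining lifetime of the downed node's key
  //   cycleLength - duration of one "get nodes -> join -> timeout" cycle
  def timeoutsUntilSeed(staleTtl: Int, cycleLength: Int): Int = {
    require(cycleLength > 0, "cycleLength must be positive")
    var now = 0
    var timeouts = 0
    // While the stale key is still listed, every cycle ends in a join timeout.
    while (now < staleTtl) {
      timeouts += 1
      now += cycleLength
    }
    timeouts
  }
}
```

For example, with a remaining TTL of 30 and a cycle of 10 the node times out three times before it finds the list empty, which is why an upper bound on retries only makes sense if it is chosen with the TTL in mind.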

Obviously, I will add an integration test that showcases this scenario.

hseeberger commented on June 27, 2024

What will happen when joinSeedNodes is called a second time? (Sorry, I have no access to the source code right now.) And what if the first attempt finally succeeds concurrently, while we are transitioning?

nick-nachos commented on June 27, 2024

No problem :) This is a glimpse of the internal joinSeedNodes implementation:

def joinSeedNodes(newSeedNodes: immutable.IndexedSeq[Address]): Unit = {
    if (newSeedNodes.nonEmpty) {
      stopSeedNodeProcess()
      seedNodes = newSeedNodes // keep them for retry
      seedNodeProcess = // ...

Calling the method multiple times has the effect of stopping the previous attempt and starting a new one with the new seed sequence.
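
The stop-then-replace behavior can be illustrated with a small stand-in that does not use Akka at all. SeedJoiner, activeTarget and stoppedTargets are hypothetical names invented for this sketch, and plain strings stand in for Address; only the shape of joinSeedNodes (ignore an empty list, stop the old attempt, remember the new seeds) follows the excerpt above.

```scala
// Toy stand-in for the behavior described above; it is not Akka's implementation.
final class SeedJoiner {
  // Seed list of the attempt currently in progress, if any.
  private var current: Option[Vector[String]] = None
  // Seed lists of attempts that were stopped because a newer call replaced them.
  private var stopped: Vector[Vector[String]] = Vector.empty

  def joinSeedNodes(newSeedNodes: Vector[String]): Unit =
    if (newSeedNodes.nonEmpty) {
      // Stop whatever attempt is in progress before starting the new one.
      current.foreach(seeds => stopped = stopped :+ seeds)
      current = Some(newSeedNodes)
    }

  def activeTarget: Option[Vector[String]] = current
  def stoppedTargets: Vector[Vector[String]] = stopped
}
```

Calling joinSeedNodes(Vector("A")) followed by joinSeedNodes(Vector("B")) leaves B as the only active target and records A as stopped; a call with an empty list changes nothing.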

Now, in the case of a successful connection while transitioning, what must have happened is that the first node was down for a while, but has successfully restarted and rejoined (i.e. it has joined itself and formed a cluster). If that is the case, then there is no point in "joining again" (I also don't know what effect it would have, but I will try it out since we mentioned it :P ), so we will make use of the MemberUp event received by the FSM to guard against this corner case (i.e. if a node has become up but has also previously transitioned to GettingNodes, then upon receiving MemberUp it will transition immediately to AddingSelf, as it would normally do after Joining).
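
A sketch of that guard as a pure transition function, again in plain Scala and only under assumptions: the state names GettingNodes, Joining and AddingSelf and the MemberUp event come from this thread, while MemberUpGuard, JoinTimeout and the function next are hypothetical and do not reflect how ConstructrMachine is actually written.

```scala
// Toy transition table only: hypothetical names, not the real ConstructrMachine.
object MemberUpGuard {
  sealed trait State
  case object GettingNodes extends State
  case object Joining extends State
  case object AddingSelf extends State

  sealed trait Event
  case object MemberUp extends Event
  case object JoinTimeout extends Event

  def next(state: State, event: Event): State = (state, event) match {
    // Normal path: the join succeeded while we were waiting in Joining.
    case (Joining, MemberUp) => AddingSelf
    // Corner case: an earlier join attempt succeeded after we had already
    // gone back to GettingNodes, so skip the second join entirely.
    case (GettingNodes, MemberUp) => AddingSelf
    // Proposed retry: a join timeout sends us back to fetch the node list.
    case (Joining, JoinTimeout) => GettingNodes
    // Any other combination leaves the state unchanged.
    case (other, _) => other
  }
}
```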

amarouni commented on June 27, 2024

I was working on a solution similar to @nick-nachos's but didn't have enough time to test all the corner cases. So if you can link this issue to your future PR, I'll be happy to take a look.
