Comments (2)

kuujo commented on May 24, 2024

This will not happen. When server 2 starts and rejoins the cluster, the leader will indeed see that it's already a member. But because server 2 is a member, the leader will still send periodic AppendRequests to it. Even if no commands are committed, leaders must still send an empty AppendRequest to maintain their leadership, and by default this happens every few hundred milliseconds. So, after server 2 rejoins, it will set an election timeout. Within that election timeout, the leader will send an AppendRequest and server 2 will learn about the leader.
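To make the timing argument concrete, here is a minimal sketch in plain Java. It is a toy model, not Copycat code: the class name, the method names, and the numbers (a 250 ms heartbeat interval and a 1000 ms election timeout) are all assumptions chosen for illustration, and it assumes the leader sends heartbeats at fixed multiples of the interval.

```java
/**
 * Toy timing model (hypothetical, not part of Copycat): does a restarted
 * follower hear a heartbeat before its election timeout expires?
 */
public class HeartbeatSketch {

    /**
     * Time of the first heartbeat at or after the follower rejoins,
     * assuming heartbeats are sent at multiples of the interval.
     */
    static long nextHeartbeatAt(long rejoinedAtMillis, long heartbeatIntervalMillis) {
        long remainder = rejoinedAtMillis % heartbeatIntervalMillis;
        if (remainder == 0) {
            return rejoinedAtMillis;
        }
        return rejoinedAtMillis + (heartbeatIntervalMillis - remainder);
    }

    /**
     * True if the next heartbeat arrives strictly before the election
     * timeout that the follower started when it rejoined.
     */
    static boolean learnsLeaderBeforeTimeout(long rejoinedAtMillis,
                                             long heartbeatIntervalMillis,
                                             long electionTimeoutMillis) {
        long wait = nextHeartbeatAt(rejoinedAtMillis, heartbeatIntervalMillis) - rejoinedAtMillis;
        return wait < electionTimeoutMillis;
    }

    public static void main(String[] args) {
        long rejoinedAt = 620;       // illustrative value
        long heartbeatInterval = 250; // illustrative value, "a few hundred milliseconds"
        long electionTimeout = 1000;  // illustrative value
        System.out.println("next heartbeat at: "
                + nextHeartbeatAt(rejoinedAt, heartbeatInterval));
        System.out.println("learns leader in time: "
                + learnsLeaderBeforeTimeout(rejoinedAt, heartbeatInterval, electionTimeout));
    }
}
```

As long as the heartbeat interval is shorter than the election timeout, the follower always hears from the leader in time, regardless of how much command traffic the cluster has.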

We could also add the leader to JoinResponse so restarting servers learn about the leader more quickly.
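A rough sketch of what that could look like, in plain Java. Everything here is hypothetical: `JoinResponseWithLeader` and `RejoiningServer` are made-up stand-ins rather than Copycat's actual `JoinResponse` or server state classes, and the address string is a placeholder.

```java
import java.util.Optional;

/** Hypothetical sketch: a join response that carries a leader hint. */
public class JoinResponseSketch {

    /** Made-up response shape; not Copycat's actual JoinResponse. */
    static final class JoinResponseWithLeader {
        private final boolean ok;
        private final String leaderAddress; // null if the responder knows no leader

        JoinResponseWithLeader(boolean ok, String leaderAddress) {
            this.ok = ok;
            this.leaderAddress = leaderAddress;
        }

        boolean ok() {
            return ok;
        }

        Optional<String> leader() {
            return Optional.ofNullable(leaderAddress);
        }
    }

    /** Made-up model of a server that restarted with no in-memory state. */
    static final class RejoiningServer {
        private String knownLeader; // null until learned

        /** Learns the leader immediately if a successful response includes it. */
        void onJoinResponse(JoinResponseWithLeader response) {
            if (response.ok() && response.leader().isPresent()) {
                knownLeader = response.leader().get();
            }
        }

        /** Fallback path: the leader is still learned from the next AppendRequest. */
        void onAppendRequest(String leaderAddress) {
            knownLeader = leaderAddress;
        }

        Optional<String> leader() {
            return Optional.ofNullable(knownLeader);
        }
    }

    public static void main(String[] args) {
        RejoiningServer server = new RejoiningServer();
        server.onJoinResponse(new JoinResponseWithLeader(true, "node1:5000"));
        System.out.println("leader known after join: " + server.leader().isPresent());
    }
}
```

The heartbeat path stays in place as a fallback, so a response without a leader hint behaves the same as today.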

BTW as for testing, this is all tested very thoroughly in real networks with Jepsen, which arbitrarily partitions and crashes servers while submitting commands to the cluster. Jepsen tests force a number of different partition and failure scenarios that ultimately force many leader changes, require servers to converge on the correct leader, and then validate that the resulting state is linearizable.

On Dec 1, 2015, at 2:32 AM, Bastian Glöckle wrote:

Assume a cluster with 2 nodes. Node 1 is the leader and node 2 is a follower or is passive. Now, node 2 is shut down (= kill -9), but restarted quickly. This means that node 2 lost all of its main memory state and does not know about the leader of the cluster anymore. The CopycatServer will then be #open()ed again and it will try to re-join the cluster: it will send a JoinRequest to node 1, receive a JoinResponse, and then directly transition to passive or follower again (in ServerState#join(Iterator, CompletableFuture)). The problem now is that CopycatServer#open() on node 2 decides to wait until it receives information on who the leader is, and only then completes its CompletableFuture. But it won't receive the leader's address. This is set (if I see this correctly) in *State#append (LeaderState, ActiveState, PassiveState), but that method won't be called, as node 1 (the leader) decides not to send a new ConfigurationEntry, because it thinks that the node already belongs to the cluster (see LeaderState#join(JoinRequest), the if where the cluster's members are checked).

Am I now using the CopycatServer wrong somehow?
AFAIK there cannot be a "timeout" after which a node is assumed to have left the cluster without sending a LeaveRequest (as that would effectively allow split-brains). Therefore the above problem might occur not only if a node is restarted quickly, but also if it is restarted after some time, e.g. after it has been killed (kill -9, a network partition where the Ops people decided to restart the servers, power loss on the machines, ...).
Are there any integration tests that test such behavior? This should be fairly simple to test: Start a cluster, kill one node that is not the leader, restart the node and check if it connected to the cluster successfully again.

I can see that this would not be that big a problem in a cluster where there's a lot of activity (= a lot of AppendRequests issued), as the node will reconnect on the first AppendRequest being sent. But there might be clusters where there's not that much activity, and where the application (which uses copycat) wants to wait for the local copycat server to be initialized and then send some Commands to the cluster...

Perhaps the leader should append a NoOpEntry?



bgloeckle commented on May 24, 2024

Ah, thanks for the explanation. I'm very sure that I saw this in my local diqube setup here yesterday: Node 2 did not learn about the leader.
But to be honest, I'm not 100% sure about the keep-alives anymore and what Duration I had set for those. And after I debugged and found that source location, I was pretty sure I had found a bug. Sorry, my mistake :/
