Comments (7)

pepinns commented on July 2, 2024

Thanks for clarifying, I was overlooking the fact that a log is committed before it's replicated everywhere.

from openraft.

drmingdrmer commented on July 2, 2024

Thank you for your detailed report. Under the current design, this should not
be considered a bug.

Because the node that was removed and then restarted has unexpectedly changed
its data, the leader cannot guarantee that it handles this condition correctly.

In a distributed system like Openraft, a node must guarantee that its persisted
data won't change. Otherwise it is a severe bug, and the entire system should be
shut down at once to prevent further data damage.

However, we understand that in certain scenarios, like testing or
troubleshooting, you may need to wipe out all data and restart the node.

To address your specific use case, I propose adding a feature flag that relaxes
the replication progress checks. This would allow the leader node to continue data
replication to the follower, even in scenarios where all data has been wiped out
and the node restarted.

Please let me know if you have further questions.


pepinns commented on July 2, 2024

I guess I'm thinking of this slightly differently. In my case, all the nodes are known in advance and configured as a cluster. These clusters are then deployed as an isolated unit of 5 machines.

I was thinking of production scenarios where a node has to be replaced. In production the nodes could be VMs, which can die at any time. Being able to bring a new one online to replace a failed one would make this easier to operate, and it does seem to work if we don't panic in the leader.
With the panic (if it is handled and a shutdown is initiated), the rest of the cluster is fine, but the leader ends up shutting down abruptly and restarting.

Is there something I'm not considering here that makes this operational plan invalid or dangerous?


pepinns commented on July 2, 2024

The feature flag seems like a good plan though. I must admit I didn't find a good place in the codepath to check for this case.
Do you think the solution is on the Follower side codepath? Or in the leader's handling of the Conflict Message?


drmingdrmer commented on July 2, 2024

@pepinns
Yes, it is dangerous: replacing a node with an empty one may cause data loss.
Assume the leader is N1 and the followers are N2, N3, N4, N5:

  • A log x that N1 has replicated to N2 and N3 is on a majority (3 of 5) and is therefore considered committed.
  • At this point, suppose N3 is replaced with an empty node and, immediately afterwards, the leader N1 crashes. N5 may then be elected as the new leader with votes granted by N3 and N4 (plus its own vote).
  • The new leader N5 will not have log x, so a committed log is lost.
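
The scenario above can be checked with a small counting model. This is only an illustrative sketch, not Openraft code: the functions `quorum` and `can_elect_leader_without_entry` are hypothetical names, node IDs 1 to 5 stand for N1 to N5, and the model assumes the usual Raft rule that a node holding log x will not vote for a candidate that lacks it.

```rust
use std::collections::BTreeSet;

/// Majority quorum size for a cluster of `n` voters (hypothetical helper).
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Returns true if the voters that do NOT hold the entry are numerous enough
/// to form a majority on their own, i.e. a candidate without the entry could
/// still win an election.
fn can_elect_leader_without_entry(voters: &BTreeSet<u64>, holders: &BTreeSet<u64>) -> bool {
    let without = voters.difference(holders).count();
    without >= quorum(voters.len())
}

fn main() {
    // N1..N5; N1 is the leader.
    let voters = BTreeSet::from([1u64, 2, 3, 4, 5]);

    // Log x is replicated by N1 to N2 and N3: 3 of 5 nodes hold it, so it is committed.
    let mut holders = BTreeSet::from([1u64, 2, 3]);
    assert!(holders.len() >= quorum(voters.len()));
    // While the persisted data is intact, no majority can form without x.
    assert!(!can_elect_leader_without_entry(&voters, &holders));

    // N3 is replaced with an empty node: it keeps its ID but loses x.
    holders.remove(&3);

    // Now N3, N4 and N5 all lack x and form a majority by themselves,
    // so N5 can be elected (for example right after N1 crashes).
    assert!(can_elect_leader_without_entry(&voters, &holders));
    println!("a leader without log x can be elected: data loss is possible");
}
```

The model only counts votes; it does not simulate terms or timeouts, which do not affect whether such a majority exists.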

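The standard way is to call change_membership() to remove the crashed node, then start a new empty node, and finally call change_membership() again to add the new empty node back to the cluster.

Another, non-standard and tricky, way is to set up a cluster of 6 nodes. The quorum is then 4 nodes, so you can replace one node with an empty one without losing data: a committed log is on at least 4 nodes, at least 3 still hold it after the replacement, and the remaining 3 nodes cannot form a quorum of 4 on their own.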
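
The quorum arithmetic behind the 6-node suggestion can be sketched as follows. This is a worst-case counting argument under stated assumptions, not Openraft code: `quorum` and `tolerates_one_wiped_node` are hypothetical helpers, the committed log is assumed to sit on exactly a quorum of nodes, and exactly one of those nodes is wiped.

```rust
/// Majority quorum size for a cluster of `n` voters (hypothetical helper).
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Worst case: a committed entry is stored on exactly `quorum(n)` nodes and
/// one of those nodes is replaced with an empty one. The entry is safe only
/// if the nodes lacking it cannot form a quorum by themselves.
fn tolerates_one_wiped_node(n: usize) -> bool {
    let holders_after_wipe = quorum(n) - 1;
    let without_entry = n - holders_after_wipe;
    without_entry < quorum(n)
}

fn main() {
    // 5 nodes: quorum 3, 2 holders remain, 3 nodes without the entry can elect a leader.
    assert_eq!(quorum(5), 3);
    assert!(!tolerates_one_wiped_node(5));

    // 6 nodes: quorum 4, 3 holders remain, the other 3 nodes fall short of a quorum.
    assert_eq!(quorum(6), 4);
    assert!(tolerates_one_wiped_node(6));

    for n in 3..=6 {
        println!(
            "n = {}, quorum = {}, tolerates one wiped node: {}",
            n,
            quorum(n),
            tolerates_one_wiped_node(n)
        );
    }
}
```

In this model the even-sized clusters pass the check because any two quorums overlap in at least two nodes, so wiping one node still leaves a holder in every possible quorum.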

I'll open a PR to show you where to add the patch to address the panicking issue.


drmingdrmer commented on July 2, 2024

@pepinns
The log reversion issue is best addressed at the point where the conflicting event is reported to the progress:
