Describe the bug The debug_assert in the progress entry module ca

debug_assert causes leader to panic (openraft, closed)

pepinns commented on July 2, 2024

debug_assert causes leader to panic

from openraft.

Comments (7)

pepinns commented on July 2, 2024

Thanks for clarifying, I was overlooking the fact that a log is committed before it's replicated everywhere.

drmingdrmer commented on July 2, 2024

Thank you for your detailed report. Under the current design, this should not
be considered a bug.

Because the removed and then restarted node has unexpectedly changed its data,
the leader cannot guarantee that it handles this condition correctly.

In a distributed system like Openraft, a node must guarantee that its persisted
data won't change. Otherwise it is a severe bug, and the entire system should be
shut down at once to prevent further data damage.

However, we understand that in certain scenarios, like testing or
troubleshooting, you may need to wipe out all data and restart the node.

To address your specific use case, I propose adding a feature flag that relaxes
the replication progress checks. This would allow the leader node to continue data
replication to the follower, even in scenarios where all data has been wiped out
and the node restarted.

Please let me know if you have further questions.

pepinns commented on July 2, 2024

I guess I'm thinking of this slightly differently, in my case all the nodes are known in advance and configured as a cluster. These clusters are then deployed as an isolated unit of 5 machines.

I was thinking of production scenarios, where a node has to be replaced. In production the nodes could be VMs which can die at any time. Being able to bring a new one online to replace a failed one would make this easier to operate, and does seem to work if we don't panic in the leader.
With the panic (if it is handled and a shutdown is initiated), the rest of the cluster is fine, but the leader ends up shutting down abruptly and restarting.

Is there something I'm not considering here that makes this operational plan invalid or dangerous?

pepinns commented on July 2, 2024

The feature flag seems like a good plan though. I must admit I didn't find a good place in the codepath to check for this case.
Do you think the solution is on the Follower side codepath? Or in the leader's handling of the Conflict Message?

drmingdrmer commented on July 2, 2024

@pepinns
Yes, it is dangerous: replacing a node with an empty one may cause data loss.
Assume the leader is N1 and the followers are N2, N3, N4, N5.

A log x that N1 has replicated to N2 and N3 is considered committed, because it is stored on 3 of the 5 nodes.
At this point, if N3 is replaced with an empty node and the leader N1 crashes right afterwards, N5 may be elected as the new leader with votes granted by N3 and N4.
The new leader N5 will then not have log x.
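The scenario above can be replayed with a toy model. This is only a sketch under simplifying assumptions: each node is reduced to the index of its last log entry, a vote is granted when the candidate's log is at least as long as the voter's, and terms are ignored. The names `Node`, `grants_vote`, `votes_for` and `n5_wins_election` are made up for this illustration and are not part of Openraft.

```rust
/// Toy model: a node is reduced to the index of its last log entry (0 = empty).
#[derive(Clone, Copy)]
struct Node {
    last_log_index: u64,
}

/// A voter grants its vote only if the candidate's log is at least as
/// up to date as its own (terms are ignored in this simplified model).
fn grants_vote(voter: &Node, candidate: &Node) -> bool {
    candidate.last_log_index >= voter.last_log_index
}

/// Count the votes the candidate gets from the reachable nodes.
/// The candidate always votes for itself.
fn votes_for(candidate: usize, nodes: &[Node], reachable: &[usize]) -> usize {
    reachable
        .iter()
        .filter(|&&i| i == candidate || grants_vote(&nodes[i], &nodes[candidate]))
        .count()
}

/// Replays the scenario: log x (index 1) is on N1, N2, N3, then N1 crashes
/// and N5 asks N3, N4 and itself for votes. Returns true if N5 wins.
fn n5_wins_election(wipe_n3: bool) -> bool {
    let x = 1;
    let mut nodes = [
        Node { last_log_index: x }, // N1, the leader
        Node { last_log_index: x }, // N2
        Node { last_log_index: x }, // N3
        Node { last_log_index: 0 }, // N4
        Node { last_log_index: 0 }, // N5
    ];
    if wipe_n3 {
        // N3 is replaced with an empty node.
        nodes[2] = Node { last_log_index: 0 };
    }
    let reachable = [2, 3, 4]; // N3, N4, N5 (N1 has crashed)
    let majority = nodes.len() / 2 + 1;
    votes_for(4, &nodes, &reachable) >= majority
}

fn main() {
    println!("N5 elected with N3 intact: {}", n5_wins_election(false));
    println!("N5 elected with N3 wiped:  {}", n5_wins_election(true));
}
```

With N3 intact, N3 refuses to vote for N5 because N5 lacks log x, so N5 cannot reach a majority. Once N3 is wiped, nothing stops N5 from winning, and the committed log x is gone from the new leader.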

The standard way is to call change_membership() to remove the crashed node, then start a new empty node, and finally call change_membership() again to add the new empty node back to the cluster.

Another, non-standard and tricky, way is to set up a cluster of 6 nodes. The quorum is then 4 nodes, so you can replace one node with an empty one without losing data.
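The arithmetic behind the 6-node trick can be checked with a small sketch. It assumes plain majority quorums, and the helper names `quorum`, `min_overlap` and `tolerates_one_wiped_node` are invented for this illustration. Any two quorums of size q in a cluster of n nodes share at least 2q - n nodes, so a committed log survives one wiped node only if that overlap is at least 2.

```rust
/// Majority quorum size for a cluster of `n` nodes.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Minimum number of nodes shared by any two majority quorums.
fn min_overlap(n: usize) -> usize {
    2 * quorum(n) - n
}

/// A committed log survives the wipe of one node if every election quorum
/// still contains at least one other node that stored the log.
fn tolerates_one_wiped_node(n: usize) -> bool {
    min_overlap(n) >= 2
}

fn main() {
    for n in [5, 6] {
        println!(
            "n = {}: quorum = {}, min overlap = {}, tolerates one wiped node = {}",
            n,
            quorum(n),
            min_overlap(n),
            tolerates_one_wiped_node(n)
        );
    }
}
```

For 5 nodes the overlap is a single node, so wiping that node can lose a committed log. For 6 nodes the overlap is 2, which leaves one copy in every election quorum.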

I'll open a PR to show you where to add the patch to address the panicking issue.

drmingdrmer commented on July 2, 2024

@pepinns
The log reversion issue is best addressed at the point where the conflicting event is reported to the progress:

#903
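To make the idea concrete, here is a rough sketch of what relaxing the check at that point could look like. It is not the actual patch and not Openraft's real types: `ProgressEntry`, its fields, `update_conflicting` and the `allow_log_reversion` parameter are all hypothetical stand-ins, log indexes start at 1 with 0 meaning "nothing matched", and the strict mode returns an error here where the real code uses a debug_assert.

```rust
/// Hypothetical, simplified stand-in for the leader's view of one follower.
#[derive(Debug, PartialEq)]
struct ProgressEntry {
    /// Highest log index believed to be replicated on the follower (0 = none).
    matching: u64,
    /// Exclusive upper bound of the range still searched for the first mismatch.
    searching_end: u64,
}

impl ProgressEntry {
    /// Handle a conflict reported by the follower at log index `conflict`.
    /// `allow_log_reversion` plays the role of the proposed feature flag.
    fn update_conflicting(
        &mut self,
        conflict: u64,
        allow_log_reversion: bool,
    ) -> Result<(), String> {
        // A conflict at or below `matching` means the follower has lost logs
        // that the leader believed were already replicated.
        if conflict <= self.matching {
            if !allow_log_reversion {
                return Err(format!(
                    "follower log reverted: conflict {} <= matching {}",
                    conflict, self.matching
                ));
            }
            // Relaxed mode: forget the previous belief and probe from scratch.
            self.matching = 0;
        }
        self.searching_end = conflict;
        Ok(())
    }
}

fn main() {
    let mut progress = ProgressEntry { matching: 5, searching_end: 6 };

    // Strict mode rejects the reversion and leaves the progress untouched.
    println!("{:?}", progress.update_conflicting(1, false));

    // Relaxed mode resets the progress so replication can start over.
    println!("{:?}", progress.update_conflicting(1, true));
    println!("{:?}", progress);
}
```

The point of the sketch is only the placement of the check: the leader decides how to react at the moment the conflict is reported, and the follower side needs no change.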
