Comments (7)
LogId is also used on a follower to determine if a log entry is identical to the follower's local log entry at the same index, when a follower receives append-entries request.
Sure, but considering that Raft doesn't allow to have two leaders for the same term
, the term
comparison must be a sufficient condition - no need for node_id
there.
May be I can introduce a feature to disable/enable to contain a node-id in LogId. 🤔
I wouldn't complicate the API and core algorithms. Either-or, I'd say.
We have plenty of optimization potential elsewhere (e.g., get rid of Vec<T>
and similar on the API, replacing them with impl IntoIterator<T>
or similar to prevent unnecessary memory allocation, where not required, etc.).
I did not know you have such a big node-id :( Is it possible using a u32 node-id and store the 16B real id in a Node?
Indeed, I'm thinking about doing exactly that. So far, we didn't use the Node
, but maybe it's a better way; simply store Raft node index in the node_id
and use Node
to store the actual node ID (we use 16B GUIDs; that's a long story). That would actually make some other operations simpler (restoring a consensus domain replica in a catastrophic scenario on a completely unrelated node w/o the need to "renumber" everything).
from openraft.
👋 Thanks for opening this issue!
Get help or engage by:
/help
: to print help messages./assignme
: to assign this issue to you.
from openraft.
I added a node_id
to LeaderId
because LeaderId(term, node_id)
reduces conflict during an election:
Given 3 nodes, if all of them start to elect at the same time, with the same term
, without node_id
, it is very likely there is no leader established in this term
:
a
voted for itself and sent vote requests tob
,c
;b
voted for itself and sent vote requests toa
,c
;c
voted for itself and sent vote requests toa
,b
;
Because they all have already voted for themselves, they just reject every received vote request from other nodes. This is the way the standard raft does
But with node_id
, finally there will be a leader elected in this term
. Because LeaderId(term=1, node_id=2) > LeaderId(term=1, node_id=1)
, the node with greatest node_id
will finally succeed.
In any case, removing the comparison for testing purposes doesn't have any effect on test outcome - all green.
It has to compare node_id
too here, because there might have been multiple leaders in the same term, e.g., LeaderId(term=3,node_id=a)
then a new LeaderId(term=3,node_id=b)
established. In this case, a log of LeaderId(term=3,node_id=a)
being replicated by LeaderId(term=3,node_id=b)
does not imply the log is committed.
It looks I have to add some tests for such a case:(
The only location in code where
LeaderId::node_id
is evaluated seems to be inEngine::update_progress()
, where it would likely suffice to compare the term (the same term cannot be used by two different leaders).
This is the only location where it is explicitly used. When handling a vote request, a node just compares votes with req.vote >= self.vote
, this is the most important usage of it.
from openraft.
Hm, I still don't get it. The standard Raft uses only term
, but no node_id
for its election and it works correctly. Raft guarantees by its election process that there cannot be two leaders elected for the same term
. So how come there are two leaders with the same term
in openraft
?
In the first case you described (all start voting for self at the exact time), there would be no leader elected for this term, which would trigger another election with a new term. The likelihood that even the second election again occurs at exact the same time on all nodes (and fails too) is virtually zero.
When I understand you correctly, you only store node_id
to speed up the election? If so, I don't think it makes sense to optimize the "bad" case (new election occurring once in a while) by penalizing the "good" case all the time (e.g., additional info to be logged; occurring all the time).
The vote handling compares log ID (i.e., term + log index). That should be sufficient, since the same term
cannot be used by two different nodes. I don't see how the node_id
should influence anything there.
from openraft.
Hm, I still don't get it. The standard Raft uses only
term
, but nonode_id
for its election and it works correctly. Raft guarantees by its election process that there cannot be two leaders elected for the sameterm
. So how come there are two leaders with the sameterm
inopenraft
?
In standard raft, a leader is not just identified by term
. A leader is identified by a tuple LeaderId: (term, voted_for)
, e.g., a node has voted for node-1 in term 3 has its LeaderId
stored as (term=3, voted_for=1)
.
The LeaderId
in standard raft is a partial-order value, because voted_for=1 !> voted_for=2
and voted_for=1 !< voted_for=2
. I had a slide explaining the conflict when candidate y
in term 5 tries sending vote request to another node in term 5:
And this partial-order LeaderId
allows only one leader to be granted in every term in the standard raft, term
, without voted_for
can be used as a leader identifier.
In openraft LeaderId(term, node_id)
is a total-order value, i.e., LeaderId(term=3, node_id=2) > LeaderId(term=3,node_id=1)
, which makes it possible to have multiple leaders in one term. But still, at any time, there is only one leader is granted by a quorum. E.g., LeaderId(term=3,node_id=1)
is granted by node-1 and node-2(3 nodes cluster), then LeaderId(term=3,node_id=2)
is granted by node-1 and node-2. Because LeaderId
is totally ordered, a node can grant LeaderId(term=3,node_id=2)
even when it has already granted LeaderId(term=3,node_id=1)
.
In the first case you described (all start voting for self at the exact time), there would be no leader elected for this term, which would trigger another election with a new term. The likelihood that even the second election again occurs at exact the same time on all nodes (and fails too) is virtually zero.
Right, the chance there is another round of conflict is rare. But still, it takes two rounds of election. The time gap between these two rounds will be considered service downtime(about 2 RTT).
When I understand you correctly, you only store
node_id
to speed up the election?
Generally speaking, yes:). And it simplifies the logic. Adding another field will save several lines of code that handle term
.
If so, I don't think it makes sense to optimize the "bad" case (new election occurring once in a while) by penalizing the "good" case all the time (e.g., additional info to be logged; occurring all the time).
Hm... to me, with or without a node_id
in a log entry, the storage or transport efficiency can be well optimized when needed, e.g., extracting term, node_id
and storing them separately. But the election conflict can not be solved without node_id
in the LeaderId
.
from openraft.
Wow, what a long explanation :-), thanks.
I was imprecise, sorry. Of course, (one) voted-for node ID needs to be stored and the node_id
is sent as part of the Vote
message.
So we can conclude that the node_id
storage is for speeding up the election. I can live with that, I just have to find a way how to optimize log storage that we don't have to store the additional node_id
information for each log entry unnecessarily. Maybe something like not storing the term
/node_id
per log entry at all and rather generate "switch" log entries when there is a change.
Alternatively, looking at the code again, it seems like Entry
could use a simpler LogId
only containing term
and index
, without the node_id
. The node_id
seems to be only relevant for replication reply, where it can be easily computed based on the last voted_for
for Matching
. For Conflict
, it can be ignored, since it's not used. Of course, the default implementation of RaftLogReader
would be more complex/impossible and quite a few changes would be needed elsewhere...
I'll probably go with the above mentioned storage optimization in the log ("switch" entries) or we'll store our 16B node IDs for each log entry anyway.
from openraft.
Alternatively, looking at the code again, it seems like
Entry
could use a simplerLogId
only containingterm
andindex
, without thenode_id
. Thenode_id
seems to be only relevant for replication reply, where it can be easily computed based on the lastvoted_for
forMatching
. ForConflict
, it can be ignored, since it's not used. Of course, the default implementation ofRaftLogReader
would be more complex/impossible and quite a few changes would be needed elsewhere...
LogId
is also used on a follower to determine if a log entry is identical to the follower's local log entry at the same index, when a follower receives append-entries request.
I have a LogIdList structure that contains every log id from which a new LeaderId
starts:
https://github.com/datafuselabs/openraft/blob/main/openraft/src/engine/log_id_list.rs
In fact, in my opinion, such a structure reflects more precisely the data structure in raft.
Logs should be a 2 dimension list entry = logs[leader_id][index]
.
May be I can introduce a feature to disable/enable to contain a node-id in LogId. 🤔
I'll probably go with the above mentioned storage optimization in the log ("switch" entries) or we'll store our 16B node IDs for each log entry anyway.
I did not know you have such a big node-id :(
Is it possible using a u32 node-id and store the 16B real id in a Node
?
Membership
stores every NodeId
and Node
for applications:
openraft/openraft/src/membership/membership.rs
Lines 73 to 87 in e31b0f9
from openraft.
Related Issues (20)
- Feature: Monoio Runtime HOT 3
- Add sync primitives to `AsyncRuntime` trait HOT 1
- Add `AsyncRuntime::oneshot` HOT 2
- Propose stream-based Replication of Log Entries HOT 2
- Dynamic Cluster Membership HOT 5
- Ditch `async_trait` in favor of the official `async_fn_in_trait` feature HOT 6
- Refactor: Move `client_resp_channels` from `LeaderData` to new `tx_...`-field in `RaftCore` HOT 2
- Linearizable read with `ReadIndex` HOT 1
- about single node HOT 6
- Cluster Bootstrap and Node Initialization Sequence HOT 4
- Tracking issue for examples update HOT 1
- Update example `raft-kv-memstore` to use `storage-v2` HOT 2
- Update example `raft-kv-rocksstore` to use `storage-v2` HOT 1
- Move examples into the workspace HOT 1
- Install Snapshot v1 api HOT 3
- Observe state changes in a Raft node HOT 6
- Split metrics into data metrics and server metrics HOT 5
- About automatic remove HOT 3
- Backing up the WAL HOT 6
- Raft Core Panicking HOT 10
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from openraft.