hotstuff's Issues

Verify TC when handling TC

In handle_tc, the validity of the TC is not checked:

async fn handle_tc(&mut self, tc: TC) -> ConsensusResult<()> {
    if tc.round < self.round {
        return Ok(());
    }
    self.advance_round(tc.round).await;
    if self.name == self.leader_elector.get_leader(self.round) {
        self.generate_proposal(Some(tc)).await;
    }
    Ok(())
}

If I am not mistaken, any malicious node could send an invalid TC and make correct nodes move to the next round.
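
A minimal sketch of a fix, assuming TC exposes a verify method (like the vote and QC types) that checks for 2f+1 valid timeout signatures from distinct committee members:

async fn handle_tc(&mut self, tc: TC) -> ConsensusResult<()> {
    if tc.round < self.round {
        return Ok(());
    }

    // Hypothetical check: ensure the TC carries 2f+1 valid timeout
    // signatures from distinct committee members before acting on it.
    tc.verify(&self.committee)?;

    self.advance_round(tc.round).await;
    if self.name == self.leader_elector.get_leader(self.round) {
        self.generate_proposal(Some(tc)).await;
    }
    Ok(())
}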

Limit number of payloads per node

A bad node may make us store a lot of crap. There is currently no limit on how many payloads a node can send us, and we will store them all.
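
One possible mitigation, sketched below under the assumption that each payload carries its author's PublicKey (from the crypto crate); the struct, names, and bound are all hypothetical:

use std::collections::HashMap;

/// Hypothetical per-node cap on stored (uncommitted) payloads.
const MAX_PENDING_PAYLOADS_PER_NODE: usize = 1_000;

struct PayloadAccounting {
    pending: HashMap<PublicKey, usize>, // payloads stored per author
}

impl PayloadAccounting {
    /// Returns true if we may store one more payload from `author`.
    fn try_reserve(&mut self, author: PublicKey) -> bool {
        let count = self.pending.entry(author).or_insert(0);
        if *count >= MAX_PENDING_PAYLOADS_PER_NODE {
            return false; // drop the payload instead of storing it
        }
        *count += 1;
        true
    }

    /// Called when a payload from `author` is committed or cleaned up.
    fn release(&mut self, author: &PublicKey) {
        if let Some(count) = self.pending.get_mut(author) {
            *count = count.saturating_sub(1);
        }
    }
}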

Proposer proposes more than one block after TC

I noticed that generate_proposal sometimes gets called multiple times for a given round. I think this is caused by the fact that every node can broadcast a TC: if N nodes broadcast a TC, it will be handled N times, and generate_proposal will be called N times as well.
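
A hedged sketch of one fix: remember the last round we proposed in (a hypothetical last_proposed_round field on the core) so that N incoming TCs for the same round trigger at most one proposal:

async fn handle_tc(&mut self, tc: TC) -> ConsensusResult<()> {
    if tc.round < self.round {
        return Ok(());
    }
    self.advance_round(tc.round).await;
    if self.name == self.leader_elector.get_leader(self.round)
        // Hypothetical guard: only propose once per round.
        && self.last_proposed_round < self.round
    {
        self.last_proposed_round = self.round;
        self.generate_proposal(Some(tc)).await;
    }
    Ok(())
}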

Implement epochs

The committee contains an epoch number, but it is never used to reject messages from wrong epochs.
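
A minimal sketch of the missing check, assuming the crate's ensure! macro and a hypothetical WrongEpoch error variant; EpochNumber is also an assumed type alias:

// Hypothetical guard at message ingress: drop anything from another epoch.
fn check_epoch(&self, message_epoch: EpochNumber) -> ConsensusResult<()> {
    ensure!(
        message_epoch == self.committee.epoch,
        ConsensusError::WrongEpoch(message_epoch)
    );
    Ok(())
}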

A message transfer between two honest nodes can be canceled

In two places in the code, we call ReliableSender::broadcast() and wait for 2f+1 answers.

// Wait for the first 2f nodes to send back an Ack. Then we consider the batch
// delivered and we send its digest to the consensus (that will include it into
// the dag). This should reduce the amount of synching.
let mut total_stake = self.stake;
while let Some(stake) = wait_for_quorum.next().await {
    total_stake += stake;
    if total_stake >= self.committee.quorum_threshold() {
        self.tx_batch
            .send(batch)
            .await
            .expect("Failed to deliver batch");
        break;
    }
}

let handles = self
    .network
    .broadcast(addresses, Bytes::from(message))
    .await;
// Send our block to the core for processing.
self.tx_loopback
    .send(block)
    .await
    .expect("Failed to send block");
// Control system: Wait for 2f+1 nodes to acknowledge our block before continuing.
let mut wait_for_quorum: FuturesUnordered<_> = names
    .into_iter()
    .zip(handles.into_iter())
    .map(|(name, handler)| {
        let stake = self.committee.stake(&name);
        Self::waiter(handler, stake)
    })
    .collect();
let mut total_stake = self.committee.stake(&self.name);
while let Some(stake) = wait_for_quorum.next().await {
    total_stake += stake;
    if total_stake >= self.committee.quorum_threshold() {
        break;
    }
}

This function returns handlers that cancel the transfer when they are dropped. Since we stop waiting after 2f+1 answers and drop the remaining handlers, it is possible that up to f honest nodes never receive the message.

I suggest adding the following line at the end of these two code blocks so the handlers do not get dropped:

tokio::spawn(async move { while let Some(_) = wait_for_quorum.next().await {} });

I ran two local benchmarks of the current code base, before and after adding this line, and obtained 3x more "Committed ->" log entries with it (1,500 vs 4,600).

Implement shared randomness

What's the best approach to implement shared randomness? It will be used to elect the leader for async fallback, and chained-VABA.
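
One common approach is a threshold-signature beacon: every node threshold-signs the round number, and the unique aggregated signature is hashed into a seed that no single node can predict or bias. A minimal sketch of the leader-election side, with the beacon itself (threshold signing and aggregation) assumed and not shown:

use std::convert::TryInto;

/// Derive a leader index from beacon output, e.g. the hash of an
/// aggregated threshold signature over the round number. The beacon
/// generation itself is assumed and not shown here.
fn elect_leader(beacon: &[u8; 32], committee_size: usize) -> usize {
    // Interpret the first 8 bytes of the beacon as a little-endian integer.
    let seed = u64::from_le_bytes(beacon[..8].try_into().unwrap());
    (seed % committee_size as u64) as usize
}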

Separate loopback channels from net channel

The consensus core and the mempool core both use the same channel for loopback messages and network messages. This is a problem: a bad node may format its messages as "loopback" to bypass some checks.
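
A hedged sketch of the separation, with ConsensusMessage and Block assumed from the codebase: the core gets two distinct mpsc receivers, and only messages arriving on the network channel go through the full checks.

use tokio::sync::mpsc::Receiver;

// Sketch of a core loop with distinct channels. Network messages get
// the full checks; loopback messages come only from our own components.
async fn run(
    mut rx_network: Receiver<ConsensusMessage>,
    mut rx_loopback: Receiver<Block>,
) {
    loop {
        tokio::select! {
            Some(_message) = rx_network.recv() => {
                // Untrusted: verify signature, epoch, and round before handling.
            }
            Some(_block) = rx_loopback.recv() => {
                // Trusted: produced locally, skip the signature checks.
            }
        }
    }
}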

Questions about environment

What versions of Python and Fabric are you using? When I run fab local in a new tmux session, it crashes and the session is killed. Thanks.

Write a Wiki

Write a wiki to help users get started and to explain how the code is structured.

Protect Votes aggregator and Mempool's synchronizer from DoS

A bad node may make us run out of memory by sending many votes with different round numbers (as long as they are bigger than our current round) or with different digests. We will store all of them in memory and clean them up only upon moving to the next round.

A similar issue appears in the mempool's synchronizer. A bad node may send us many different blocks for the same round (i.e., blocks with different payloads), and we will try to synchronize the block data with other nodes.
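
A hedged sketch of one mitigation for the votes aggregator, assuming Vote exposes round and author fields and that PublicKey is Copy: buffer votes only within a small window above our round, and accept at most one vote per (round, author) pair (the bound is hypothetical):

use std::collections::{HashMap, HashSet};

/// Hypothetical bound: ignore votes too far in the future.
const MAX_ROUND_LOOKAHEAD: u64 = 10;

struct BoundedAggregator {
    current_round: u64,
    // At most one accepted vote per (round, author) pair.
    seen: HashSet<(u64, PublicKey)>,
    votes: HashMap<u64, Vec<Vote>>,
}

impl BoundedAggregator {
    fn add_vote(&mut self, vote: Vote) {
        let round = vote.round;
        // Drop stale votes and votes unreasonably far ahead.
        if round < self.current_round
            || round > self.current_round + MAX_ROUND_LOOKAHEAD
        {
            return;
        }
        // Drop duplicate votes from the same author for the same round.
        if !self.seen.insert((round, vote.author)) {
            return;
        }
        self.votes.entry(round).or_default().push(vote);
    }
}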

Restart and Synchronize Issue

Hi, we have been using this library in a consensus scenario, but there seem to be some issues with restarting a node.

In our scenario, we run 4 nodes for consensus.
Then we stop one of them for approximately 0.5 to 1 hour.
Then we restart the node.

Afterwards, the node spends a long time synchronizing blocks and then gets stuck. Moreover, it eventually drags down the other three nodes, and the whole system hangs.

What could be the problem, and do you have a solution for this case?

Thanks!

Result of benchmark

We tested a range of input rates under the tx_size, max_payload_size, and faults settings below (branch 3-chain), but many of the results were 0, and we don't know why this is happening.

+ CONFIG:
 Committee size: 100 nodes
 Input rate: 2,010 tx/s
 Transaction size: 1,000 B
 Faults: 33 nodes
 Execution time: 0 s

 Consensus timeout delay: 5,000 ms
 Consensus sync retry delay: 100,000 ms
 Consensus max payloads size: 1,000 B
 Consensus min block delay: 100 ms
 Mempool queue capacity: 1,200,000 B
 Mempool sync retry delay: 100,000 ms
 Mempool max payloads size: 256,000 B
 Mempool min block delay: 500 ms

 + RESULTS:
 Consensus TPS: 0 tx/s
 Consensus BPS: 0 B/s
 Consensus latency: 0 ms

 End-to-end TPS: 0 tx/s
 End-to-end BPS: 0 B/s
 End-to-end latency: 0 ms

But once in a while, a non-zero throughput is displayed, as below. The same happens in the 4, 10, 40, and 100 node tests.

+ CONFIG:
 Committee size: 100 nodes
 Input rate: 3,350 tx/s
 Transaction size: 1,000 B
 Faults: 33 nodes
 Execution time: 21 s

 Consensus timeout delay: 5,000 ms
 Consensus sync retry delay: 100,000 ms
 Consensus max payloads size: 1,000 B
 Consensus min block delay: 100 ms
 Mempool queue capacity: 1,200,000 B
 Mempool sync retry delay: 100,000 ms
 Mempool max payloads size: 256,000 B
 Mempool min block delay: 500 ms

 + RESULTS:
 Consensus TPS: 836 tx/s
 Consensus BPS: 836,069 B/s
 Consensus latency: 9,582 ms

 End-to-end TPS: 622 tx/s
 End-to-end BPS: 622,008 B/s
 End-to-end latency: 13,812 ms

Implement mock storage

All unit tests that require the store currently create an instance of RocksDB. We therefore have to be careful to initialise each of these instances with a different storage path, or 'cargo test' cannot run tests in parallel.

It would be better to write a simple mock storage (an in-memory store) with the same interface as the current store that could be used for testing.
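
A minimal sketch of such a mock, assuming the store's interface boils down to async key-value reads and writes (the real Store wraps RocksDB behind a similar API):

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

/// In-memory stand-in for the RocksDB-backed store, for unit tests.
/// No storage path is needed, so tests can run in parallel.
#[derive(Clone, Default)]
pub struct MockStore {
    data: Arc<Mutex<HashMap<Vec<u8>, Vec<u8>>>>,
}

impl MockStore {
    pub async fn write(&self, key: Vec<u8>, value: Vec<u8>) {
        self.data.lock().await.insert(key, value);
    }

    pub async fn read(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.data.lock().await.get(key).cloned()
    }
}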

How do you implement bcast?

Hey, I just wonder whether you implement it simply by:

// consensus/src/synchronizer.rs
let message = NetMessage(Bytes::from(bytes), addresses);
network_channel.send(message).await;

If so, how can a simple send() achieve a broadcast? I found in the docs that the single parameter is just handled as one message.
If not, then how does the protocol achieve it?

Looking forward to your kind reply 😃

Mention dependency of librocksdb-sys on clang in README.md?

Since README.md provides quick start instructions, perhaps mention the dependency of librocksdb-sys on clang there.

During the build, at

Compiling librocksdb-sys v6.11.4

I got the following errors:

  --- stdout
  cargo:warning=couldn't execute `llvm-config --prefix` (error: No such file or directory (os error 2))
  cargo:warning=set the LLVM_CONFIG_PATH environment variable to the full path to a valid `llvm-config` executable (including the executable itself)

  --- stderr
  thread 'main' panicked at 'Unable to find libclang: "couldn\'t find any valid shared libraries matching: [\'libclang.so\', \'libclang-*.so\', \'libclang.so.*\', \'libclang-*.so.*\'], set the `LIBCLANG_PATH` environment variable to a path where one of these files can be found (invalid: [])"'

On Arch Linux, installing extra/clang (version 11.1.0-1) resolved the issue.

Unify the mempool driver and the synchroniser

When the consensus core receives a new block, it first checks whether its mempool has the associated payload. If it doesn't, the mempool driver keeps the block and re-schedules its processing once the mempool manages to get the payload from another node.

However, the synchroniser has no idea of this, so it can happen that the synchroniser tries to sync blocks that we already hold in the mempool driver.
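
A hedged sketch of one way to unify them: share a set of digests the mempool driver is already waiting on, and have the synchroniser consult it before requesting a block from peers (the names are hypothetical; Digest is assumed from the crypto crate):

use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::Mutex;

/// Hypothetical shared view of the blocks the mempool driver is already
/// waiting on, so the synchroniser can skip redundant sync requests.
type PendingBlocks = Arc<Mutex<HashSet<Digest>>>;

async fn should_sync(pending: &PendingBlocks, digest: &Digest) -> bool {
    // Only request a block from peers if the mempool driver is not
    // already fetching its payload.
    !pending.lock().await.contains(digest)
}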

Sharing small payloads with other nodes in a timely manner

I am using this implementation as an "SMR module" in a prototype where the effective transaction rate is quite low. In this case, transactions "get stuck" in the mempool (more specifically, in the Runner of the PayloadMaker): since the number of transactions pending at a node is low, https://github.com/asonnino/hotstuff/blob/main/mempool/src/payload.rs#L49 never makes a payload, so the only time a payload is made (and the few pending transactions are shared with other nodes) is when the node becomes leader. With many nodes that might take a while, so the transactions experience quite some latency.

If instead even small payloads were shared with other nodes timely, then other leaders could propose these transactions, leading to better latency (albeit probably more overhead in communicating smaller payloads).

As a workaround, since in my use case throughput is really low, it is quick to hack https://github.com/asonnino/hotstuff/blob/main/mempool/src/payload.rs#L47 so that it makes and shares a new payload for every incoming transaction.

For a more universal solution, one might want a new mempool config parameter controlling a timeout, to get a behavior like "keep adding transactions to the payload in the making; make and share the payload either once it is big enough or once the timeout for the oldest pending transaction has expired". Let me know your thoughts; I'd be happy to modify accordingly and submit a pull request.
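
A hedged sketch of that behaviour with a tokio timer, where max_payload_delay is the hypothetical new config parameter: seal and share the pending payload either once it is big enough or once the oldest pending transaction has waited long enough.

use std::time::Duration;
use tokio::sync::mpsc::Receiver;
use tokio::time::{sleep, Instant};

async fn run_payload_maker(
    mut rx_transaction: Receiver<Vec<u8>>,
    max_payload_size: usize,
    max_payload_delay: Duration, // hypothetical config parameter
) {
    let mut payload: Vec<Vec<u8>> = Vec::new();
    let mut size = 0;
    let timer = sleep(max_payload_delay);
    tokio::pin!(timer);

    loop {
        tokio::select! {
            Some(tx) = rx_transaction.recv() => {
                if payload.is_empty() {
                    // First pending transaction: start its deadline.
                    timer.as_mut().reset(Instant::now() + max_payload_delay);
                }
                size += tx.len();
                payload.push(tx);
                if size >= max_payload_size {
                    // Payload is big enough: seal and broadcast it (not shown).
                    payload.clear();
                    size = 0;
                }
            }
            () = &mut timer, if !payload.is_empty() => {
                // The oldest pending transaction waited long enough:
                // seal and broadcast even though the payload is small.
                payload.clear();
                size = 0;
            }
        }
    }
}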

Btw, great code base, thanks for sharing!

Separate Front and Mempool?

The mempool currently handles both incoming transactions (from clients) and incoming payloads (from other nodes). Should these two functions be separated and run in different tokio tasks?
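
A hedged sketch of what the split could look like, with hypothetical names: one tokio task per role, each fed by its own bounded channel, so a flood of payloads from other nodes cannot starve client transactions.

use tokio::sync::mpsc::{channel, Receiver, Sender};

async fn handle_client_transactions(mut rx: Receiver<Vec<u8>>) {
    while let Some(_transaction) = rx.recv().await {
        // Forward the transaction to the payload maker.
    }
}

async fn handle_node_payloads(mut rx: Receiver<Vec<u8>>) {
    while let Some(_payload) = rx.recv().await {
        // Verify, store, and acknowledge the payload.
    }
}

fn spawn_front_and_mempool() -> (Sender<Vec<u8>>, Sender<Vec<u8>>) {
    // One task (and one bounded channel) per role.
    let (tx_client, rx_client) = channel(1_000);
    let (tx_payload, rx_payload) = channel(1_000);
    tokio::spawn(handle_client_transactions(rx_client));
    tokio::spawn(handle_node_payloads(rx_payload));
    (tx_client, tx_payload)
}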

Generic network

The mempool front is almost a copy-paste of the network receiver; making it use the same network stack as the mempool and consensus should be a small change. Also take this opportunity to transfer the serialized payload to the cores of the mempool and consensus so that they can store it directly and avoid a deserialize-serialize round trip.

Accounting for sync replies

We currently reply to every sync request we receive, which costs us resources (particularly in the worker). We need some accounting to prevent bad nodes from monopolizing our resources.
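
A hedged sketch of such accounting: a token bucket per requesting peer, refilled over time, so any single node can trigger only a bounded number of sync replies (the names and bounds are hypothetical):

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-peer budget for sync replies.
const MAX_TOKENS: u32 = 100;
const REFILL_EVERY: Duration = Duration::from_secs(1);

struct SyncAccounting {
    budgets: HashMap<PublicKey, (u32, Instant)>, // (tokens, last refill)
}

impl SyncAccounting {
    /// Returns true if we should answer this peer's sync request.
    fn allow(&mut self, peer: PublicKey) -> bool {
        let now = Instant::now();
        let (tokens, last) = self.budgets.entry(peer).or_insert((MAX_TOKENS, now));
        if now.duration_since(*last) >= REFILL_EVERY {
            *tokens = MAX_TOKENS;
            *last = now;
        }
        if *tokens == 0 {
            return false; // the peer exceeded its budget: ignore the request
        }
        *tokens -= 1;
        true
    }
}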
