
Scala.bft (I played around a bit, so crappy code ahead :)

Note: I am working on a federated BFT approach which also utilises the idea of parallel PBFT that I am introducing here. Maybe I will find the time to write it down in a nice way and publish it in the near future.

This project aims to provide a prototype of the Byzantine Fault Tolerance protocol and, furthermore, to introduce a multi-leader approach to obtain parallelism and improved performance.

BFT And Its Parallelization

This page gives a rough overview of Byzantine Fault Tolerance (BFT) and of how the parallelization is done.

For a detailed description of the BFT protocol, read the paper Practical Byzantine Fault Tolerance.

Short Introduction

The number of online services used in daily life is growing rapidly, and so are the requirements that providers and customers place on them. Many of these systems hold critical data, e.g. user information or monetary transactions, which specifically demands:

  • higher robustness,
  • higher availability,
  • stronger consistency.

To address these demands, the following two approaches were developed:

  • state machines (SM):
    A service is represented as a state machine consisting of a state S and a distinct number of operations o_i to modify that state.
  • replication technology:
    A number of n replicas (mirrors) of the state machine are created, acting like a single service towards the client.

Thus, robustness and availability can be increased, as the service doesn't fail when one replica goes down. On the other hand, replication alone gives no guarantees on the consistency of the underlying data and introduces two types of possible failures into the system:

  • fail-stop failure:
    A replica may shut down and miss the latest changes to the state. When it comes back into production it runs in a different state than the rest of the system, violating consistency. Reasons for this may be network separations, resource management (e.g. a process doesn't get CPU time for a certain period), etc.
  • byzantine failure:
    An even worse scenario is a replica which continues to work but doesn't follow its specification. Reasons for this may be attacks on the system, programming errors (bugs) or hardware glitches.

Therefore, the replicated state machine service has to become fault tolerant and be able to agree on a consistent state. One general concept for handling fail-stop and byzantine failures is the Byzantine Fault Tolerance protocol.

BFT In A Nutshell

A BFT system consists of a single leader L and a number of follower machines F with |F| = n - 1. When a client C requests an operation o_i, the request r_i is sent to the leader, which starts the three consensus rounds shown in the following figure.

  1. Pre-Prepare: the request r_i is delivered from L to all followers F (n messages)
  2. Prepare: all replicas send a prepare message to all other machines and wait for 2f acknowledgments. Thereby, a total order can be guaranteed. (n^2 messages)
  3. Commit: again all replicas send a commit message to all other machines and wait for 2f + 1 acknowledgments. Now the system has agreed on the same request and is able to execute it. (n^2 messages)

Afterwards the replicas return the result of the operation execution to the requesting client C, which waits for f + 1 matching responses.
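
To make the quorum logic concrete, here is a minimal sketch of a replica's per-request consensus state in Scala; the class and method names are illustrative and not taken from the scala.bft code base:

  // Illustrative sketch only: counts PREPARE/COMMIT messages for one request and
  // fires the next step once the quorum described above is reached.
  final class RequestConsensus(f: Int) {
    private var prepares = 0
    private var commits  = 0

    def onPrepare(): Unit = {
      prepares += 1
      if (prepares == 2 * f) broadcastCommit()       // prepare quorum: 2f acks
    }

    def onCommit(): Unit = {
      commits += 1
      if (commits == 2 * f + 1) executeRequest()     // commit quorum: 2f + 1 acks
    }

    private def broadcastCommit(): Unit = ()  // placeholder: send COMMIT to all replicas
    private def executeRequest(): Unit  = ()  // placeholder: apply the operation o_i
  }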

As is easy to see, two main problems arise with the protocol:

  • resource consumption: to agree on one request, 2n^2 + n messages are sent within the system. Furthermore, 3f + 1 machines have to be provided to hold the consistency guarantees. Compared to f + 1 machines and n^2 + n (?) messages for a fail-stop protocol, this is a huge increase.
  • single leader bottleneck (performance): to obtain the total order, all requests have to pass through the single leader, making it the performance bottleneck. The leader L determines the overall system performance.
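
As a small worked example of the resource consumption, assuming the counts from the list above:

  // For f = 1 tolerated byzantine fault:
  val f        = 1
  val n        = 3 * f + 1      // 4 machines are required
  val messages = 2 * n * n + n  // pre-prepare (n) + prepare (n^2) + commit (n^2) = 36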

Multi-Leader Approach

There are a number of optimizations, e.g. avoiding the consensus steps for read-only requests, but none of them reduces the overall number of messages during agreement or the number of machines necessary to provide consistency. It is, however, possible to improve the performance by introducing multiple leaders and therefore parallelism into the protocol.

The basic assumption within BFT is that all requests depend on each other because the state objects they access are fully dependent, thus making a total order and a single leader necessary. But this doesn't have to be true, especially in real-world applications. There, state objects and therefore requests might be grouped, e.g. geographically or by access permissions. For example, some objects of the state may only be accessible with certain rights for certain clients, making them independent of changes done to other objects. Or they can be even fully independent, as in a distributed hash-map. If you take this information into account, it is possible to split up the global state S into multiple partitions, each with its own partial order.
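
A minimal sketch of this idea in Scala, assuming a key-based state (e.g. a distributed hash-map) where each partition keeps its own request order; all names are hypothetical:

  // Hypothetical model: the global state S is split into partitions, each holding
  // its own objects and its own (partial) order of incoming requests.
  final case class Request(id: Long, key: String, op: String)
  final case class Partition(id: Int, objects: Map[String, String], order: Vector[Request])

  final case class PartitionedState(partitions: Vector[Partition]) {
    // Route a request to the partition owning its key; independent partitions can
    // run their agreement in parallel.
    def enqueue(r: Request): PartitionedState = {
      val idx = math.abs(r.key.hashCode) % partitions.size
      val p   = partitions(idx)
      copy(partitions = partitions.updated(idx, p.copy(order = p.order :+ r)))
    }
  }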

Basic Architecture

The new parallelized BFT protocol consists of leaders L, followers F and partitions P with |L| = |F| = |P|. Every machine running it executes as many BFT instances as partitions exist, and in the current state of the protocol every machine holds exactly one leader. Furthermore, there is just one leader per BFT instance. A future improvement should make it possible to create machines with multiple leaders or no leader at all.
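
A small sketch of that assignment, assuming machines and partitions are simply numbered so that machine i leads partition i (hypothetical names, not the project's types):

  // Hypothetical sketch: with |L| = |F| = |P|, every machine runs one BFT instance
  // per partition and is the leader of exactly one of them.
  final case class Machine(id: Int)
  final case class BftInstance(partition: Int, leader: Machine)

  def instances(machines: Vector[Machine]): Vector[BftInstance] =
    machines.indices.map(p => BftInstance(partition = p, leader = machines(p))).toVector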

Request Types

With the new architecture, requests have to be split up into two types.

Simple Request

A simple request only accesses state objects from the same partition, thus avoiding synchronisation between partitions and BFT instances.

Cross-Border Request

A cross-border request accesses state objects across multiple partitions, thus forcing the related partitions to synchronize. Synchronisation means a request r_i has to be at the head position in the queues Q of all partitions which shall be accessed. Here a queue Q_j is just a buffer for incoming requests.

  • Cross Border Execution (CBE) request:
    This instance of request r_0 is located in the queue of the partition the machine is the leader of. The CBE request will be executed.
  • Cross Border Synchronization (CBS) request:
    This instance of request r_0 is located in all queues of partitions which are accessed but where the current machine is not the leader. They are only used to guarantee a deterministic execution of cross-border requests over all replicas.

Only when r_i is ready can it be used for consensus and execution. This way, consistency can be guaranteed. But it will happen that r_i is at the head of some Q_j at time t but not of another Q_k, which introduces active waiting until HEAD(Q_k) = r_i for all k.
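
A minimal sketch of that readiness check and the active waiting, assuming simple in-memory queues per partition (all names hypothetical):

  import scala.collection.immutable.Queue

  // Hypothetical sketch: a cross-border request is ready only when it is at the
  // head of the queues of all partitions it accesses.
  final case class Req(id: Long, partitions: Set[Int])

  def isReady(r: Req, queues: Map[Int, Queue[Req]]): Boolean =
    r.partitions.forall(p => queues.get(p).flatMap(_.headOption).contains(r))

  // Active waiting as described above: retry until HEAD(Q_k) = r_i for all accessed k.
  @annotation.tailrec
  def awaitReady(r: Req, currentQueues: () => Map[Int, Queue[Req]]): Unit =
    if (!isReady(r, currentQueues())) { Thread.sleep(10); awaitReady(r, currentQueues) }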

Deadlocks caused by Cross-Border Requests

Besides the introduced synchronization and the resulting loss of parallelism, cross-border requests have another disadvantage: they can lead to deadlocks. It is possible that multiple cross-border requests access the replicas at the same time, leading to the following situation:

In this scenario r_0 waits for r_1 to be executed and to release its CBS requests, and vice versa. Deadlock resolution strategies will be discussed later.

Machine Learning to predict the best Partitioning

As described above, it is possible to partition the state of the state machine and achieve parallelism within the BFT protocol. But this leads to a new problem: how to find the best partitioning? For some applications the answer may be easy, as none of their state objects are related to each other. They can then simply be distributed, aiming for:

  • an equal number of objects in every partition
  • equal access rates across the partitions
  • etc.

But for data models which aren't that simple, the 'best' partitioning has to be predicted from the underlying data and the application behaviour. A detailed description of a solution I developed, based on graph partitioning, will be given later.
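
For the simple case from the list above, a naive partitioner is enough; the following sketch just spreads independent objects round-robin to equalise the object count per partition (the graph-partitioning approach itself is not shown here):

  // Naive sketch for fully independent state objects: round-robin assignment
  // yields (almost) equal object counts per partition.
  def partitionObjects[A](objects: Seq[A], partitionCount: Int): Map[Int, Seq[A]] =
    objects.zipWithIndex
      .groupBy { case (_, i) => i % partitionCount }
      .map { case (p, withIndex) => p -> withIndex.map(_._1) }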


scala.bft's Issues

Implement three consensus rounds

See parameter description here.

As described in the overview the consensus consists of three rounds:

  • pre-prepare: ((PRE_PREPARE, v, n, D(r)), r)

    The leader sends the client's request r to all followers. When the request is delivered to all replicas, the consensus proceeds with the prepare phase.
  • prepare: (PREPARE, v, n, D(r), Id(R_i))

    All replicas send a PREPARE message to all other replicas, waiting for 2f acknowledgments. When the messages are received, the consensus proceeds with the commit phase.
  • commit: (COMMIT, v, n, D(r), Id(R_i))

    All replicas send a COMMIT message to all other replicas, waiting for 2f + 1 acknowledgements. When the messages are received, the request operation is executed. For this issue the execution code is only available as an API call.
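
A sketch of how these message types could be modelled in Scala, with field names following the notation above (v = view, n = sequence number, digest = D(r), replica = Id(R_i)); these are illustrative case classes, not the project's actual types:

  final case class ClientRequest(payload: Array[Byte])

  sealed trait ConsensusMessage
  final case class PrePrepare(v: Long, n: Long, digest: Vector[Byte], request: ClientRequest)
      extends ConsensusMessage
  final case class Prepare(v: Long, n: Long, digest: Vector[Byte], replica: Int)
      extends ConsensusMessage
  final case class Commit(v: Long, n: Long, digest: Vector[Byte], replica: Int)
      extends ConsensusMessage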

Apply functional programming to project structure

Use the Scala cats library to apply functional programming design and patterns to the current state (feature/authentication). This includes:

  • applying the interpreter pattern to encapsulate IO operations
  • using Actors for IO only
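
A very small sketch of the interpreter-pattern idea (without the cats machinery, just to show the shape): the IO operations become data, and an interpreter decides how they are executed. Everything here is hypothetical:

  // Hypothetical algebra of IO operations and one possible interpreter: the
  // operations are plain data, the interpreter decides how to execute them.
  sealed trait StorageOp
  final case class StoreMessage(key: String, bytes: Array[Byte]) extends StorageOp
  final case class DeleteMessage(key: String)                    extends StorageOp

  trait StorageInterpreter {
    def run(op: StorageOp): Unit
  }

  // Pure in-memory interpreter, e.g. for specs; an actor-backed interpreter
  // could implement the same trait for production IO.
  final class InMemoryInterpreter extends StorageInterpreter {
    private var store = Map.empty[String, Array[Byte]]
    def run(op: StorageOp): Unit = op match {
      case StoreMessage(k, b) => store += (k -> b)
      case DeleteMessage(k)   => store -= k
    }
  }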

Implement remote replica message transmission

Message transmission is separated into two types:

  • consensus message: These messages are lightweight and can be sent using Akka messages directly.
  • client request: These kinds of messages can be heavyweight and need to be chunked into smaller units (size configurable). Therefore the request:
    1. is represented as a byte array and partitioned into n units according to the chunk size,
    2. each unit gets an order number assigned, describing its position in the original array, plus the client id and client sequence number,
    3. the units are distributed over k sender actors and sent to the replicas. Multiple sender actors can serve one replica (configurable) to speed up transmission,
    4. the remote replicas collect all units and bring them into order using the order numbers, reconstructing the request array. Thereby, a unit is only accepted when its client id and sequence number are valid.
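
A minimal chunking sketch under those rules, assuming a configurable chunk size; the types are hypothetical and not the project's wire format:

  // Hypothetical sketch: split a request into ordered units and reassemble it.
  final case class RequestChunk(clientId: Long, seqNr: Long, order: Int, bytes: Array[Byte])

  def chunk(clientId: Long, seqNr: Long, request: Array[Byte], chunkSize: Int): Seq[RequestChunk] =
    request.grouped(chunkSize).zipWithIndex.map {
      case (bytes, order) => RequestChunk(clientId, seqNr, order, bytes)
    }.toSeq

  def reassemble(chunks: Seq[RequestChunk]): Array[Byte] =
    chunks.sortBy(_.order).flatMap(_.bytes).toArray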

Implement static remote replica discovery

This first implementation shall use seed replicas to integrate new instances into the running system. The protocol looks as follows:

1. there are k replicas R, called seeds, whose hosts and ports are defined in a config
2. when a new instance is added to the system it sends a join message to the seed nodes
3. the seeds send a cluster update message, including the new instance, to all other replicas

Note: All replicas could send the cluster update message so that the probability of replicas not receiving the message decreases (configurable).

The references to the remote replicas are stored as remote actor references.

However, all seed replicas could be faulty and return wrong results, so this doesn't work.

Instead:
Use a static solution with a predefined number of replicas and known hosts and ports.
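
A small sketch of such a static setup, with the replica list simply hard-coded (hosts and ports below are made up):

  // Hypothetical static configuration: all replicas and their endpoints are known upfront.
  final case class ReplicaEndpoint(id: Int, host: String, port: Int)

  val knownReplicas: Vector[ReplicaEndpoint] = Vector(
    ReplicaEndpoint(0, "replica-0.example", 4711),
    ReplicaEndpoint(1, "replica-1.example", 4711),
    ReplicaEndpoint(2, "replica-2.example", 4711),
    ReplicaEndpoint(3, "replica-3.example", 4711)
  )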

Await object ignores duration within Specs

From time to time the duration is ignored while waiting for a result in the specs, thus leading to failing tests/builds.

What happens: The Await.result instantaneously throws a TimeoutException.

19:04:50.243 [INFO ][a.event.slf4j.Slf4jLogger] Slf4jLogger started
19:04:50.243 [INFO ][c.g.p.s.b.c.LeaderConsensus] {1,0,[0]}.aborted
19:04:50.243 [DEBUG][c.g.p.s.b.c.PrepareRound] {1,0,[0]}.prepare.consensus.messages: 1
19:04:50.246 [DEBUG][c.g.p.s.b.c.CommitRound] {1,0,[0]}.commit.consensus.messages: 1
19:04:50.246 [DEBUG][c.g.p.s.b.c.CommitRound] {1,0,[0]}.commit.consensus.messages: 2
19:04:50.246 [INFO ][c.g.p.s.b.c.CommitRound] {1,0,[0]}.commit.consensus.reached
19:04:50.246 [INFO ][c.g.p.s.b.c.PrepareRound] {1,0,[0]}.prepare.consensus.reached
19:04:50.246 [INFO ][c.g.p.s.b.c.PrepareRound] {1,0,[0]}.prepare.start
19:04:50.246 [DEBUG][c.g.p.s.b.c.PrepareRound] {1,0,[0]}.prepare.start.consensus.reached
19:04:50.246 [INFO ][c.g.p.s.b.c.CommitRound] {1,0,[0]}.commit.start
19:04:50.246 [DEBUG][c.g.p.s.b.c.CommitRound] {1,0,[0]}.commit.start.consensus.reached

Message and Request authentication

Messages and requests need to be authenticated to hold the consistency guarantees. Thereby, authentication has to be separated into:

  • lightweight MACs (Message Authentication Codes) for requests and messages which aren't of the view-change type
  • heavyweight digital signatures for view-change messages

MAC Generation

prerequisite: every node (including clients) has a 16-byte session key which it shares with each replica

MACs are created as follows:

  1. compute an MD5 hash h_r from the client request r
  2. concatenate h_r with a session key and apply MD5 again to get h_rk
  3. use the 10 least significant bytes of h_rk
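
A sketch of those three steps with java.security.MessageDigest, assuming the shared 16-byte session key from the prerequisite (the helper name is hypothetical):

  import java.security.MessageDigest

  // Sketch of the MAC scheme above: MD5(r), concatenated with the session key,
  // hashed again with MD5, keeping only the 10 least significant bytes.
  def mac(request: Array[Byte], sessionKey: Array[Byte]): Array[Byte] = {
    def md5(bytes: Array[Byte]): Array[Byte] = MessageDigest.getInstance("MD5").digest(bytes)

    val hR  = md5(request)            // h_r
    val hRk = md5(hR ++ sessionKey)   // h_rk
    hRk.takeRight(10)                 // 10 least significant bytes
  }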

Scenarios:

  1. messages sent to a single client: just one MAC is needed, using the client's session key
  2. messages sent to replicas: one MAC per replica/session key is generated; all of them are attached as a vector to the message (or just the MACs of those replicas which will receive the message)

View Change Messages

They have to be signed with a strong cryptographic strategy like RSA.

Modify storage api to load storage implementation dynamically

As the storage system can be implemented in multiple ways (in-memory, persistent (RDB, key-value store, ...)), a flexible API is needed. This can be achieved by using Akka actors. The current LogStorage extension shall not provide the storage code directly but send messages to the available implementation. Therefore, a number of messages are defined which have to be used by the storage modules:

  • start consensus for request: stores request r
  • add pre-prepare: stores pre-prepare message
  • add prepare: stores prepare message
  • add commit: stores commit message

The API has to wait until the storage process is completed for each message before continuing.

To determine the implementation at runtime, the names of the actors receiving the above messages have to be set (configured) in code or in a config file.
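
A sketch of the message protocol such storage actors could implement, assuming classic Akka actors; the message names mirror the list above, everything else is illustrative:

  import akka.actor.Actor

  object StorageProtocol {
    final case class StartConsensusForRequest(request: Array[Byte])
    final case class AddPrePrepare(message: Array[Byte])
    final case class AddPrepare(message: Array[Byte])
    final case class AddCommit(message: Array[Byte])
    case object Stored // acknowledgement so the API can wait for completion
  }

  // One possible implementation module: a simple in-memory log.
  class InMemoryLogStorage extends Actor {
    import StorageProtocol._
    private var log = Vector.empty[Array[Byte]]

    def receive: Receive = {
      case StartConsensusForRequest(r) => log :+= r; sender() ! Stored
      case AddPrePrepare(m)            => log :+= m; sender() ! Stored
      case AddPrepare(m)               => log :+= m; sender() ! Stored
      case AddCommit(m)                => log :+= m; sender() ! Stored
    }
  }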
