Now that we have completed the basic engine, can read and write segments from/to disk, and can operate through the plain GET/SET/DELETE API, we can move on to an initial replication scheme for the data.
To allow for multiple replicas, we must first add a controller for managing sockets and messages to and from the other replicas. This controller will need:
- a list of the other replicas and their information (replica role, last-message-received timestamp, IP + port)
- a factory for creating sockets to a specific replica
- an enumeration of the message types exchanged between replicas: ELECTION, LOG/ACK, HEARTBEAT, ABORT, and COMMIT, each with a request and a response (sketched below)
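A rough sketch of what this controller's state and message set might look like, in Rust (all names here are hypothetical, not taken from the existing codebase):

use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Instant;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Primary,
    Secondary,
}

// Per-replica bookkeeping: role, heartbeat timestamp, and address.
#[derive(Debug)]
struct ReplicaInfo {
    role: Role,
    last_msg_received: Instant,
    addr: SocketAddr, // IP + port
}

// Socket factory: open a connection to a specific replica.
fn connect(replica: &ReplicaInfo) -> io::Result<TcpStream> {
    TcpStream::connect(replica.addr)
}

// Message types exchanged between replicas; each request is paired
// with a response so every exchange can be acknowledged.
#[derive(Debug, Clone)]
enum Message {
    Election { pid: u64 },
    ElectionResponse { pid: u64 },
    Log { op: Vec<u8> }, // serialized operation to append to the WAL
    Ack,                 // response to Log
    Heartbeat,
    HeartbeatResponse,
    Commit,
    CommitAck,
    Abort,
    AbortAck,
}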
For the v1 replication, we will use a simple primary-secondary async replication scheme. One option, optimal for performance, would be replicating whole segments, shipping to the secondaries only when the primary flushes to disk. This would still be durable thanks to the WAL, but would not be resilient to primary failover, since every operation after the last flush would exist only on the primary.
We could instead log ship on each operation. This streams a much more frequent set of messages over the wire, but is far more resilient, specifically to primary failover.
We will use a 2PC style algorithm:
- the primary sends a LOG message with the operation to each replica
- each replica writes the operation to its WAL and sends an ACK
- the primary waits for an ACK from every replica, then sends a COMMIT and writes the operation to its own tree
- the replicas apply the operation to their in-memory trees
If the 2PC round times out, the primary sends an ABORT operation to each replica, which undoes the operation in both the WAL and the in-memory tree.
[P] ----- LOG (operation) ----> [S1, S2, ...]
[P] waits on <-- ACK --- for each [ ] in [ S1, S2, ...]
[P] ----- COMMIT ----> [S1, S2, ...]
if [P] times out waiting on ACK:
[P] ---- ABORT ----> [S1, S2, ...]
[P] waits on <-- ABORT ACK --- for each [ ] in [S1, S2, ...] until the timer passes
all secondaries that do not ABORT ACK are considered inconsistent until an ABORT ACK is received
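A sketch of the primary's side of this round, reusing the hypothetical Message enum above (the Transport trait and the timeout value are assumptions; networking and error handling are elided):

use std::time::Duration;

// Stand-in for the socket controller: broadcast to all secondaries and
// block until every secondary acks or the timeout passes.
trait Transport {
    fn broadcast(&self, msg: &Message);
    fn await_all_acks(&self, timeout: Duration) -> bool;
}

// One 2PC-style round on the primary; `op` is the serialized operation.
fn replicate(transport: &impl Transport, op: Vec<u8>) -> bool {
    let timeout = Duration::from_millis(500); // assumed round timeout

    // Phase 1: LOG. Each secondary appends op to its WAL and replies ACK.
    transport.broadcast(&Message::Log { op });

    if transport.await_all_acks(timeout) {
        // Phase 2: COMMIT. Secondaries apply op to their in-memory trees;
        // the primary writes op to its own WAL and tree at this point too.
        transport.broadcast(&Message::Commit);
        true
    } else {
        // Timed out: ABORT so the secondaries undo the WAL entry.
        transport.broadcast(&Message::Abort);
        // Secondaries that never ABORT ACK stay marked inconsistent
        // until their ABORT ACK eventually arrives.
        transport.await_all_acks(timeout);
        false
    }
}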
For leader election, we can also start with a simple scheme: the bully algorithm. On cluster startup we use synchronous communication for the election, passing PIDs between replicas, then switch to an asynchronous scheme once the election completes. Each replica configures itself as primary or secondary based on the result, and we continue to rely on log shipping as an implicit heartbeat, with additional periodic heartbeats sent to avoid re-election when the DB is idle.
[ ] <--- send PID ---> [ ] <--- send PID ---> [ ] <--- send PID ---> [ ] <--- send PID ---> [ ]
Note we are not using chaining for leader election or replication: election uses a star topology (each replica receives the election message from every other replica), and replication uses a primary-secondary architecture (the log is shipped from the primary to each secondary).
Election() {
    SendPIDToAllOthers()
    self.status = primary  // assume leadership until a higher PID is seen
    while PIDsReceived < TotalReplicas && !election_timeout_passed {
        receivedPID = WaitForNextPID()
        if receivedPID > myPID { self.status = secondary }
    }
    // only go active if every replica was heard from before the timeout
    if PIDsReceived == TotalReplicas {
        self.active = true
    }
}
HeartBeat() {
    if time_since_msg_from_primary > threshold { initiate election }
}
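In real code this check might be a small timer kept on each secondary (a minimal sketch; the names and threshold are assumptions):

use std::time::{Duration, Instant};

// Secondary-side monitor: re-enter election if the primary goes silent.
struct HeartbeatMonitor {
    last_primary_msg: Instant,
    threshold: Duration,
}

impl HeartbeatMonitor {
    // Call on every message from the primary (LOG traffic or HEARTBEAT).
    fn record_primary_msg(&mut self) {
        self.last_primary_msg = Instant::now();
    }

    // Poll periodically; true means initiate an election.
    fn primary_timed_out(&self) -> bool {
        self.last_primary_msg.elapsed() > self.threshold
    }
}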
For rolling back operations, we will need either to add a rollback function to the tree, which removes the most recent value for the key from the log segment (note it does not set a tombstone; it deletes the entry entirely), or to add an additional variant to TriSome, a ROLLEDBACK node, so that the value is skipped during reads of the tree.
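If we take the second route, the change might look roughly like this (the existing TriSome variant names are guesses; only RolledBack is new, and this assumes reads scan a key's entries newest-to-oldest):

// Hypothetical shape of the per-key value enum with the new variant.
enum TriSome<V> {
    Some(V),    // live value
    Tombstone,  // DELETEd: still a meaningful marker on reads
    RolledBack, // aborted 2PC round: skipped entirely on reads
}

// Reads skip RolledBack entries and fall through to the previous value,
// while a Tombstone still terminates the scan (the key is deleted).
fn lookup<'a, V>(versions: impl Iterator<Item = &'a TriSome<V>>) -> Option<&'a V>
where
    V: 'a,
{
    for v in versions {
        match v {
            TriSome::Some(val) => return Some(val),
            TriSome::Tombstone => return None,
            TriSome::RolledBack => continue,
        }
    }
    None
}

One trade-off to keep in mind: the delete-entirely route means rewriting a segment in place, while the marker route fits an append-only log better at the cost of carrying dead entries until compaction.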