
probable-fiesta's Introduction

Probable Fiesta

A log-structured merge-tree based key-value store.

The goals of this project are to provide a data store with

  1. simple key-value dictionary semantics
  2. atomic and durable operations
  3. serializability
  4. high throughput

The approach to solving these problems includes

  1. Write-ahead logging, to persist operations as they occur, prior to mutating state.
  2. Structured segments without duplicates, with merging of segments from most-to-least recently used to ensure sequential writes.
  3. Primary-secondary architecture, with log shipping for replication, writes served by the primary only, and asynchronous replication to the secondaries.

This key-value store is intended to serve as a system cache: the schema is not as structured as a relational database or even a hierarchical key-value store, but instead serves high-throughput reads and writes of simple key-value pairs.


probable-fiesta's Issues

Leader election and Replication v1

Now that we have completed the basic engine, read and write segments from/to disk, and can operate using the plain GET/SET/DELETE API, we can move on to an initial replication scheme for the data.

To first allow for multiple replicas, we must add a controller for managing sockets and messages to and from other replicas. This controller will need

  1. a list of each other replica and their information (replica role, last message received timestamp, IP + port)
  2. a factory for creating sockets to a specific replica
  3. an enumeration of the possible messages to send to other replicas, including ELECTION request/response, LOG/ACK request/response, HEARTBEAT request/response, ABORT request/response, and COMMIT request/response (see the sketch below)
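A minimal sketch of what these controller pieces could look like. All of the names below (ReplicaRole, ReplicaInfo, Message, connect) are assumptions for illustration, not existing code:

use std::net::{SocketAddr, TcpStream};
use std::time::Instant;

// Illustrative types for the replication controller.
#[derive(Clone, Copy, PartialEq)]
enum ReplicaRole {
    Primary,
    Secondary,
}

// Per-replica bookkeeping: role, last message received timestamp, IP + port.
struct ReplicaInfo {
    role: ReplicaRole,
    last_message_at: Instant,
    addr: SocketAddr,
}

// Factory for creating a socket to a specific replica.
fn connect(replica: &ReplicaInfo) -> std::io::Result<TcpStream> {
    TcpStream::connect(replica.addr)
}

// Messages exchanged between replicas; each request has a matching response.
enum Message {
    Election { pid: u64 },
    ElectionResponse { pid: u64 },
    Log { operation: String },
    LogAck,
    Heartbeat,
    HeartbeatAck,
    Abort,
    AbortAck,
    Commit,
    CommitAck,
}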

For the v1 replication, we will use a simple scheme of primary-secondary async replication. For optimal performance, one scheme would be replicating segments, only replicating to secondaries when we flush to disk on the primary. This would still be durable due to the WAL, but would not be resilient to primary failover.

We could also log ship on each operation, and would be streaming a more frequent set of operations over the wire, but would be much more resilient (specifically to primary failover).

We will use a 2PC style algorithm:

  1. the primary will send a LOG to each replica with the operation
  2. the replicas will write this to their WAL and send an ACK
  3. the primary will wait on an ACK from each replica before sending a COMMIT, as well as writing the operation to its tree
  4. the replicas will append the operation to their in-memory tree

If the 2PC round times out, then the primary will send an ABORT operation to each replica, which will undo the operation in the WAL and in-memory tree.

[P] ----- LOG (operation) ----> [S1, S2, ...]
[P] waits on <-- ACK --- for each [ ] in [ S1, S2, ...]
[P] ----- COMMIT ----> [S1, S2, ...]

if [P] times out waiting on ACK:
[P] ---- ABORT ----> [S1, S2, ...]
[P] ---- WAIT ON <-- ABORT ACK --- for each [ ] in [S1, S2, ...] until the timer expires
all secondaries which do not ABORT ACK are considered inconsistent until they send an ABORT ACK
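Putting the round together, the primary's side could look like the sketch below. The Replication trait, its methods, and AckResult are assumptions for illustration; none of them exist in the code yet:

use std::time::Duration;

// Outcome of waiting for ACKs from all secondaries.
enum AckResult {
    AllAcked,
    TimedOut { missing: Vec<usize> }, // secondaries that never ACKed
}

// Assumed interface of the replication controller.
trait Replication {
    fn send_log_to_all(&mut self, operation: &str);
    fn send_commit_to_all(&mut self);
    fn send_abort_to_all(&mut self);
    fn wait_for_acks(&mut self, timeout: Duration) -> AckResult;
    fn apply_locally(&mut self, operation: &str);
    fn mark_inconsistent(&mut self, replicas: &[usize]);
}

// One 2PC-style round driven by the primary.
fn replicate<R: Replication>(ctrl: &mut R, operation: &str, timeout: Duration) {
    // Phase 1: ship the operation so each secondary writes it to its WAL.
    ctrl.send_log_to_all(operation);

    match ctrl.wait_for_acks(timeout) {
        AckResult::AllAcked => {
            // Phase 2: apply on the primary and tell the secondaries to commit.
            ctrl.apply_locally(operation);
            ctrl.send_commit_to_all();
        }
        AckResult::TimedOut { missing } => {
            // Abort; secondaries that never ABORT ACK stay marked inconsistent.
            ctrl.send_abort_to_all();
            ctrl.mark_inconsistent(&missing);
        }
    }
}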

For leader election, we can also start with a simple scheme using the bully algorithm. We can start up the cluster and use synchronous communication for leader election, passing the PIDs between the replicas, before switching to an asynchronous scheme after the election. Each replica will configure itself as a primary or secondary based on the election result, and we can continue to rely on log shipping as a heartbeat, with additional periodic heartbeats sent to avoid re-election when the DB is idle.

[ ] <--- send PID ---> [ ] <--- send PID ---> [ ] <--- send PID ---> [ ] <--- send PID ---> [ ]

Note we are not using chaining for leader election or replication: election will use a star topology (each replica receives a leader election message from each other replica) and replication will use a primary-secondary architecture (the log is shipped from the primary to each secondary).

Election() {
    SendPIDToAllOthers()
    // assume primary until a replica with a higher PID responds
    self.status = primary
    while PIDsReceived < TotalReplicas && not_passed_election_timeout {
        if receivedPID > myPID { self.status = secondary }
    }
    if PIDsReceived == TotalReplicas {
        self.active = true
    }
}

HeartBeat() {
if time_since_msg_from_primary > threshold {initiate election}
}

For rolling back operations, we will either need to add a rollback function to the tree, which removes the node holding the most recent value for the key from the log segment (note it does not set a tombstone, but rather deletes the node entirely), or we will need to add an additional value to TriSome, a ROLLEDBACK node, so that this value is skipped in the tree.
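A sketch of the second option. The existing variants of TriSome shown here are guesses for illustration and may not match the real definition:

enum TriSome<T> {
    Some(T),      // live value for the key
    Tombstone,    // deleted key, kept so the delete propagates through merges
    Nil,          // no value
    RolledBack,   // aborted 2PC round: skipped entirely during reads and merges
}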

Implement standalone client and server

Right now, the project is basically just a library of functions for maintaining LSM trees, and an internal API for CRUD on these LSMs. We don't have any real way of using or invoking these outside of tests right now, so we need to implement some executables on top of this library.

Client: read commands from the CLI and construct frames to send to the server, using a simple frame protocol
Server: receive frames from clients, deconstruct and map them to the internal API, and invoke the LSM library
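Since the frame protocol is not yet specified, here is one possible minimal layout as an assumption: a one-byte command tag followed by length-prefixed key and value fields.

// Hypothetical frame format: [cmd: u8][key_len: u32][key][val_len: u32][val].
fn encode_frame(cmd: u8, key: &str, value: &str) -> Vec<u8> {
    let mut frame = Vec::new();
    frame.push(cmd); // e.g. 0 = GET, 1 = SET, 2 = DELETE
    frame.extend_from_slice(&(key.len() as u32).to_be_bytes());
    frame.extend_from_slice(key.as_bytes());
    frame.extend_from_slice(&(value.len() as u32).to_be_bytes());
    frame.extend_from_slice(value.as_bytes());
    frame
}

The server would then read the tag and the two lengths, reconstruct the command, and dispatch it to the internal API.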

Separate print log and disk log

With the inclusion of multiple replicas and complications like leader election, we will need a separate disk log for recording progress. We still want a debugging log, so we will need to keep printlns too, and extend the log function to write to either the disk log or the print log depending on the value of LOGGING (extend this from a bool to a triple of OFF/ON/DISK, or add an additional parameter for disk vs. debug logging).
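For example, LOGGING could become a three-valued mode along these lines; the LogMode name and the file path handling are assumptions:

use std::fs::OpenOptions;
use std::io::Write;

// Hypothetical replacement for the boolean LOGGING flag.
enum LogMode {
    Off,
    Print,        // println! debugging output
    Disk(String), // path of the on-disk progress log
}

fn log(mode: &LogMode, msg: &str) {
    match mode {
        LogMode::Off => {}
        LogMode::Print => println!("{}", msg),
        LogMode::Disk(path) => {
            // Append to the disk log; errors are ignored in this sketch.
            if let Ok(mut file) = OpenOptions::new().create(true).append(true).open(path) {
                let _ = writeln!(file, "{}", msg);
            }
        }
    }
}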

Profile different segment sizes for read/write speed

We will need to measure the time to read and write segments of varying sizes from and to disk, and determine the ideal segment size to use in practice.

Though this should ideally be configurable as part of table creation, we can still use this profiling to determine the default configuration.
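A rough harness for this could time flushes of a few candidate sizes with std::time::Instant; write_segment_of_size is a hypothetical stand-in for the real segment flush path:

use std::time::Instant;

// Hypothetical stand-in for building an in-memory segment of `n` entries
// and flushing it to disk via the real segment-persistence path.
fn write_segment_of_size(n: usize) {
    let _ = n; // ... construct and flush a segment of n key-value pairs ...
}

fn main() {
    // Time a flush for a range of candidate segment sizes.
    for &size in &[1_000usize, 10_000, 100_000, 1_000_000] {
        let start = Instant::now();
        write_segment_of_size(size);
        println!("segment of {} entries flushed in {:?}", size, start.elapsed());
    }
}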

Wrap log segment nodes in Option

Right now, the log segment BST uses Box members to represent the left and right children, which leads to unnecessary allocations of our Nil type for them. We can instead wrap the Box in an Option so that absent children require no allocation.
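Roughly, the change looks like the following; the node shapes and field names are illustrative, not the actual definitions in tree.rs:

// Before (illustrative): absent children are boxed Nil nodes, so every leaf
// still allocates two Nil values.
enum NodeWithNil {
    Nil,
    Node {
        key: String,
        value: String,
        left: Box<NodeWithNil>,
        right: Box<NodeWithNil>,
    },
}

// After: absent children are None, so no allocation is needed for them.
struct Node {
    key: String,
    value: String,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}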

Generic LSM tree

Currently, the LSM tree backing for this DB is constrained to string tuples only. This should eventually be made generic, and would mostly come down to:

  1. Using generics for the LSM definition
  2. The appropriate steps needed for marshaling non-string data types to/from log segment files on disk (see the sketch below)
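In outline, the generic definition might bound keys and values by a serialization trait. The Marshal trait and the simplified fields below are placeholders, not the crate's actual types:

// Placeholder trait for marshaling values to/from log segment files on disk.
trait Marshal: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Option<Self>;
}

// Sketch of a generic tree; the fields are simplified stand-ins.
struct LsmTree<K: Ord + Marshal, V: Marshal> {
    segments: Vec<Vec<(K, V)>>, // stand-in for the real log segment structure
    log_file: String,           // WAL path, as in the current LsmTree
}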

Restore log segments from WAL

PF uses a write-ahead log for ensuring durability of operations (see the member log_file of LsmTree in lsm.rs). This log file is written to for each operation, prior to mutating the latest log segment (stored in memory) to include that operation. It ensures we have an accurate copy of any operations in memory if the database were to crash.

Right now, we already write to this WAL, but we don't have the ability to restore this WAL to memory on start up. To allow for this, we will need to add support for a few different steps

  1. Tracking the most recently persisted operation in the WAL, and periodic truncation of the WAL: Once an operation in the WAL is persisted to disk, we should no longer include this in the WAL. To do this, we need to add some mechanism for tracking this operation, and add a background cleanup thread to truncate the WAL.
  2. Startup restore task: On startup of an existing LSM tree, we should add a step to check for the existence of a log file (equivalent to checking if the tree is a new or existing one), and if so, replay all of the operations in the WAL. With the current WAL file format, we would loop through each line of the WAL and execute a SET operation for this tuple (a replay sketch follows this list).
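A minimal sketch of that replay step, assuming a one-operation-per-line "key,value" WAL format (both the format and the map used as a stand-in for the tree are assumptions):

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Replay the WAL into an in-memory map; in the real tree this would execute
// the LSM's SET path instead.
fn restore_from_wal(path: &str) -> std::io::Result<HashMap<String, String>> {
    let mut restored = HashMap::new();
    // A missing log file means this is a new tree with nothing to restore.
    let file = match File::open(path) {
        Ok(f) => f,
        Err(_) => return Ok(restored),
    };
    for line in BufReader::new(file).lines() {
        let line = line?;
        if let Some((key, value)) = line.split_once(',') {
            // equivalent to executing a SET for this tuple
            restored.insert(key.to_string(), value.to_string());
        }
    }
    Ok(restored)
}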

To test restoring the WAL, we will want to include the following scenarios

  1. Validate restoring an existing LSM from the WAL: Create an LSM and seed it with N items, then simulate a crash, and verify after restart that the LSM has N items in the latest log segment.
  2. Validate sequential writes after restore: Create and seed an LSM with more than one log segment, and in the latest log segment include a key that had occurred in a prior log segment on disk. Simulate a crash, and verify after restart that the value for this key is the one from the latest (restored) log segment. Note that this test requires first completing the "Persist log segments to disk" item below.

Persist log segments to disk

Currently, there is an in-memory structure for log segments (see LogSegment in tree.rs). These include any tuples added by the client during the process lifetime, but are volatile. We need to add additional support to persist this structure to a file. The things we need to work out are:

  1. What sort of file format do we want to use? (the log segments themselves are implemented as binary search trees)

To implement this persistence of log segments, we need to:

  1. Add a max size for log segments to the LSM tree.
  2. Add a trigger on SET operations to the LSM: once the max size has been reached, write the log segment to disk and destroy the in-memory log segment (on the next write only).
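A minimal sketch of that trigger, with assumed field and method names rather than the real LsmTree API:

// Illustrative only: the fields and helpers here are assumptions.
struct LsmTree {
    latest_segment: Vec<(String, String)>, // stand-in for the in-memory BST
    max_segment_size: usize,
    segments_on_disk: usize,
}

impl LsmTree {
    // SET: flush the previous segment on the next write after it filled up.
    fn set(&mut self, key: String, value: String) {
        if self.latest_segment.len() >= self.max_segment_size {
            self.flush_latest_segment();
        }
        self.latest_segment.push((key, value));
    }

    fn flush_latest_segment(&mut self) {
        // ... write the segment contents to a new file on disk here ...
        self.segments_on_disk += 1;
        self.latest_segment.clear(); // destroy the in-memory segment
    }
}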

The above is a good enough first step to resolve this issue, but later we may have more issues to consider, for example:

  1. What is the optimal format to store these log segments, in terms of compacting the files as much as possible?
  2. What is the optimal format for quickly reading and writing log segments to/from memory?
  3. What platform-specific issues do we need to investigate for these files? (For now, we will focus on a Unix-compliant persistence mechanism, but if we later wanted to use this database on Windows or another platform, would there be any additional issues beyond different filesystem naming conventions, e.g. different behavior when reading/writing these files?)

For validation, we can test the following scenarios:

  1. Trigger persisting a log segment: Create an LSM and seed it past the max size threshold for a single log segment; once we pass the max size, verify that the latest log segment is empty.

Add query engine and operators

Right now, the storage system is in progress, but it shouldn't be exposed directly by the finished database. Instead, we will need to implement some operators to wrap the storage system, as well as the corresponding client commands. We will use a simple querying interface consisting of GET/SET/DELETE only. To do this, we will need to:

  1. Add a query parsing front end, to read client input and compose to the corresponding operator and arguments
  2. Implement operators to handle each different command, and invoke the corresponding internal operation on the storage system (LSM tree)
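A minimal sketch of the parsing front end; the Command enum and parse function are assumptions for illustration:

// Hypothetical operator representation for the three supported commands.
enum Command {
    Get { key: String },
    Set { key: String, value: String },
    Delete { key: String },
}

// Parse a client line such as "SET foo bar" into an operator.
fn parse(input: &str) -> Option<Command> {
    let mut parts = input.split_whitespace();
    match parts.next()? {
        "GET" => Some(Command::Get { key: parts.next()?.to_string() }),
        "DELETE" => Some(Command::Delete { key: parts.next()?.to_string() }),
        "SET" => Some(Command::Set {
            key: parts.next()?.to_string(),
            value: parts.next()?.to_string(),
        }),
        _ => None,
    }
}

Each Command would then map one-to-one onto the corresponding internal operation on the LSM tree.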

Add segment ID to DiskSegment

Right now, for ordering disk segments, we parse the ID from the segment path on each comparison. We need to change this so that the ID is parsed only on creation and reclaiming of segments, and stored as a member of the structure.
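Sketched below with an assumed path convention of a numeric suffix after an underscore; the real naming scheme and fields may differ:

use std::path::PathBuf;

// Parse the segment ID from the path once, at creation/reclaim time, so
// ordering becomes a plain integer comparison.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct DiskSegment {
    id: u64, // placed first so the derived ordering compares IDs
    path: PathBuf,
}

impl DiskSegment {
    fn new(path: PathBuf) -> Option<Self> {
        // Assumed path convention: ".../segment_<id>" with a numeric suffix.
        let id = path.file_name()?.to_str()?.rsplit('_').next()?.parse().ok()?;
        Some(DiskSegment { id, path })
    }
}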
