A sharded, fault-tolerant, persistent key/value store that's hopefully fast.
This project is inspired by the MIT Distributed Systems class, but doesn't use any of its codebase.
The sharded key/value store consists of a shard orchestrator and multiple replication groups, each of which is a Raft cluster.
within a Raft cluster
- node join
- start cluster
- request vote
- append entry
replication group
- Get <- fe
- Put <- fe
- Append <- fe
- Delete <- fe
- get shard <- rg
shard orchestrator
- client registration <- fe, rg
- Get Configuration <- fe, rg
- join <- fe
- leave <- fe
Idempotency is achieved by clientID + request sequence number.
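The clientID + request-sequence-number scheme could be sketched roughly like this (all names here are illustrative, not from the codebase):

```go
package main

import "fmt"

// Illustrative dedup table: the server remembers the highest sequence
// number it has applied per client and rejects anything at or below it.
type DedupTable struct {
	lastSeq map[int64]int64 // clientID -> highest applied sequence number
}

func NewDedupTable() *DedupTable {
	return &DedupTable{lastSeq: make(map[int64]int64)}
}

// Seen reports whether the request is a duplicate; new requests are recorded.
func (d *DedupTable) Seen(clientID, seq int64) bool {
	if seq <= d.lastSeq[clientID] {
		return true // duplicate: reply with the cached result instead
	}
	d.lastSeq[clientID] = seq
	return false
}

func main() {
	d := NewDedupTable()
	fmt.Println(d.Seen(7, 1)) // first delivery: false
	fmt.Println(d.Seen(7, 1)) // retransmission: true
}
```

This assumes clients issue sequence numbers monotonically, which also lets the server discard cached results for anything below the latest acknowledged number.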
shard orchestrator clients:
- Admin: unique ID, registered
- Client: unique ID, registered
- InternalClient: unique ID, registered
replication group clients:
- Client: same ID as corresponding shard orch client, passed in
- Transferer:
- Transferers in the same replication group should have the same ID, registered idempotently using groupID as idempotencyID
- the sharding-kicker Transferer in the ShardOrchestrator should have ID 0 (the default)
Shard Orchestrator decides mapping between replication group and shards.
The Query API of the Shard Orchestrator returns a configuration by version number (or maybe just the latest one). A Query from a replication group should contain a list of the current addresses within the group (for membership changes).
Join, Leave, Move should change the current configuration and increment the version number.
If no replication group currently exists in the configuration, the shard orchestrator virtually owns all shards. When the first replication group joins, the shard orchestrator pushes ownership of all shards to that replication group to get things started.
A client operation on a KV pair should start by querying the configuration, then talking to the owning replication group.
Replication groups periodically query the latest configuration and transfer shards between each other. A configuration contains:
- version number
- map of GID to list of addr
- map of shard to GID
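A configuration record with those three fields might look like the following sketch (field names and the fixed shard count are assumptions):

```go
package main

import "fmt"

const NShards = 8 // assumed fixed number of logical shards

// Assumed shape of a configuration; GID 0 stands for "owned by no group"
// (i.e. virtually owned by the shard orchestrator).
type Config struct {
	Version int                // bumped by every Join/Leave/Move
	Groups  map[int64][]string // GID -> addresses of the group's nodes
	Shards  [NShards]int64     // shard index -> owning GID
}

func main() {
	c := Config{Version: 1, Groups: map[int64][]string{100: {"10.0.0.1:7000", "10.0.0.2:7000"}}}
	for i := range c.Shards {
		c.Shards[i] = 100 // e.g. the first group owns everything
	}
	fmt.Println(c.Version, c.Shards[3], len(c.Groups[100]))
}
```

Keeping `Shards` as a fixed-size value type makes configurations cheap to copy when the version is bumped.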
All nodes in a replication group periodically query the shard orchestrator for the latest configuration. On a configuration change, a node updates its local configuration and asks the Raft protocol to perform the necessary shard-outs. This step is only effective on the leader of the Raft cluster; nodes in other states should ignore the request.
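The leader-only rule in that polling step can be captured in one small decision function (a sketch under assumed names, not the actual code):

```go
package main

import "fmt"

// decideOnPoll returns the config version to adopt locally and whether this
// node should propose shard-out entries through Raft. Only a leader
// proposes; followers and candidates just record the newer version.
func decideOnPoll(localVersion, latestVersion int, isLeader bool) (adopt int, propose bool) {
	if latestVersion <= localVersion {
		return localVersion, false // nothing new since the last poll
	}
	return latestVersion, isLeader
}

func main() {
	v, p := decideOnPoll(3, 4, true) // leader sees a newer config
	fmt.Println(v, p)
	v, p = decideOnPoll(3, 4, false) // follower adopts but stays passive
	fmt.Println(v, p)
}
```

In the real system this function would run inside the periodic polling loop, with the proposal handed to the Raft layer.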
A shard out stores a Raft log entry. Once committed, processing this entry
- reads the local KV pairs within the shard
- makes a shard transfer RPC call to the target replication group with the data
- waits for a success reply, then hides that shard in the state machine, so future client requests on that shard fail
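The steps above can be sketched with the transfer RPC stubbed out as a callback (all names here are assumptions):

```go
package main

import "fmt"

type KV map[string]string

// Sketch of the per-node state machine: shard data plus a set of shards
// this group no longer serves.
type StateMachine struct {
	data   map[int]KV
	hidden map[int]bool
}

// applyShardOut runs when a shard-out log entry commits: read the shard,
// push it to the target group, and only hide it after a success reply.
func (s *StateMachine) applyShardOut(shard int, transfer func(KV) error) error {
	payload := s.data[shard] // 1. read local KV pairs within the shard
	if err := transfer(payload); err != nil {
		return err // 2. RPC failed; retry after the next periodic poll
	}
	s.hidden[shard] = true // 3. future client requests on this shard fail
	delete(s.data, shard)
	return nil
}

func main() {
	sm := &StateMachine{data: map[int]KV{4: {"a": "1"}}, hidden: map[int]bool{}}
	_ = sm.applyShardOut(4, func(KV) error { return nil })
	fmt.Println(sm.hidden[4]) // true: shard is hidden after a success reply
}
```

Note that nothing is deleted locally until the target acknowledges, so a crashed transfer can safely be retried.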
When a replication group receives a shard transfer RPC call, it
- checks for idempotency and returns immediately if the request has been seen
- stores a "shard in" log entry containing the data
- on processing, puts all the KV pairs into the kvstore and takes ownership of the shard
- returns success
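The receiving side could look like this sketch, with the idempotency check simplified to a seen-set keyed by a transfer ID (all names assumed):

```go
package main

import "fmt"

type KV map[string]string

// Sketch of the shard-in path: dedup first, then (conceptually after the
// "shard in" log entry commits) install the pairs and take ownership.
type Group struct {
	data map[int]KV
	own  map[int]bool
	seen map[string]bool // transferID -> already handled
}

func (g *Group) HandleShardTransfer(transferID string, shard int, kv KV) string {
	if g.seen[transferID] {
		return "OK" // duplicate call: success was already recorded
	}
	g.seen[transferID] = true
	// In the real system this mutation happens when the Raft entry commits.
	if g.data[shard] == nil {
		g.data[shard] = KV{}
	}
	for k, v := range kv {
		g.data[shard][k] = v
	}
	g.own[shard] = true // take ownership of the shard
	return "OK"
}

func main() {
	g := &Group{data: map[int]KV{}, own: map[int]bool{}, seen: map[string]bool{}}
	g.HandleShardTransfer("g100-v4-s2", 2, KV{"x": "9"})
	g.HandleShardTransfer("g100-v4-s2", 2, KV{"x": "9"}) // duplicate: no-op
	fmt.Println(g.own[2], g.data[2]["x"])
}
```

Using the sender's group ID plus the config version as the transfer ID would make retries from any node in the sending group idempotent.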
For shard-in requests, a replication group doesn't check against its local configuration and trusts the caller to make the right call. In case of a discrepancy (a lagging local or remote config), this should resolve itself after the next periodic query.
APIs
- Join
- Leave
- Move
- Query
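A minimal in-memory sketch of two of these APIs (Join and Query) shows the copy-and-bump versioning, including the first-group bootstrap described earlier; Leave and Move would follow the same pattern. Everything here is an assumption about shape, not the real implementation:

```go
package main

import "fmt"

const NShards = 8 // assumed shard count

type Config struct {
	Version int
	Groups  map[int64][]string
	Shards  [NShards]int64 // GID 0 = virtually owned by the orchestrator
}

// Orchestrator keeps the full config history, indexed by version number.
type Orchestrator struct {
	configs []Config
}

func NewOrchestrator() *Orchestrator {
	return &Orchestrator{configs: []Config{{Version: 0, Groups: map[int64][]string{}}}}
}

// Join copies the latest config, bumps the version, and adds the group.
// The first group to join receives ownership of every shard.
func (o *Orchestrator) Join(gid int64, addrs []string) {
	cur := o.configs[len(o.configs)-1]
	next := Config{Version: cur.Version + 1, Groups: map[int64][]string{}, Shards: cur.Shards}
	for g, a := range cur.Groups {
		next.Groups[g] = a
	}
	next.Groups[gid] = addrs
	if len(next.Groups) == 1 {
		for i := range next.Shards {
			next.Shards[i] = gid
		}
	}
	o.configs = append(o.configs, next)
}

// Query returns the config at a version; any out-of-range version
// (e.g. -1) returns the latest.
func (o *Orchestrator) Query(version int) Config {
	if version < 0 || version >= len(o.configs) {
		return o.configs[len(o.configs)-1]
	}
	return o.configs[version]
}

func main() {
	o := NewOrchestrator()
	o.Join(100, []string{"10.0.0.1:7000"})
	c := o.Query(-1)
	fmt.Println(c.Version, c.Shards[0])
}
```

Keeping the whole history lets lagging replication groups query old versions, which matters once shard transfers happen one config step at a time.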
A replication group manages one or more logical shards of the whole dataset, using the Raft protocol to reach consensus within the group.
Supported operations
- Get: returns nil when the key doesn't exist instead of throwing an error
- Put: stores a key-value pair, overwriting the old value when the key already exists
- Append: appends the passed-in value to the end of the existing value; equivalent to Put if the key does not previously exist
- Delete: deletes the key from the store, passing silently (no error) if the key doesn't exist
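The semantics of the four operations can be shown on a plain map (a sketch; the real store sits behind Raft, and "returns nil" is rendered here as Go's `ok` boolean since Go strings can't be nil):

```go
package main

import "fmt"

// Sketch of the state-machine semantics for the four client operations.
type Store struct {
	m map[string]string
}

// Get reports a missing key via ok=false rather than an error.
func (s *Store) Get(k string) (string, bool) {
	v, ok := s.m[k]
	return v, ok
}

// Put overwrites any existing value.
func (s *Store) Put(k, v string) { s.m[k] = v }

// Append extends the existing value; on a missing key it behaves like Put.
func (s *Store) Append(k, v string) { s.m[k] += v }

// Delete passes silently when the key does not exist.
func (s *Store) Delete(k string) { delete(s.m, k) }

func main() {
	s := &Store{m: map[string]string{}}
	s.Append("k", "a") // key absent: acts like Put
	s.Append("k", "b")
	s.Delete("missing") // no error
	v, ok := s.Get("k")
	fmt.Println(v, ok)
}
```

Both Append-on-missing and Delete-on-missing fall out of Go's map semantics for free, which keeps the state machine branch-free.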
Consensus protocol. Each node in the cluster manages a KVStore state machine to replicate the key/value pairs stored by clients.
Available commands for clients
- get
- put
- append
- delete
Commands for other replication groups
- getShard: the callee returns and deletes the local shard (or maybe postpones the delete?)
- ownShard: the callee takes ownership of the passed-in shard
- put
- get
- append
- delete
- raft
- import old raft implementation
- optimization
- membership change
- log compaction
- client request idempotency cache and cleaning up
- redesign and refactor RPC calls flow
- single replication group
- KV store as state machine
- implementation
- test
- shard kv machine
- integration
- replication group node
- refactor existing raft implementation
- fix and refactor old tests
- client
- change to ack finished request seq
- integration tests
- KV store as state machine
- shard orchestrator
- configuration & config machine
- API
- rpc set up
- client
- frontend client
- internal client
- integration
- sharding replication group state machine
- API and communication between replication groups and shard orchestrator
- shard transfer between replication groups
- persistence
- deployment ready and demo