rdms's Issues

Milestone-1

Goal for Milestone-1 is to provide a minimum viable product for enthusiastic users and early adopters.

  • Rdms indexes to support the CRUD API (see the sketch after the footnotes below).
    • Rdms instances can be composed using <K, V, M, D>:
      • K, Key type.
      • V, Value type.
      • M, Memory-index-type.
      • D, Disk-index-type.
    • Use atomic, mutex, spin-lock, and copy-on-write mechanisms to support API-level concurrency for readers and writers.
    • Allow concurrent writer threads to access an Rdms index [1].
    • Allow concurrent reader threads to access an Rdms index [2].
  • CRUD operations on DB.
    • Write API: set(), set_cas(), delete().
    • Read API: get(), range(), reverse(), iter().
    • Scan API: full_scan().
  • Centralized-version-control-mechanism for values, tracking issue.
    • Implement associated Delta type for each index-value type.
    • diff() API to create a delta between old and new value.
    • merge() API to create old-value from delta and new-value.
    • ..._with_versions() flavor for read-api to fetch all the versions of index-value.
  • WAL (Write-Ahead-Logging).
    • To ensure the durability guarantee.
    • Shards to match I/O Concurrency, one file for each shard.
    • Shareable writer-handle for batching, improves SSD throughput.
    • Rotate log files for each shard.
    • Configurable journal-limit for each log-file.
  • Piece-wise full table scan, tracking issue.
    • Best case, if the memory-index can allow concurrent full-table-scan without any additional penalties.
    • Better case, if the memory-index can allow concurrent full-table-scan without blocking writes.
    • Good case, if the memory-index can allow concurrent full-table-scan without blocking writes for too long (< 1ms).
  • Latch and spin for non-blocking single-writer, multi-reader concurrency, tracking-issue.
  • Log-Structured-Merge
    • Multiple levels of index.
    • Use memory-optimized data structure for mem-index.
    • Use disk-optimized data structure for disk-index.
    • count() can only be approximate.
  • Memory data-structures.
    • Left-Leaning-Red-Black tree.
      • Fastest single-threaded write path for a tree data-structure.
      • Fast read performance.
      • Good-case scenario for piece-wise full table scan.
    • LLRB with multi-version-concurrency-control.
      • Concurrent reads along with single-threaded writes.
      • Fast read performance.
      • Better-case scenario for piece-wise full table scan.
  • Disk data-structures.
    • Read-Only-BTree.
      • Build tree from an iterator.
      • Fully packed, with > 95% node-utilization.
      • Value co-located in leaf-node or stored in separate value-log file.
      • Deltas stored in separate value-log file.
  • Key traits for the following types:
    • u32, i32, u64, i64, u128, i128.
    • char, String, Vec.
    • Array type.
  • Value traits for the following types:
    • u32, i32, u64, i64, u128, i128.
    • char, String, Vec.
    • Array type.
  • Empty Value type.
  • Unit test-case target 80% code coverage.
  • Rustdoc documentation under docs.rs.
  • First alpha-release.
  • Nightly release process, tracking issue.

Footnotes:

  1. Only when the memory-index-type (M) and disk-index-type (D) support write-concurrency.
  2. Only when the memory-index-type (M) and disk-index-type (D) support read-concurrency.
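
To make the <K, V, M, D> composition concrete, here is a minimal sketch of how such an index might be parameterised; the type layout and method bodies are illustrative assumptions, not the crate's actual definitions.

    // Hypothetical sketch of the Rdms composition; names are assumptions.
    use std::marker::PhantomData;
    use std::ops::RangeBounds;

    /// An index parameterised over key, value, memory-index and disk-index.
    struct Rdms<K, V, M, D> {
        mem: M,  // memory-index-type, e.g. an Llrb or Mvcc tree.
        disk: D, // disk-index-type, e.g. a Robt snapshot.
        _marker: PhantomData<(K, V)>,
    }

    impl<K, V, M, D> Rdms<K, V, M, D> {
        // Write API.
        fn set(&mut self, _key: K, _value: V) { /* ... */ }
        fn set_cas(&mut self, _key: K, _value: V, _cas: u64) { /* ... */ }
        fn delete(&mut self, _key: &K) { /* ... */ }

        // Read API.
        fn get(&self, _key: &K) -> Option<V> { /* ... */ None }
        fn range<R: RangeBounds<K>>(&self, _range: R) { /* ... */ }
    }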

ROBT: Do we need checksums for disk blocks?

Recent abstractions in compute design and architecture have delegated memory/network/disk corruption detection and recovery to the hardware and firmware layers. Nevertheless, it is worth brainstorming this problem in application space.

Dgm: m0_max_limit.

Intelligently handle the maximum footprint limit for the m0 level. Once footprint() reaches 90% of the limit, trigger an auto_commit() and slow down the write ingestion rate.
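
A minimal sketch of the intended trigger; footprint(), auto_commit() and the throttle hook are hypothetical names, not Dgm's actual API.

    // Hypothetical sketch; the method names are assumptions.
    struct Dgm {}

    impl Dgm {
        fn footprint(&self) -> usize { 0 }          // current m0 footprint, in bytes.
        fn auto_commit(&mut self) {}                // flush m0 into the next level.
        fn throttle_writes(&mut self, _on: bool) {} // slow down write ingestion.
    }

    fn maybe_commit_m0(dgm: &mut Dgm, m0_max_limit: usize) {
        // Trigger once footprint() reaches 90% of the configured limit.
        if dgm.footprint() * 10 >= m0_max_limit * 9 {
            dgm.auto_commit();
            dgm.throttle_writes(true);
        }
    }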

Replace Empty type with never type (!)

We use the Empty type across rdms type mechanics to fill implementation gaps. Although it works well, we would like to consider using ! (the never type) instead of Empty. One immediate challenge is the conversion constraint we have, for instance in trait Index { type O: From<ffi::OsString> }.
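
A sketch of the constraint; the trait shown here is a simplification, not the crate's actual Index trait.

    use std::ffi;

    // Today: a zero-sized Empty type fills the associated-type slot.
    struct Empty;

    impl From<ffi::OsString> for Empty {
        fn from(_: ffi::OsString) -> Empty { Empty }
    }

    trait Index { type O: From<ffi::OsString>; }

    struct MemOnly;
    impl Index for MemOnly { type O = Empty; }

    // With the never type, the equivalent impl cannot be written,
    // because no value of `!` can ever be produced from an OsString:
    //
    //     impl Index for MemOnly { type O = !; } // From<OsString> for ! is unsatisfiable.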

LLRB: Implement purge()

A purge() API makes sense in LSM mode. The behavior should be such that:

  1. For all keys, go through the log of all mutations, and remove all mutations
    from the latest delete operation and everything before it.
  2. If the latest mutation on a key is a delete operation, then remove that entry
    from the tree.

A variant of this API, purge_key(), should apply the above behaviour to
a single specified key.
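
A sketch of the intended behaviour over a single entry's mutation log; the Mutation and Entry shapes are assumptions made for illustration.

    // Hypothetical entry layout; the real LLRB entry differs.
    enum Mutation<V> { Upsert(V), Delete }

    struct Entry<V> { log: Vec<Mutation<V>> } // newest mutation last.

    /// Returns true when the whole entry should be removed from the tree.
    fn purge_entry<V>(entry: &mut Entry<V>) -> bool {
        // Rule 2: the latest mutation is a delete, so drop the entry.
        if matches!(entry.log.last(), Some(Mutation::Delete)) {
            return true;
        }
        // Rule 1: remove the latest delete and everything before it.
        if let Some(pos) = entry.log.iter().rposition(|m| matches!(m, Mutation::Delete)) {
            entry.log.drain(..=pos);
        }
        false
    }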

Milestone-2

Goal for Milestone-2 is to deliver a production-ready product, with beta testing.

  • ACID compliance.
  • Memory data-structures.
    • Skip-list.
      • Good concurrent read performance.
      • Supports concurrent writes.
      • Suitable for managing a cache of frequently read entries.
      • Best-case scenario for piece-wise full table scan.
      • Won't support reverse-iteration.
  • Disk data-structures.
    • Append-Only-BTree.
      • Useful when index operations have a very low write ratio.
      • Compaction cycle can affect disk-amplification.
      • Lower write-amplification.
      • Value co-located in leaf-node or stored in separate value-log file.
      • Deltas stored in separate value-log file.
    • Integrate disk-based indexes with WAL for the durability guarantee.
  • Key traits for the following types:
    • u8, i8, u16, i16, usize, isize.
    • f32, f64 - custom type with total-ordering.
    • (A, B), (A, B, C).
  • Value traits for the following types:
    • u8, i8, u16, i16, usize, isize.
    • f32, f64 - custom type with total-ordering.
    • (A, B), (A, B, C).
  • Unit test-case target 90% code coverage.
  • Rustdoc documentation.
  • Rustbook documentation.
  • Beta release process.

Beta-release checklist

The following test suites, listed one per line below, shall be run on rdms
artifacts.

  • All daily unit test cases, skipping ignored test cases.
  • All daily unit test cases, including ignored test cases.
  • Fuzz testing for Llrb memory-index.
  • Fuzz testing for Mvcc memory-index.
  • Fuzz testing for Robt disk-index.
  • Fuzz testing for rdms index.
  • Fuzz testing for Wal logging.
  • Fuzz testing for SkipScan algorithm.
  • Fuzz testing for LSM lookup algorithm.
  • Performance benchmark Llrb memory-index.
  • Performance benchmark Mvcc memory-index.
  • Performance benchmark Robt disk-index.
  • Performance benchmark Rdms index.
  • Performance benchmark Wal logging.
  • Performance benchmark SkipScan algorithm.
  • Performance benchmark LSM lookup algorithm.
  • Valgrind runs, with triage of any failures, for all applicable test suites and performance suites.

Latch-And-Spin mechanism.

Read-Write-Spinlock implements the latch-and-spin mechanism normally used for non-blocking concurrency.

Blocking concurrency can have an impact on latency. When operations that require read-write exclusion are going to be quick and short, we can use non-blocking primitives like latch-and-spin.

What is latch-and-spin?

In typical multi-core processors, concurrent read operations are always safe and consistent. But they become unsafe when a writer is concurrently modifying data while readers are loading it from memory.

The latch-and-lock mechanism can be used when we want to allow one or more concurrent writers along with readers.

Imagine a door leading into a room. This door has some special properties:

  1. The door has a latch and a lock.
  2. A reader can enter the room only when the door is un-locked and un-latched.
  3. A writer can enter the room only when the door is un-locked, un-latched, and there is no other reader or writer in the room.
  4. Once the door is latched by a writer, no other writer or reader can enter the room, because of properties (2) and (3). But all readers who are already inside the room can finish their job and then exit.
  5. A writer can enter the room only after locking the door, which the writer can do only after all the readers have exited.
  6. When trying to acquire read-permission or write-permission, the caller thread shall spin until the conditions from (1) to (5) are met. Once a thread acquires the necessary permission, it can continue to finish its job and then release the permission (see the sketch below).
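
The door analogy maps naturally onto a single atomic word. Below is a minimal sketch, not the crate's actual spinlock; a production version would add backoff and more careful failure handling.

    use std::hint;
    use std::sync::atomic::{AtomicU32, Ordering};

    const LATCH: u32 = 1 << 30; // a writer has latched the door.
    const LOCK: u32 = 1 << 31;  // a writer has locked the door.
    // The low bits count readers currently inside the room.

    struct RwSpinlock { door: AtomicU32 }

    impl RwSpinlock {
        fn acquire_read(&self) {
            loop {
                let seen = self.door.load(Ordering::Acquire);
                // Property 2: enter only when un-locked and un-latched.
                if seen & (LATCH | LOCK) == 0
                    && self.door
                        .compare_exchange_weak(seen, seen + 1, Ordering::AcqRel, Ordering::Relaxed)
                        .is_ok()
                {
                    return;
                }
                hint::spin_loop();
            }
        }

        fn release_read(&self) {
            self.door.fetch_sub(1, Ordering::Release);
        }

        fn acquire_write(&self) {
            // Property 4: latch the door so nobody new can enter.
            loop {
                let seen = self.door.load(Ordering::Acquire);
                if seen & (LATCH | LOCK) == 0
                    && self.door
                        .compare_exchange_weak(seen, seen | LATCH, Ordering::AcqRel, Ordering::Relaxed)
                        .is_ok()
                {
                    break;
                }
                hint::spin_loop();
            }
            // Property 5: spin until readers inside have exited, then lock.
            while self
                .door
                .compare_exchange_weak(LATCH, LATCH | LOCK, Ordering::AcqRel, Ordering::Relaxed)
                .is_err()
            {
                hint::spin_loop();
            }
        }

        fn release_write(&self) {
            self.door.store(0, Ordering::Release);
        }
    }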

Piece-wise full table scan

Rdms indexes can support full table scan by providing an iterator handle to iterate over every single entry in sort order. Since rdms, by design, also allows concurrent write operations while a scan is in progress, this may lead to an unstable scan if not implemented properly.

One mechanism used by Rdms to implement a stable scan is the piece-wise scan. This mechanism is applicable only when entries are indexed in lsm mode. The basic idea is to do the range scan in short batches, say 1000 entries per batch, and drop the read-handle after each batch is read, thereby avoiding resource pressure on the index and/or holding its rw-locks. To begin with, the caller should compute the seqno range (start, end] that should be picked from the index, and a begin-key:

  • start-seqno means all entries and versions whose seqno is at or before the start-seqno have already been seen by the caller.
  • end-seqno is the index sequence-number just before starting the full-table-scan.
  • begin-key, for every batch, is the last batch's last entry-key. It is implied that the first batch's begin-key would be Bound::Unbounded.
  • Only entries and versions whose seqno falls within the (start, end] range are returned.
  • Only entries that come after the begin-key are returned.
  • Once the iteration for a batch exceeds a limit, the iterator is dropped and a new batch is started (sketched below).
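
A sketch of the batching loop; the Index, Entry and range_from() names are assumptions made for illustration.

    use std::ops::Bound;

    // Hypothetical index and entry shapes, assumed for this sketch.
    struct Entry { key: u64, seqno: u64 }
    struct Index;
    impl Index {
        // Short-lived read handle over (begin, Unbounded); an assumption.
        fn range_from(&self, _begin: Bound<u64>) -> Vec<Entry> { Vec::new() }
    }

    fn piece_wise_scan(index: &Index, start: u64, end: u64, batch: usize) -> Vec<Entry> {
        let mut out = Vec::new();
        let mut begin = Bound::Unbounded; // first batch's begin-key.
        loop {
            let mut last_key = None;
            for entry in index.range_from(begin).into_iter().take(batch) {
                last_key = Some(entry.key);
                // Return only entries whose seqno falls within (start, end].
                if entry.seqno > start && entry.seqno <= end {
                    out.push(entry);
                }
            }
            // The read-handle is dropped here; the next batch resumes
            // after the last key seen, avoiding a long-held rw-lock.
            match last_key {
                Some(key) => begin = Bound::Excluded(key),
                None => break out, // iteration exhausted.
            }
        }
    }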

Cascading thread fail.

The rdms index, by default, supports a concurrent, thread-safe API. Some components within rdms also spawn threads. This means both locally spawned threads and the rdms-handles used by application threads can suffer failure, in which case the failure must cascade across all the threads.

IMHO it is a bit early to work out the cascading-failure mechanics. We may first have to characterize the index functions and performance to implement the right method for cascading.

Reader APIs for value versions.

This issue is applicable only when rdms indexes are
configured to manage older versions for each entry.

Following are the read APIs defined by the Reader trait.

  • get()
  • iter()
  • range()
  • reverse()

The above APIs are the faster variants of reading into indexes,
which means the returned Entry may or may not include all of its
previous versions (mutations).

Add variants of these APIs:

  • get_versions()
  • iter_versions()
  • range_versions()
  • reverse_versions()

where the returned Entry shall always include all available
previous versions, but can be slower due to disk-lookup and
merge cost.
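
As a sketch, the split between the two families might look like this; the signatures are illustrative, not the crate's actual Reader trait.

    // Illustrative sketch; the real trait signatures differ.
    trait Reader<K, V> {
        type Entry;

        // Fast variant: the Entry may or may not carry previous versions.
        fn get(&self, key: &K) -> Option<Self::Entry>;

        // Slow variant: the Entry always carries every available version,
        // possibly paying disk-lookup and merge cost.
        fn get_versions(&self, key: &K) -> Option<Self::Entry>;
    }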

Dgm: Verify mono-cutoff, lsm-cutoff, tombstone-cutoff.

Unlike Robt instances, Dgm has multiple levels of disk snapshots, and cutoff is applied only to the final, highest disk-level. Validation for this needs to be done by parsing the disk files outside the Dgm instance.

LLRB: Implement undo().

An undo() API makes sense in LSM mode. The behavior should be such that:

  1. The last Create/Update operation can be un-done, so that entry points to the mutation prior to that, including the seqno of that mutation.
  2. The last Delete operation can be un-done, so that entry points to the Create/Update mutation prior to that, including the seqno of that mutation.
  3. If there is only one mutation in the entry's log and if that mutation is a Create or Update, then remove the entry from the tree.
  4. Back-to-back delete operations collapse into a no-op. This means undo() will affect the first delete operation and behave similarly to (2).
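
A sketch of rules (1) through (4) over a hypothetical mutation log; seqno bookkeeping is omitted, and the entry layout is an assumption.

    // Hypothetical entry layout; newest mutation last.
    enum Mutation<V> { Upsert(V), Delete }
    struct Entry<V> { log: Vec<Mutation<V>> }

    /// Returns true when the entry should be removed from the tree.
    fn undo_entry<V>(entry: &mut Entry<V>) -> bool {
        // Rule 4: back-to-back deletes collapse, so pop down to the first.
        while entry.log.len() > 1
            && matches!(entry.log[entry.log.len() - 1], Mutation::Delete)
            && matches!(entry.log[entry.log.len() - 2], Mutation::Delete)
        {
            entry.log.pop();
        }
        entry.log.pop();     // Rules 1 and 2: undo the last mutation.
        entry.log.is_empty() // Rule 3: an empty log removes the entry.
    }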

Robt: Crash recovery for Robt index.

A Robt index is built once and used as an immutable index thereafter. If there is a crash during the build phase, there needs to be a recovery plan.

Version control for values

Rdms, when configured with version control, won't destroy older values for index entries. In other words, everything from the first mutation that CREATEs a {key,value} entry in the index, through all subsequent DELETE and UPDATE mutations, can be preserved in the index.

There are two flavours of version control: centralised and distributed.

Centralised version control assumes a single instance of serialized writes into the index. This implies that, for every version of a value, its parent version is its previous version stored in the index, and shall remain so for the rest of the index's life. Rdms's centralised version control has the following properties:

  1. The value-type should provide an associated Delta type D.
  2. Let P be the parent value and C be the child value.
  3. The value-type shall provide a diff-algorithm to compute D = P - C.
  4. The value-type shall provide a merge-algorithm to compute P = C + D.
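
Expressed as a trait sketch (the names are assumptions; the crate's actual trait may differ):

    // Hypothetical trait capturing properties (1) through (4).
    trait Diff: Sized {
        type Delta;

        /// D = P - C: the delta between the parent and the child value.
        fn diff(parent: &Self, child: &Self) -> Self::Delta;

        /// P = C + D: recover the parent from the child and the delta.
        fn merge(child: &Self, delta: &Self::Delta) -> Self;
    }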

Distributed version control assumes multiple instances of an index, replicated across different nodes, where at least two or more instances can ingest write operations. This places a few additional requirements on the version control system:

  1. Every version must be identified with a unique SHA.
  2. Every version must identify its parent version via its SHA.
  3. Value-types must support 3-way merge.

Distributed version control won't be part of Milestone-1.

LLRB: Review footprint calculation.

Given that footprint calculation needs to be managed across set/set_cas/delete and their entry variants, along with managing previous delta versions, it gets quite complicated. Review the present implementation for correctness and document it.

core: IndexIter optimizations.

Using trait objects gives the convenience of abstraction, but performance can be an issue. One such design choice is the core::IndexIter trait object.

Validate the performance penalty of using IndexIter and, if possible, replace it with concrete types/traits.
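
The trade-off in miniature (illustrative, not the crate's types): a trait-object iterator pays a virtual call per next(), while a generic iterator is monomorphised and can inline.

    // Trait-object iterator: convenient, but every next() is a virtual call.
    fn count_dyn(iter: Box<dyn Iterator<Item = u64>>) -> usize {
        iter.count()
    }

    // Generic iterator: monomorphised per concrete type, so calls can inline.
    fn count_generic<I: Iterator<Item = u64>>(iter: I) -> usize {
        iter.count()
    }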

Package management.

Several crates like llrb, wral, and robt have been moved back into the rdms crate, where they are kept as fairly independent modules. We still plan to maintain them as independent crates for the foreseeable future. These crates have proper package management in place; we have to check their independent crate form and make sure it is migrated into the rdms crate.

  • bin/rdms can be used for performance testing and nightly testing.
  • check.sh needs to be migrated to use bin/rdms as the validation program.
  • perf.sh needs to be migrated to use bin/rdms as the validation program.

Nightly continuous integration.

At present we have reached a state where there are enough unit test suites and a large enough source base to build momentum for this project. This is a tracking issue to figure out a nightly continuous-integration process, targeted for Milestone-1.

Steps of execution.

  1. Check out all the relevant repositories for rdms from GitHub, along with the latest commits.
  2. Generate library artifacts, by compiling them in
    • release mode.
    • debug mode.
  3. Invoke a sequence of command-line programs that should typically validate the artifacts generated in step (2).
    • Each command shall return a success exit code 0, or a failure code.
    • Each command shall log messages to stdout and stderr that must be captured and saved in a file.
  4. Nightly shall parse the stdout/stderr messages for warnings and error cases.
  5. Finally, nightly should be able to give a "green", "yellow", or "red" flag, which respectively mean "pass", "warn", and "fail".
  6. Additionally, nightly shall commit the result status and the stdout/stderr file into a git repository.
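
Steps (3) through (5) could be driven by a small runner along these lines; the command, its flag, and the file paths are assumptions for illustration.

    use std::fs;
    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // Hypothetical validation command; bin/rdms and --validate are assumptions.
        let out = Command::new("./bin/rdms").arg("--validate").output()?;

        // Step 3: capture stdout and stderr into the nightly log file.
        let mut log = Vec::new();
        log.extend_from_slice(&out.stdout);
        log.extend_from_slice(&out.stderr);
        fs::create_dir_all("out/2020")?;
        fs::write("out/2020/2020-01-01-nightly.log", &log)?;

        // Steps 4 and 5: parse the messages and reduce them to a flag.
        let text = String::from_utf8_lossy(&log);
        let status = if !out.status.success() || text.contains("error") {
            "red" // fail
        } else if text.contains("warning") {
            "yellow" // warn
        } else {
            "green" // pass
        };
        fs::write("out/2020/2020-01-01-nightly.status", status)?;
        Ok(())
    }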

Organisation of nightly-repository

    rdms-nightly/
        bin/nightly-script
        out/<yyyy>/<yyyy-mm-dd>-nightly.log
        out/<yyyy>/<yyyy-mm-dd>-nightly.status

out/<yyyy>/<yyyy-mm-dd>-nightly.log shall contain the stdout/stderr messages from the nightly script. out/<yyyy>/<yyyy-mm-dd>-nightly.status shall contain the parsed result from out/<yyyy>/<yyyy-mm-dd>-nightly.log.

ROBT: mmap implementation.

Right now mmap is implemented for the index-file. robt can be configured to use a separate value-log file, but the mmap feature is not implemented for the value-log file. We should experiment with an mmap implementation for the value-log file and measure read performance.
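
A sketch of what the experiment might look like using the memmap2 crate (choosing memmap2 is an assumption; the experiment may pick another crate):

    use std::fs::File;

    // Requires the `memmap2` crate; using it here is an assumption.
    fn read_value_at(vlog_path: &str, offset: usize, len: usize) -> std::io::Result<Vec<u8>> {
        let file = File::open(vlog_path)?;
        // Safety: a Robt value-log is immutable once built, so the mapping is stable.
        let mmap = unsafe { memmap2::Mmap::map(&file)? };
        Ok(mmap[offset..offset + len].to_vec())
    }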
