rdms's Issues

Milestone-1

Goal for Milestone-1 is to provide a minimum viable product for enthusiastic users and early adopters.

  • Rdms indexes to support the CRUD API (see the sketch after the footnotes below).
    • Rdms instances can be composed using <K, V, M, D>:
      • K, Key type.
      • V, Value type.
      • M, Memory-index-type.
      • D, Disk-index-type.
    • Use atomic, mutex, spin-lock, and copy-on-write mechanisms to support API-level concurrency for readers and writers.
    • Allow concurrent writer threads to access an Rdms index [1].
    • Allow concurrent reader threads to access an Rdms index [2].
  • CRUD operations on DB.
    • Write API: set(), set_cas(), delete().
    • Read API: get(), range(), reverse(), iter().
    • Scan API: full_scan().
  • Centralized-version-control-mechanism for values, tracking issue.
    • Implement associated Delta type for each index-value type.
    • diff() API to create a delta between old and new value.
    • merge() API to create old-value from delta and new-value.
    • ..._with_versions() flavor for read-api to fetch all the versions of index-value.
  • WAL (Write-Ahead-Logging).
    • To ensure the durability guarantee.
    • Shards to match I/O Concurrency, one file for each shard.
    • Shareable writer-handle for batching, improves SSD throughput.
    • Rotate log files for each shard.
    • Configurable journal-limit for each log-file.
  • Piece-wise full table scan, tracking issue.
    • Best case, if the memory-index can allow concurrent full-table-scan without any additional penalties.
    • Better case, if the memory-index can allow concurrent full-table-scan without blocking writes.
    • Good case, if the memory-index can allow concurrent full-table-scan without blocking writes for too long (< 1ms).
  • Latch and spin for non-blocking single-writer, multi-reader concurrency, tracking-issue.
  • Log-Structured-Merge
    • Multiple levels of index.
    • Use memory-optimized data structure for mem-index.
    • Use disk-optimized data structure for disk-index.
    • count() can only be approximate.
  • Memory data-structures.
    • Left-Leaning-Red-Black tree.
      • Fastest single-threaded write path for a tree data-structure.
      • Fast read performance.
      • Good-case scenario for piece-wise full table scan.
    • LLRB with multi-version-concurrency-control.
      • Concurrent reads along with single-threaded writes.
      • Fast read performance.
      • Better-case scenario for piece-wise full table scan.
  • Disk data-structures.
    • Read-Only-BTree.
      • Build tree from an iterator.
      • Fully packed, with > 95% node-utilization.
      • Value co-located in leaf-node or stored in separate value-log file.
      • Deltas stored in separate value-log file.
  • Key traits for the following types:
    • u32, i32, u64, i64, u128, i128.
    • char, String, Vec.
    • Array type.
  • Value traits for the following types:
    • u32, i32, u64, i64, u128, i128.
    • char, String, Vec.
    • Array type.
  • Empty Value type.
  • Unit test-case target 80% code coverage.
  • Rustdoc documentation under docs.rs.
  • First alpha-release.
  • Nightly release process, tracking issue.

Footnotes:

  1. Only when the memory-index-type (M) and disk-index-type (D) support write-concurrency.
  2. Only when the memory-index-type (M) and disk-index-type (D) support read-concurrency.
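
To make the <K, V, M, D> composition concrete, here is a minimal sketch of how such an index might be parameterised; the type layout and method bodies are illustrative assumptions, not the crate's actual definitions.

    // Hypothetical sketch of the Rdms composition; names are assumptions.
    use std::marker::PhantomData;
    use std::ops::RangeBounds;

    /// An index parameterised over key, value, memory-index and disk-index.
    struct Rdms<K, V, M, D> {
        mem: M,  // memory-index-type, e.g. an Llrb or Mvcc tree.
        disk: D, // disk-index-type, e.g. a Robt snapshot.
        _marker: PhantomData<(K, V)>,
    }

    impl<K, V, M, D> Rdms<K, V, M, D> {
        // Write API.
        fn set(&mut self, _key: K, _value: V) { /* ... */ }
        fn set_cas(&mut self, _key: K, _value: V, _cas: u64) { /* ... */ }
        fn delete(&mut self, _key: &K) { /* ... */ }

        // Read API.
        fn get(&self, _key: &K) -> Option<V> { /* ... */ None }
        fn range<R: RangeBounds<K>>(&self, _range: R) { /* ... */ }
    }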

ROBT: Do we need checksums for disk blocks?

Recent abstractions in compute design and architecture have delegated memory/network/disk corruption detection and recovery to the hardware and firmware layers. Nevertheless, it is worth brainstorming this problem in application space.

Dgm: m0_max_limit.

Intelligently handle the maximum footprint limit for the m0 level. Once footprint() reaches 90% of the limit, trigger an auto_commit() and slow down the write ingestion rate.
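
A minimal sketch of the intended trigger; footprint(), auto_commit() and the throttle hook are hypothetical names, not Dgm's actual API.

    // Hypothetical sketch; the method names are assumptions.
    struct Dgm {}

    impl Dgm {
        fn footprint(&self) -> usize { 0 }          // current m0 footprint, in bytes.
        fn auto_commit(&mut self) {}                // flush m0 into the next level.
        fn throttle_writes(&mut self, _on: bool) {} // slow down write ingestion.
    }

    fn maybe_commit_m0(dgm: &mut Dgm, m0_max_limit: usize) {
        // Trigger once footprint() reaches 90% of the configured limit.
        if dgm.footprint() * 10 >= m0_max_limit * 9 {
            dgm.auto_commit();
            dgm.throttle_writes(true);
        }
    }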

Replace Empty type with never type (!)

We use the Empty type across rdms type mechanics to fill implementation gaps. Although it works well, we would like to consider using ! (the never type) instead of Empty. One immediate challenge is the conversion constraint we have, for instance in trait Index { type O: From<ffi::OsString> }.
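
A sketch of the constraint; the trait shown here is a simplification, not the crate's actual Index trait.

    use std::ffi;

    // Today: a zero-sized Empty type fills the associated-type slot.
    struct Empty;

    impl From<ffi::OsString> for Empty {
        fn from(_: ffi::OsString) -> Empty { Empty }
    }

    trait Index { type O: From<ffi::OsString>; }

    struct MemOnly;
    impl Index for MemOnly { type O = Empty; }

    // With the never type, the equivalent impl cannot be written,
    // because no value of `!` can ever be produced from an OsString:
    //
    //     impl Index for MemOnly { type O = !; } // From<OsString> for ! is unsatisfiable.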

LLRB: Implement purge()

A purge() API makes sense in LSM mode. The behavior should be such that:

  1. For all keys, go through the log of all mutations, and remove all mutations
    from the latest delete operation and everything before it.
  2. If the latest mutation on a key is a delete operation, then remove that entry
    from the tree.

A variant of this API, purge_key(), should apply the above behaviour to
a single specified key.
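
A sketch of the intended behaviour over a single entry's mutation log; the Mutation and Entry shapes are assumptions made for illustration.

    // Hypothetical entry layout; the real LLRB entry differs.
    enum Mutation<V> { Upsert(V), Delete }

    struct Entry<V> { log: Vec<Mutation<V>> } // newest mutation last.

    /// Returns true when the whole entry should be removed from the tree.
    fn purge_entry<V>(entry: &mut Entry<V>) -> bool {
        // Rule 2: the latest mutation is a delete, so drop the entry.
        if matches!(entry.log.last(), Some(Mutation::Delete)) {
            return true;
        }
        // Rule 1: remove the latest delete and everything before it.
        if let Some(pos) = entry.log.iter().rposition(|m| matches!(m, Mutation::Delete)) {
            entry.log.drain(..=pos);
        }
        false
    }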

Milestone-2

Goal for Milestone-2 is to deliver a production-ready product, with beta testing.

  • ACID compliance.
  • Memory data-structures.
    • Skip-list.
      • Good concurrent read performance.
      • Supports concurrent writes.
      • Suitable for managing a cache of frequently read entries.
      • Best-case scenario for piece-wise full table scan.
      • Won't support reverse-iteration.
  • Disk data-structures.
    • Append-Only-BTree.
      • Useful when index operations have a very low write ratio.
      • Compaction cycle can affect disk-amplification.
      • Lower write-amplification.
      • Value co-located in leaf-node or stored in separate value-log file.
      • Deltas stored in separate value-log file.
    • Integrate disk-based indexes with WAL for the durability guarantee.
  • Key traits for the following types:
    • u8, i8, u16, i16, usize, isize.
    • f32, f64 - custom type with total-ordering.
    • (A, B), (A, B, C).
  • Value traits for the following types:
    • u8, i8, u16, i16, usize, isize.
    • f32, f64 - custom type with total-ordering.
    • (A, B), (A, B, C).
  • Unit test-case target 90% code coverage.
  • Rustdoc documentation.
  • Rustbook documentation.
  • Beta release process.

Beta-release checklist

The following test suites, listed one per line below, shall be run on rdms
artifacts.

  • All daily unit test cases, skipping ignored test cases.
  • All daily unit test cases, including ignored test cases.
  • Fuzz testing for Llrb memory-index.
  • Fuzz testing for Mvcc memory-index.
  • Fuzz testing for Robt disk-index.
  • Fuzz testing for rdms index.
  • Fuzz testing for Wal logging.
  • Fuzz testing for SkipScan algorithm.
  • Fuzz testing for LSM lookup algorithm.
  • Performance benchmark Llrb memory-index.
  • Performance benchmark Mvcc memory-index.
  • Performance benchmark Robt disk-index.
  • Performance benchmark Rdms index.
  • Performance benchmark Wal logging.
  • Performance benchmark SkipScan algorithm.
  • Performance benchmark LSM lookup algorithm.
  • Valgrind runs, with triage of any failures, for all applicable test suites and performance suites.

Latch-And-Spin mechanism.

Read-Write-Spinlock implements the latch-and-spin mechanism normally used for non-blocking concurrency.

Blocking concurrency can have an impact on latency. When operations that require read-write exclusion are going to be quick and short, we can use non-blocking primitives like latch-and-spin.

What is latch-and-spin?

In typical multi-core processors, concurrent read operations are always safe and consistent. But they become unsafe when a writer is concurrently modifying data while readers are loading it from memory.

The latch-and-lock mechanism can be used when we want to allow one or more concurrent writers along with readers.

Imagine a door leading into a room. This door has some special properties:

  1. The door has a latch and a lock.
  2. A reader can enter the room only when the door is un-locked and un-latched.
  3. A writer can enter the room only when the door is un-locked, un-latched, and there is no other reader or writer in the room.
  4. Once the door is latched by a writer, no other writer or reader can enter the room, because of properties (2) and (3). But all readers who are already inside the room can finish their job and then exit.
  5. A writer can enter the room only after locking the door, which the writer can do only after all the readers have exited.
  6. When trying to acquire read-permission or write-permission, the caller thread shall spin until the conditions from (1) to (5) are met. Once a thread acquires the necessary permission, it can continue to finish its job and then release the permission (see the sketch below).
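
The door analogy maps naturally onto a single atomic word. Below is a minimal sketch, not the crate's actual spinlock; a production version would add backoff and more careful failure handling.

    use std::hint;
    use std::sync::atomic::{AtomicU32, Ordering};

    const LATCH: u32 = 1 << 30; // a writer has latched the door.
    const LOCK: u32 = 1 << 31;  // a writer has locked the door.
    // The low bits count readers currently inside the room.

    struct RwSpinlock { door: AtomicU32 }

    impl RwSpinlock {
        fn acquire_read(&self) {
            loop {
                let seen = self.door.load(Ordering::Acquire);
                // Property 2: enter only when un-locked and un-latched.
                if seen & (LATCH | LOCK) == 0
                    && self.door
                        .compare_exchange_weak(seen, seen + 1, Ordering::AcqRel, Ordering::Relaxed)
                        .is_ok()
                {
                    return;
                }
                hint::spin_loop();
            }
        }

        fn release_read(&self) {
            self.door.fetch_sub(1, Ordering::Release);
        }

        fn acquire_write(&self) {
            // Property 4: latch the door so nobody new can enter.
            loop {
                let seen = self.door.load(Ordering::Acquire);
                if seen & (LATCH | LOCK) == 0
                    && self.door
                        .compare_exchange_weak(seen, seen | LATCH, Ordering::AcqRel, Ordering::Relaxed)
                        .is_ok()
                {
                    break;
                }
                hint::spin_loop();
            }
            // Property 5: spin until readers inside have exited, then lock.
            while self
                .door
                .compare_exchange_weak(LATCH, LATCH | LOCK, Ordering::AcqRel, Ordering::Relaxed)
                .is_err()
            {
                hint::spin_loop();
            }
        }

        fn release_write(&self) {
            self.door.store(0, Ordering::Release);
        }
    }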

Piece-wise full table scan

Rdms indexes can support full table scan by providing an iterator handle to iterate over every single entry in sort order. Since rdms, by design, also allows concurrent write operations while a scan is in progress, this may lead to an unstable scan if not implemented properly.

One mechanism used by Rdms to implement a stable scan is the piece-wise scan. This mechanism is applicable only when entries are indexed in lsm mode. The basic idea is to do the range scan in short batches, say 1000 entries per batch, and drop the read-handle after each batch is read, thereby avoiding resource pressure on the index and/or holding its rw-locks. To begin with, the caller should compute the seqno range (start, end] that should be picked from the index, and a begin-key:

  • start-seqno means all entries and versions whose seqno is at or before the start-seqno have already been seen by the caller.
  • end-seqno is the index sequence-number just before starting the full-table-scan.
  • begin-key, for every batch, is the last batch's last entry-key. It is implied that the first batch's begin-key would be Bound::Unbounded.
  • Only entries and versions whose seqno falls within the (start, end] range are returned.
  • Only entries that come after the begin-key are returned.
  • Once the iteration for a batch exceeds a limit, the iterator is dropped and a new batch is started (sketched below).
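
A sketch of the batching loop; the Index, Entry and range_from() names are assumptions made for illustration.

    use std::ops::Bound;

    // Hypothetical index and entry shapes, assumed for this sketch.
    struct Entry { key: u64, seqno: u64 }
    struct Index;
    impl Index {
        // Short-lived read handle over (begin, Unbounded); an assumption.
        fn range_from(&self, _begin: Bound<u64>) -> Vec<Entry> { Vec::new() }
    }

    fn piece_wise_scan(index: &Index, start: u64, end: u64, batch: usize) -> Vec<Entry> {
        let mut out = Vec::new();
        let mut begin = Bound::Unbounded; // first batch's begin-key.
        loop {
            let mut last_key = None;
            for entry in index.range_from(begin).into_iter().take(batch) {
                last_key = Some(entry.key);
                // Return only entries whose seqno falls within (start, end].
                if entry.seqno > start && entry.seqno <= end {
                    out.push(entry);
                }
            }
            // The read-handle is dropped here; the next batch resumes
            // after the last key seen, avoiding a long-held rw-lock.
            match last_key {
                Some(key) => begin = Bound::Excluded(key),
                None => break out, // iteration exhausted.
            }
        }
    }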

Cascading thread fail.

The rdms index, by default, supports a concurrent, thread-safe API. Some components within rdms also spawn threads. This means both locally spawned threads and the rdms-handles used by application threads can suffer failure, in which case the failure must cascade across all the threads.

IMHO it is a bit early to work out the cascading-failure mechanics. We may first have to characterize the index functions and performance to implement the right method for cascading.

Reader APIs for value versions.

This issue is applicable only when rdms indexes are
configured to manage older versions for each entry.

Following are the read APIs defined by the Reader trait.

  • get()
  • iter()
  • range()
  • reverse()

The above APIs are the faster variants of reading into indexes,
which means the returned Entry may or may not include all of its
previous versions (mutations).

Add variants of these APIs:

  • get_versions()
  • iter_versions()
  • range_versions()
  • reverse_versions()

where the returned Entry shall always include all available
previous versions, but can be slower due to disk-lookup and
merge cost.
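
As a sketch, the split between the two families might look like this; the signatures are illustrative, not the crate's actual Reader trait.

    // Illustrative sketch; the real trait signatures differ.
    trait Reader<K, V> {
        type Entry;

        // Fast variant: the Entry may or may not carry previous versions.
        fn get(&self, key: &K) -> Option<Self::Entry>;

        // Slow variant: the Entry always carries every available version,
        // possibly paying disk-lookup and merge cost.
        fn get_versions(&self, key: &K) -> Option<Self::Entry>;
    }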

Dgm: Verify mono-cutoff, lsm-cutoff, tombstone-cutoff.

Unlike Robt instances, Dgm has multiple levels of disk snapshots, and cutoff is applied only to the final, highest disk-level. Validation for this needs to be done by parsing the disk files outside the Dgm instance.

LLRB: Implement undo().

An undo() API makes sense in LSM mode. The behavior should be such that:

  1. The last Create/Update operation can be un-done, so that entry points to the mutation prior to that, including the seqno of that mutation.
  2. The last Delete operation can be un-done, so that entry points to the Create/Update mutation prior to that, including the seqno of that mutation.
  3. If there is only one mutation in the entry's log and if that mutation is a Create or Update, then remove the entry from the tree.
  4. Back-to-back delete operations collapse into a no-op. This means undo() will affect the first delete operation and behave similarly to (2).
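
A sketch of rules (1) through (4) over a hypothetical mutation log; seqno bookkeeping is omitted, and the entry layout is an assumption.

    // Hypothetical entry layout; newest mutation last.
    enum Mutation<V> { Upsert(V), Delete }
    struct Entry<V> { log: Vec<Mutation<V>> }

    /// Returns true when the entry should be removed from the tree.
    fn undo_entry<V>(entry: &mut Entry<V>) -> bool {
        // Rule 4: back-to-back deletes collapse, so pop down to the first.
        while entry.log.len() > 1
            && matches!(entry.log[entry.log.len() - 1], Mutation::Delete)
            && matches!(entry.log[entry.log.len() - 2], Mutation::Delete)
        {
            entry.log.pop();
        }
        entry.log.pop();     // Rules 1 and 2: undo the last mutation.
        entry.log.is_empty() // Rule 3: an empty log removes the entry.
    }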

Robt: Crash recovery for Robt index.

A Robt index is built once and used as an immutable index thereafter. If there is a crash during the build phase, there needs to be a recovery plan.

Version control for values

Rdms, when configured with version control, won't destroy older values for index entries. In other words, everything from the first mutation that CREATEs a {key,value} entry in the index, through all subsequent DELETE and UPDATE mutations, can be preserved in the index.

There are two flavours of version control: centralised and distributed.

Centralised version control assumes a single instance of serialized writes into the index. This implies that, for every version of a value, its parent version is its previous version stored in the index, and shall remain so for the rest of the index's life. Rdms's centralised version control has the following properties:

  1. The value-type should provide an associated Delta type D.
  2. Let P be the parent value and C be the child value.
  3. The value-type shall provide a diff-algorithm to compute D = P - C.
  4. The value-type shall provide a merge-algorithm to compute P = C + D.
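
Expressed as a trait sketch (the names are assumptions; the crate's actual trait may differ):

    // Hypothetical trait capturing properties (1) through (4).
    trait Diff: Sized {
        type Delta;

        /// D = P - C: the delta between the parent and the child value.
        fn diff(parent: &Self, child: &Self) -> Self::Delta;

        /// P = C + D: recover the parent from the child and the delta.
        fn merge(child: &Self, delta: &Self::Delta) -> Self;
    }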

Distributed version control assumes multiple instances of an index, replicated across different nodes, where at least two or more instances can ingest write operations. This places a few additional requirements on the version control system:

  1. Every version must be identified with a unique SHA.
  2. Every version must identify its parent version via its SHA.
  3. Value-types must support 3-way merge.

Distributed version control won't be part of Milestone-1.

LLRB: Review footprint calculation.

Given that footprint calculation needs to be managed across set/set_cas/delete and their entry variants, along with managing previous delta versions, it gets quite complicated. Review the present implementation for correctness and document it.

core: IndexIter optimizations.

Using trait objects gives the convenience of abstraction, but performance can be an issue. One such design choice is the core::IndexIter trait object.

Validate the performance penalty of using IndexIter and, if possible, replace it with concrete types/traits.
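
The trade-off in miniature (illustrative, not the crate's types): a trait-object iterator pays a virtual call per next(), while a generic iterator is monomorphised and can inline.

    // Trait-object iterator: convenient, but every next() is a virtual call.
    fn count_dyn(iter: Box<dyn Iterator<Item = u64>>) -> usize {
        iter.count()
    }

    // Generic iterator: monomorphised per concrete type, so calls can inline.
    fn count_generic<I: Iterator<Item = u64>>(iter: I) -> usize {
        iter.count()
    }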

Package management.

Several crates like llrb, wral, and robt have been moved back into the rdms crate, where they are kept as fairly independent modules. We still plan to maintain them as independent crates for the foreseeable future. These crates have proper package management in place; we have to check their independent crate form and make sure it is migrated into the rdms crate.

  • bin/rdms can be used for performance testing and nightly testing.
  • check.sh needs to be migrated to use bin/rdms as the validation program.
  • perf.sh needs to be migrated to use bin/rdms as the validation program.

Nightly continuous integration.

At present we have reached a state where there are enough unit test suites and a large enough source base to build momentum for this project. This is a tracking issue to figure out a nightly continuous-integration process, targeted for Milestone-1.

Steps of execution.

  1. Check out all the relevant repositories for rdms from GitHub, along with the latest commits.
  2. Generate library artifacts, by compiling them in
    • release mode.
    • debug mode.
  3. Invoke a sequence of command-line programs that should typically validate the artifacts generated in step (2).
    • Each command shall return a success exit code 0, or a failure code.
    • Each command shall log messages to stdout and stderr that must be captured and saved in a file.
  4. Nightly shall parse the stdout/stderr messages for warnings and error cases.
  5. Finally, nightly should be able to give a "green", "yellow", or "red" flag, which respectively mean "pass", "warn", and "fail".
  6. Additionally, nightly shall commit the result status and the stdout/stderr file into a git repository.
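
Steps (3) through (5) could be driven by a small runner along these lines; the command, its flag, and the file paths are assumptions for illustration.

    use std::fs;
    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // Hypothetical validation command; bin/rdms and --validate are assumptions.
        let out = Command::new("./bin/rdms").arg("--validate").output()?;

        // Step 3: capture stdout and stderr into the nightly log file.
        let mut log = Vec::new();
        log.extend_from_slice(&out.stdout);
        log.extend_from_slice(&out.stderr);
        fs::create_dir_all("out/2020")?;
        fs::write("out/2020/2020-01-01-nightly.log", &log)?;

        // Steps 4 and 5: parse the messages and reduce them to a flag.
        let text = String::from_utf8_lossy(&log);
        let status = if !out.status.success() || text.contains("error") {
            "red" // fail
        } else if text.contains("warning") {
            "yellow" // warn
        } else {
            "green" // pass
        };
        fs::write("out/2020/2020-01-01-nightly.status", status)?;
        Ok(())
    }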

Organisation of nightly-repository

    rdms-nightly/
        bin/nightly-script
        out/<yyyy>/<yyyy-mm-dd>-nightly.log
        out/<yyyy>/<yyyy-mm-dd>-nightly.status

out/<yyyy>/<yyyy-mm-dd>-nightly.log shall contain the stdout/stderr messages from the nightly script. out/<yyyy>/<yyyy-mm-dd>-nightly.status shall contain the parsed result from out/<yyyy>/<yyyy-mm-dd>-nightly.log.

ROBT: mmap implementation.

Right now mmap is implemented for the index-file. robt can be configured to use a separate value-log file, but the mmap feature is not implemented for the value-log file. We should experiment with an mmap implementation for the value-log file and measure read performance.
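
A sketch of what the experiment might look like using the memmap2 crate (choosing memmap2 is an assumption; the experiment may pick another crate):

    use std::fs::File;

    // Requires the `memmap2` crate; using it here is an assumption.
    fn read_value_at(vlog_path: &str, offset: usize, len: usize) -> std::io::Result<Vec<u8>> {
        let file = File::open(vlog_path)?;
        // Safety: a Robt value-log is immutable once built, so the mapping is stable.
        let mmap = unsafe { memmap2::Mmap::map(&file)? };
        Ok(mmap[offset..offset + len].to_vec())
    }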
