tupl's Issues

Optimize View.count

The current implementation scans over each entry in the given range. An optimized version would use the counts stored in the bottom internal nodes, and it would also avoid loading all the keys within the range. Define an internal TreeCursor.countTo method, similar to the skip methods. To support the optional high key, position a second cursor at it, and stop counting when the leaf node of that second cursor is reached.
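
For context, a minimal sketch of the kind of linear scan the current implementation effectively performs, written against the public Cursor API (slowCount is a hypothetical helper and the exclusive high bound is an assumption; the optimized TreeCursor.countTo would consult the counts in the bottom internal nodes instead):

    import java.io.IOException;
    import java.util.Arrays;

    import org.cojen.tupl.Cursor;
    import org.cojen.tupl.View;

    // Hypothetical helper: visits every entry in [lowKey, highKey), which is
    // what makes the current approach O(n). autoload(false) at least avoids
    // loading the values.
    static long slowCount(View view, byte[] lowKey, byte[] highKey) throws IOException {
        Cursor c = view.newCursor(null); // null transaction: auto-commit, per-operation locking
        try {
            c.autoload(false); // only keys are needed for counting
            if (lowKey == null) {
                c.first();
            } else {
                c.findGe(lowKey);
            }
            long count = 0;
            while (c.key() != null) {
                if (highKey != null && Arrays.compareUnsigned(c.key(), highKey) >= 0) {
                    break; // reached the high bound
                }
                count++;
                c.next();
            }
            return count;
        } finally {
            c.reset();
        }
    }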

Support cache priming

If the process exits cleanly, persist information about the cache contents into a special data structure. Upon startup, the data structure is examined and the cache is rebuilt as it was at the last shutdown. Also support manually persisting this data, for snapshots and for applications which require direct shutdown control.

Dumping node ids directly doesn't preserve their relationships, and so this technique won't work. Tree traversal will work, using a cursor that only touches cached nodes. For every leaf node, pick a representative key to persist. Choose the middle key, and then prune it to be as small as possible while still being distinct from its nearest neighbors. Startup then finds these keys, indirectly creating the cached node structure as before.
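
A minimal sketch of the key-pruning step, assuming unsigned lexicographic key ordering (prunedKey is a hypothetical helper; the real implementation would operate on the node's own key encoding):

    import java.util.Arrays;

    // Hypothetical helper: shorten the chosen middle key to the smallest prefix
    // that still sorts strictly between its nearest neighbors, so that finding
    // it at startup lands in the same leaf node.
    static byte[] prunedKey(byte[] lowNeighbor, byte[] middleKey, byte[] highNeighbor) {
        for (int len = 1; len < middleKey.length; len++) {
            byte[] candidate = Arrays.copyOfRange(middleKey, 0, len);
            if (Arrays.compareUnsigned(candidate, lowNeighbor) > 0
                && Arrays.compareUnsigned(candidate, highNeighbor) < 0) {
                return candidate; // shortest prefix that is still distinct
            }
        }
        return middleKey; // cannot be shortened any further
    }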

RFC: Simplified (potentially slower) split management

A lot of special cases in Tupl mini deal with Splits.

The initial goal is to be able to measure search performance, which should be indicative of other performance opportunities or lack thereof.

Here is my suggestion:

A traversal down a tree always fixes a split at the level it was found (potentially moving it up one level). This can be done safely because there exists a moment when a node and its parent latches are both held.

There are lots of pros and cons from a performance point of view, and lots of variations in implementation details come into play as well. But it should be neutral to search performance in the absence of concurrent inserts.

My question is about correctness: am I missing something? The split bubbling up to the root node is one interesting case, but I don't see it being a blocker.

Here is some example pseudo-code of what a simplified search would look like in that model:

    while (!node->isLeaf()) {
        const auto in = static_cast<InternalNode*>(node);        
        const auto childIt = findChildNode(*in, key);

        Node* const childNode = *childIt;

        {
            Latch::scoped_exclusive_lock childLock{**childIt};

            if (childNode->split()) {
                bubbleSplitUpOneLevel(childNode, node);
                continue; // Parent lock is still held
            }

            // Destructor releases Parent's Latch
            std::swap(parentLock, childLock); 
        }

        CursorFrame frame; // TODO: bind with {*in, childIt};
        node = childNode;
    }

Rebalance nodes before splitting

In the average case, b-tree utilization is 70%, but it can be as bad as 50%. The current page splitting logic detects in-order inserts and achieves near 100% utilization in that case. An additional enhancement would rebalance a full node with its least utilized neighbor before splitting. This logic needs to latch up the tree, so it must give up quickly when latch acquisition fails and fall back to splitting as usual.

Support quick index truncation

A persistent workflow ensures the operation completes. It needs to be decided how this fits with transactional operations and how to make it perform well.

Use shared latches for cursor search and unbind operations

Currently, exclusive latches are used to bind and unbind frames to nodes. This reduces concurrency and adds lock contention overhead. In most cases, a full exclusive latch is not required. Instead, introduce a new latch state, stored in bit 30. When set, the thread can bind/unbind the frame. The implementation can simply spin, and potentially yield every so often while spinning.

Iterating over the cursor frames will need to be guarded by an exclusive latch, but iteration is usually performed as a result of making data changes in the node, so an exclusive latch must already be held.
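
A rough sketch of the bind/unbind protocol, assuming the latch state were a plain AtomicInteger (bit 30 is from the issue, but the helper names and surrounding state encoding are illustrative, not the real Latch implementation):

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative only: bit 30 marks "a frame is being bound or unbound".
    // A thread that wins the bit may adjust the node's frame list without an
    // exclusive latch; exclusive holders that want to iterate the frames must
    // wait for the bit to clear.
    static final int FRAME_BIND_BIT = 1 << 30;

    static void acquireFrameBindBit(AtomicInteger latchState) {
        while (true) {
            int state = latchState.get();
            if ((state & FRAME_BIND_BIT) == 0
                && latchState.compareAndSet(state, state | FRAME_BIND_BIT)) {
                return; // caller may now bind or unbind its frame
            }
            Thread.onSpinWait(); // spin; optionally Thread.yield() every so often
        }
    }

    static void releaseFrameBindBit(AtomicInteger latchState) {
        while (true) {
            int state = latchState.get();
            if (latchState.compareAndSet(state, state & ~FRAME_BIND_BIT)) {
                return;
            }
        }
    }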

Snapshot during compaction

If a snapshot is in progress while a compaction is also running, the file can shrink and discard pages intended for the snapshot. If the snapshot copy direction were reversed, the problem would be less likely to be observed. Regardless, when the page array is truncated, the dropped pages need to be copied into any open snapshots first.

Support temporary mode

A database opened in temporary mode never writes any redo log entries, and it never checkpoints. It only uses the database file to extend the storage beyond the cache limit. When reopening a temporary database, all changes are discarded.

An existing database, which was created in the regular mode, can be opened in temporary mode. Any changes will not persist when the database is opened again. This provides an alternative way of supporting issue #21, read-only mode. It allows a database to be opened and inspected, without risk of changing anything.

Reduce shared commit lock duration

The shared commit lock is held for a long time, especially when a fragmented value is being created. When the checkpointer requests the exclusive lock, all other requests for the shared lock are blocked.

Introduce a new type of lock, which is hooked into the FileIO or PageArray class. When an I/O operation is performed, assume it will take a long time and allow any shared lock waiters to barge ahead of the exclusive lock request.

Support counted b-trees

Internal node pointers are 8 bytes, but they should be reduced to 6 bytes. For bottom internal nodes, this frees up 2 bytes per child node reference, which is sufficient to maintain a count. Leaf nodes cannot have enough entries to overflow a 2-byte counter.
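
A small sketch of the packing this enables, with the child pointer and count sharing the existing 8-byte slot (helper names are hypothetical; the real node format is byte-oriented rather than long-oriented):

    // Hypothetical packing: the high 6 bytes hold the child node id and the low
    // 2 bytes hold that child's entry count. A 2-byte counter is enough because
    // a leaf node can never hold 65536 or more entries.
    static long packChildSlot(long childId, int entryCount) {
        return (childId << 16) | (entryCount & 0xFFFFL);
    }

    static long childId(long slot) {
        return slot >>> 16; // 48-bit (6-byte) node id
    }

    static int childEntryCount(long slot) {
        return (int) (slot & 0xFFFF);
    }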

Stripe the node usage list

The node usage list is roughly ordered from least to most recently used and is guarded by an exclusive latch. This is a concurrency bottleneck when the storage layer is very fast, especially when fronted by a secondary cache. With 64 threads on 32 CPU cores, each thread spends 95% of its time waiting for the latch. The usage list needs to be striped/partitioned to eliminate the bottleneck.
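
One possible shape for the striping, sketched with standard library locks (all names here are illustrative; the real usage list uses intrusive links and its own Latch class):

    import java.util.ArrayDeque;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative striped LRU list: each stripe has its own lock, so threads
    // touching unrelated nodes no longer contend on a single latch.
    final class StripedUsageList<N> {
        private final ReentrantLock[] locks;
        private final ArrayDeque<N>[] lists;

        @SuppressWarnings("unchecked")
        StripedUsageList(int stripes) { // stripes must be a power of two
            locks = new ReentrantLock[stripes];
            lists = new ArrayDeque[stripes];
            for (int i = 0; i < stripes; i++) {
                locks[i] = new ReentrantLock();
                lists[i] = new ArrayDeque<>();
            }
        }

        private int stripeFor(long nodeId) {
            long h = nodeId * 0x9E3779B97F4A7C15L; // spread the id bits
            return (int) ((h >>> 32) & (locks.length - 1));
        }

        // Move the node to the most-recently-used end of its stripe.
        void used(long nodeId, N node) {
            int s = stripeFor(nodeId);
            locks[s].lock();
            try {
                lists[s].remove(node); // O(n) here; intrusive links avoid this cost
                lists[s].addLast(node);
            } finally {
                locks[s].unlock();
            }
        }
    }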

Question about TreeCursorFrame.popAll

In the code segment listed below, it's not clear when actualNode != node.

Is there an expectation that acquiring the exclusive latch will change the value that is loaded?

    static void popAll(TreeCursorFrame frame) {
        outer: do {
            Node node = frame.mNode;
            while (true) {
                if (node == null) {
                    // Frame was not bound properly, suggesting that cursor is
                    // being cleaned up in response to an exception. Frame
                    // cannot be latched, so just go to the parent.
                    frame = frame.mParentFrame;
                    continue outer;
                }
                node.acquireExclusive();
                Node actualNode = frame.mNode;
                if (actualNode == node) {
                    frame = frame.pop();
                    node.releaseExclusive();
                    continue outer;
                }
                node.releaseExclusive();
                node = actualNode;
            }
        } while (frame != null);
    }

Corruption caused by large keys

First, limit the size of the key to ensure that at least two entries fit in a node. The largest possible fragmented value encoding is 16 bytes.

Even with these limits imposed, inserting large blank keys creates verification errors.

Support key prefix compression

The node encoding format already has placeholders for key compression, by managing prefixes up to 255 bytes in length. Finishing the implementation is a bit tricky when considering node splits and merges.

Support snapshot into a temp Database instance

Snapshots are intended only for copying to a local or remote file. It's not terribly difficult to open a snapshot as a read-only Database without performing the full copy. The longer the snapshot is open, however, the more disk space is consumed.

Potential snapshot deadlock

As of version 1.2.1, snapshots can copy from the internal cache. This can create a deadlock between a thread in the "capture" method and the thread writing the snapshot. The caller of capture is holding an exclusive latch on a node, and is waiting for the snapshot writer to write it. But the snapshot writer cannot get the shared latch while the capture method is holding it exclusively.

Support index size estimation

Support index (or view) size estimation via a new cursor method which sums the number of entries and bytes occupied. These values are multiplied by the number of entries in each parent node, up to the root. The values are stored in a special stats container object, and the estimate is bounded by a key range. Averaging over several random probes improves the estimate.
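
A sketch of the arithmetic only, assuming a hypothetical probe that records the leaf's entry count and the entry count of each ancestor on the way down:

    // Hypothetical extrapolation for a single probe: the leaf's entry count is
    // multiplied by the number of children referenced at each level above it.
    static double extrapolateOneProbe(int leafEntries, int[] entriesPerAncestorLevel) {
        double estimate = leafEntries;
        for (int n : entriesPerAncestorLevel) {
            estimate *= n; // multiply by the entry count of each parent node, up to the root
        }
        return estimate;
    }

    // Averaging several independent random probes reduces the variance.
    static double averageProbes(double[] probeEstimates) {
        double sum = 0;
        for (double e : probeEstimates) {
            sum += e;
        }
        return sum / probeEstimates.length;
    }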

Enhance dsync mode

Checkpoints must issue fsync calls, but this offers little control over how the writes are performed. For high write volume apps, checkpoints bog things down too much. I'd rather have the write load be spread around. Using dsync mode improves consistency, but by itself it's too slow. Using async I/O for writes improves write throughput, by ensuring that the drive always has a queue of operations to apply. Implementation can rely on threads, Java 7 async file channel, or use JNA to access OS-specific features.

Customizable PageArray

The PageArray class is internal, but it provides the lowest abstraction for page management. By supporting custom implementations, features like compression and striping can be implemented without requiring changes to Tupl itself.

Define viewSlice method

Now that skip operations can rely on counts, add a viewSlice method to the View interface.
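
One possible shape for the method, composed from the existing bounded views (whether the slice bounds are inclusive or exclusive is an assumption here):

    import org.cojen.tupl.View;

    // Hypothetical helper: a slice is a Ge/Lt composition of the existing
    // bounded views. A dedicated viewSlice method could instead carry the range
    // explicitly, so count and skip operations can use the stored entry counts.
    static View viewSlice(View view, byte[] lowKey, byte[] highKey) {
        return view.viewGe(lowKey).viewLt(highKey);
    }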

Corruption caused by concurrent reverse deletes

  • Thread0 is inserting entries in random order into index0
  • Thread1 is iterating over index0 and moving entries to index1
  • Thread2 is iterating over index1 in reverse and moving entries to index2

Sometimes this happens:

PANIC: Closing database due to unhandled exception: java.lang.AssertionError: Already in NodeMap: LeafNode: {id=290175, cachedState=3, isSplit=false, availableBytes=1364, extremity=0b0, latchState=org.cojen.tupl.Node@7cca494b[State = -2147483648, empty queue]}, LeafNode: {id=290175, cachedState=0, isSplit=false, availableBytes=4, extremity=0b0, latchState=org.cojen.tupl.Node@7ba4f24f[State = 0, empty queue]}, 290175
at org.cojen.tupl.LocalDatabase.nodeMapPut(LocalDatabase.java:2706)
at org.cojen.tupl.LocalDatabase.nodeMapPut(LocalDatabase.java:2685)
at org.cojen.tupl.LocalDatabase.doMarkDirty(LocalDatabase.java:3170)
at org.cojen.tupl.LocalDatabase.markDirty(LocalDatabase.java:3045)
at org.cojen.tupl.TreeCursor.notSplitDirty(TreeCursor.java:3626)
at org.cojen.tupl.TreeCursor.store(TreeCursor.java:2382)
at org.cojen.tupl.TreeCursor.doFindAndStore(TreeCursor.java:2116)
at org.cojen.tupl.TreeCursor.findAndStore(TreeCursor.java:2084)
at org.cojen.tupl.Tree.store(Tree.java:355)

When verifying the database:

Verification failure: index=1593915264256456714, node=290175, level=4: Child keys are not less than parent key: BottomInternalNode: {id=547076, cachedState=0, isSplit=false, availableBytes=7, extremity=0b0, latchState=org.cojen.tupl.Node@25f38edc[State = 0, empty queue]}

In most cases, the node in the assertion is not the one found by verification. The following causes have been ruled out:

  • Version 1.3 is not to blame. Problem was reproduced with version 1.2.
  • Node recycling is not to blame. Disabling it didn't prevent the problem.

It appears that a leaf node was deleted, but for some reason it was still referenced by the parent node. The reference also still exists in the node map, which is at least consistent.

Make Transaction class be an interface

Custom transaction implementations aren't possible because the class cannot be subclassed. It contains too many internal references to the database. Earlier concerns against defining an interface were with respect to performance. Considering that a check is made whenever a transaction instance is supplied to a method anyhow, an instanceof and cast operation won't add much (if any) overhead.

This change will require a version bump (1.3), because existing code will have been compiled to call class methods instead of interface methods, resulting in an IncompatibleClassChangeError. Java design fail, in my opinion.

Reduce cost of page deletion during fsync

By the time the checkpoint workflow calls fsync, all pages have been marked clean and are being aggressively re-dirtied by new writes. This causes pages to be deleted (and recycled), which in turn causes page queue nodes to be written. The fsync operation stalls I/O, especially writes. Since most nodes are clean and reside in the LRU region of the usage list, evict them and use them for page queue writes.

Implement HashIndex

A hash-based index layered over a b-tree should perform similarly to extendible hashing. It only makes sense to use such an index for relatively large keys, certainly larger than the hash code size.

Use a simple x31 hash with a scramble step to generate a 64-bit hash code. Encode the original key alongside the value, for key retrieval and collision detection. In the event of a collision, simply add one to the hash code and try again. This probing technique is cache-friendly.
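
A sketch of the hashing scheme described above; the x31 accumulation follows directly from the text, while the scramble constants are borrowed from a common 64-bit mix function and are an assumption, not the issue's specification:

    // Hypothetical 64-bit key hash: x31 accumulation over the key bytes,
    // followed by a scramble step to spread the bits. The original key is
    // stored alongside the value for retrieval and collision detection.
    static long hashKey(byte[] key) {
        long h = 0;
        for (byte b : key) {
            h = h * 31 + (b & 0xFF); // x31 step
        }
        h ^= h >>> 33;               // scramble (illustrative constants)
        h *= 0xFF51AFD7ED558CCDL;
        h ^= h >>> 33;
        h *= 0xC4CEB9FE1A85EC53L;
        h ^= h >>> 33;
        return h;
    }

    // On collision, add one to the hash code and try again. Successive probes
    // land on adjacent b-tree keys, which keeps probing cache-friendly.
    static long nextProbe(long hashCode) {
        return hashCode + 1;
    }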

findNearby performance improvement

When findNearby hits a node extremity, it blindly pops up to the parent node. Before doing so, attempt to latch the parent node. If successful, search the parent node. If the search lands at any position other than the same extremity, the original child node search was correct and the parent latch can be released. Otherwise, pop the child node and search upward again.

Checkpoint must ensure file length is large enough

When pages are allocated, they don't grow the file until the page is written. As a result, the free lists contain pages which don't really exist. When pages are removed from the free list, a check is made to ensure they are within the proper file bounds, which fails. The check can also be performed during initialization, when the headers are read from the file. In this case, an EOFException is thrown from the file read itself.

Either ensure the file length is large enough (which assumes it gets padded with zeros), or treat requests for out-of-bounds pages as blank. Or, the file could be expanded during initialization to its expected length. I prefer the approach where the file length is expanded before the checkpoint.

Snapshot should read from cache

Snapshots only read from the data file. Clean pages in the cache should be consulted first. If dirty pages can be written too, then snapshots can support non-durable databases as well.

Implement cached PageArray

A writeback cached page array should allocate all memory off heap, and track dirty pages in two sets. After syncing the pages, the sets swap. This PageArray is more than a cache; it's also backed by a file. All writes to the underlying file should be asynchronous, preferably using direct I/O.

Direct pointer access is also provided, and the pointers must be stable until the page is explicitly evicted. With this feature, the primary database cache can be very small, typically using less than 1MB of Java heap. The PageArray cache can be as large as possible without incurring any GC penalty.
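
A rough sketch of the dirty tracking only, with two sets that swap roles after each sync (the names are assumptions; the real implementation would also own the off-heap pages and issue asynchronous writes):

    import java.io.IOException;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative dirty-page tracking: pages dirtied after a sync begins land
    // in the other set, so the sync drains a stable snapshot of dirty page ids.
    final class DirtyPageTracker {
        interface PageWriter {
            void writePage(long pageId) throws IOException;
        }

        private Set<Long> activeDirty = ConcurrentHashMap.newKeySet();
        private Set<Long> syncingDirty = ConcurrentHashMap.newKeySet();

        void markDirty(long pageId) {
            activeDirty.add(pageId);
        }

        // Swap the sets, then write out everything that was dirty at the swap.
        synchronized void sync(PageWriter writer) throws IOException {
            Set<Long> toWrite = activeDirty;
            activeDirty = syncingDirty;
            syncingDirty = toWrite;
            for (long pageId : toWrite) {
                writer.writePage(pageId); // ideally asynchronous / direct I/O
            }
            toWrite.clear();
        }
    }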

This issue supersedes #13.

Don't generate tiny redo log files

If the database is used only for non-transactional operations, the redo log files serve no purpose. Instead, create new redo log files on demand, but create them eagerly if transactional operations have been observed. Switch back to lazy mode if the redo log file wasn't used.

The recovery code assumes that the set of redo log files is contiguous, with no gaps. A missing redo log file causes recovery to stop. Recovery needs to be modified such that it examines all discovered redo log files.

Support page-level compression

The easiest way to do this is to define a new CompressedPageArray class. It maps logical page ids to compressed pages, stored in another database. The logical and compressed databases must have different page sizes to be effective. The best ratio needs to be determined, but I think 512 byte pages would work best for the compressed database. If the logical page size is less than 4096, then the ratio should probably be swapped -- the compressed page should be 8 times larger than the logical page size.
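
A simplified sketch of the mapping, storing each compressed page in a Tupl index keyed by its 8-byte page id and using java.util.zip for the compression (the class name and layout are assumptions; the real CompressedPageArray would target a second database with the smaller page size, as described):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    import org.cojen.tupl.Index;

    // Illustrative only: logical pages are compressed and stored in an Index,
    // keyed by the page id encoded as 8 big-endian bytes.
    final class CompressedPages {
        private final Index storage;
        private final int pageSize;

        CompressedPages(Index storage, int pageSize) {
            this.storage = storage;
            this.pageSize = pageSize;
        }

        void writePage(long id, byte[] page) throws IOException {
            Deflater def = new Deflater();
            try {
                def.setInput(page);
                def.finish();
                byte[] buf = new byte[pageSize + 64]; // room for an incompressible page
                int len = def.deflate(buf);
                storage.store(null, keyFor(id), Arrays.copyOf(buf, len));
            } finally {
                def.end();
            }
        }

        byte[] readPage(long id) throws IOException {
            byte[] compressed = storage.load(null, keyFor(id));
            Inflater inf = new Inflater();
            try {
                inf.setInput(compressed);
                byte[] page = new byte[pageSize];
                inf.inflate(page);
                return page;
            } catch (DataFormatException e) {
                throw new IOException(e);
            } finally {
                inf.end();
            }
        }

        private static byte[] keyFor(long id) {
            byte[] key = new byte[8];
            for (int i = 7; i >= 0; i--, id >>>= 8) {
                key[i] = (byte) id;
            }
            return key;
        }
    }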

Reverse Order Lock Acquisition

I'm looking at TreeCursor.resetNode(Node root) (copied below for reference)

IIUC, we are popping the stack while traversing from a leaf toward the root; isn't acquiring the latches in this reverse order prone to deadlock? At the point reset has been called from TreeCursor.find, there are no other latches held that make this safe.

        while (true) {
            Node node = frame.acquireExclusive();
            TreeCursorFrame parent = frame.pop();

            if (parent == null) {
                // Usually the root frame refers to the root node, but it
                // can be wrong if the tree height is changing.
                if (node != root) {
                    node.releaseExclusive();
                    root.acquireExclusive();
                }
                return frame;
            }

            node.releaseExclusive();
            frame = parent;
        }

Remove support for prefix compression

The key encoding format reserves a bit to indicate that prefix compression has been applied, but the feature has been rejected. The code can be simplified by removing all of the 0x3f mask operations. Later, this bit can be used to support larger keys. The single-byte header will support keys 1..128 bytes in length, and the double-byte header allows for keys up to 32767 bytes. This change requires a version bump.

Also see issue #5.

Support transaction auto-commit mode

When a transaction is in auto-commit mode, it generates no undo log entries. This can be a useful performance enhancement if the last operation in the transaction is a large update or delete. Switch to auto-commit immediately before the final operation and then no undo log entry is generated.

This behavior can be emulated somewhat by switching the lock mode to UNSAFE after having acquired the lock on the record. Position the cursor to the record to update/delete, and then switch the lock mode to UNSAFE. The update/delete operation will not generate any undo log entry. The transaction will not fully commit at this point, nor will it release any locks. If the transaction is replayed at this point during recovery, it will break the atomicity of the transaction. The unsafe operation will commit but nothing else will.

The use of the UNSAFE trick can be made safe if the entire operation holds the commit lock. Perhaps the right approach is to introduce a new store-commit operation instead of a new transaction mode.
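
A sketch of the emulation described above, using the public API (this assumes a plain Index and shows only a final delete; as noted, recovery replaying the transaction at this point would break atomicity):

    import java.io.IOException;

    import org.cojen.tupl.Cursor;
    import org.cojen.tupl.Database;
    import org.cojen.tupl.Index;
    import org.cojen.tupl.LockMode;
    import org.cojen.tupl.Transaction;

    // Illustrative emulation: the find acquires the record lock, then the
    // delete is performed in UNSAFE mode so no undo log entry is generated.
    static void deleteWithoutUndo(Database db, Index ix, byte[] key) throws IOException {
        Transaction txn = db.newTransaction();
        try {
            Cursor c = ix.newCursor(txn);
            try {
                c.find(key);                   // acquires the lock on the record
                txn.lockMode(LockMode.UNSAFE); // final operation skips undo logging
                c.store(null);                 // delete; no undo log entry
            } finally {
                c.reset();
            }
            txn.commit();
        } finally {
            txn.reset(); // rolls back if commit wasn't reached
        }
    }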

Define Transaction.expose method

Define a method to downgrade all exclusive locks held by a transaction or scope. Changes become visible outside the transaction, but they can still be undone.

Support temporary indexes

A temporary index has no name, and it is deleted automatically when no longer referenced or after recovery. Deletion can be performed in a background thread to prevent recovery from being stalled.

If a temporary index is given a name, it becomes a normal index. A normal index cannot be made temporary by renaming it to null; this is illegal. A new method can be introduced which supports converting a normal index to a temporary one.

Reduce overhead of checkpoint by sweeping through dirty nodes twice

This approach smooths out the "bump" when issuing a checkpoint, and it allows dirty nodes to move to the MRU position when accessed. A new state is introduced, "mutable".

  1. At checkpoint, sweep through all dirty nodes, write them out, and mark them as mutable. When a mutable node is altered, its state becomes dirty again, but it doesn't acquire a new identifier.
  2. After the first checkpoint sweep, switch the active state (0/1). All dirty nodes with the old state are written first when modified, just like the current design. Mutable nodes with the old state don't need to be written, but a new identifier must still be acquired.
  3. The checkpoint now sweeps through the node list again, forcing all dirty nodes to be written and marked clean. Those which are mutable are also marked clean, but without the additional write.

Use 6-byte pointers in internal nodes

Node pointers are currently 8 bytes, but fragment nodes use 6-byte pointers. The largest database is limited by the block size and the 6-byte pointer size, and so 8-byte pointers just waste space.

Support fully direct mode

A fully direct mode doesn't have an explicit cache, but it does still have Node instances. Instead of pointing to pages of memory allocated by the process, they point to a mapped file. The mapping is actually to a block device, avoiding remapping problems when the file grows. The operating system is free to cache the mapping itself.

In this mode, node read and write operations don't really do anything. When a node is dirtied, the original contents are copied to the new page. This is a true copy-on-write mode.

  1. Must be using direct access mode.
  2. No cache is allocated, other than the Node instances themselves.
  3. All Node page pointers are initialized to null.
  4. File mapping is performed using JNA, to map a 64-bit size.
  5. Node.read sets the page pointer to id * pageSize + fileStartPointer
  6. Node dirtying copies memory from old page to new page.
  7. Node.write only calls prepareWrite, which only verifies that the page isn't split.
  8. Node.doEvict calls prepareWrite and sets the page pointer to null.
  9. Node compaction performs double copy instead of pointer swap.
  10. Calls to p_length must ask database for page size.

This mode cannot be used when encryption is enabled. Cached pages are required to hold decrypted content.

Snapshots as currently implemented will not work with direct mode.

Optimize ghost deletion

Deletion of ghosts when a transaction commits currently positions a cursor for each deleted entry. No cursor is required because:

  1. Each page is likely still dirty.
  2. Pages won't split as a result of deleting the ghost.
  3. Pages aren't likely to require a merge, because the ghost entry is small.

The implementation just needs to do a simple search down the tree, asserting the nodes are dirty, and only the leaf node needs an exclusive latch. If any nodes aren't dirty, attempt to latch upwards and bind a frame if necessary.

Define bounded Cursor skip methods

Define these Cursor methods, for improving the performance of bounded skips:

  • skipGe(long amount, byte[] limitKey)
  • skipGt(long amount, byte[] limitKey)
  • skipLe(long amount, byte[] limitKey)
  • skipLt(long amount, byte[] limitKey)

Implementation should take advantage of counts encoded in bottom internal nodes.
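
For comparison, a naive fallback is already possible with the existing bounded iteration methods. The sketch below assumes skipLe means "advance by amount, but not past limitKey (inclusive)", and simply iterates -- exactly the per-entry cost that the counted implementation would avoid:

    import java.io.IOException;

    import org.cojen.tupl.Cursor;

    // Naive fallback for illustration only; the real skipLe would skip whole
    // leaf nodes using the counts encoded in the bottom internal nodes.
    static long skipLeSlow(Cursor c, long amount, byte[] limitKey) throws IOException {
        long skipped = 0;
        while (skipped < amount && c.key() != null) {
            c.nextLe(limitKey); // next entry with key <= limitKey, else key becomes null
            if (c.key() == null) {
                break;
            }
            skipped++;
        }
        return skipped;
    }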

Define Cursor hint redo ops

When a cursor is used to perform update runs or bulk inserts, this usage is not reflected in the redo log. Instead, recovery applies the operations in the slow fashion -- searching for each entry.

Define hint operations, triggered when a cursor is used again for iteration (next, prev, etc.) and findNearby. The cursor assigns itself a sequence-generated or random id; 32 bits should suffice. The id is reset to zero when the cursor is fully reset. When the id is not zero, all redo operations generated by the cursor are tagged with the id.

When redo log records are applied, cursors are re-used if the operation was tagged. Store them in a simple hashtable. The collision chain pointer can be stored in the Cursor instance itself.

Because this is only a hint, redo processor should only ever call findNearby on re-used cursors. This allows it to be safely wrong.
