
Comments (3)

benjaminwinger commented on July 23, 2024

Updated TODO List, roughly in order of priority:

  • Fix the bug related to disk array PIPs which I just encountered after adding a third copy to the 60M benchmark I've been running. (#3329)
  • Optimize InMemHashIndex to not allocate the same number of slots as the persistent one, to reduce the memory footprint. The extra cost when merging probably makes it necessary to store the hash in the InMemHashIndex slots, to avoid repeatedly re-computing hashes when dividing up the in-memory slots. (#3373)
  • Run PrimaryKeyIndex::prepareCommit in parallel
  • Simplify local storage by replacing the unordered_map-based local storage for small insertions/deletions with HashIndexBuilder (a small performance sacrifice for simplicity is acceptable here; additionally, unifying the implementations will help with small copies, given that the current merge algorithm is not well optimized for small copies and will write every single primary slot regardless of whether it needed to be modified).
  • Allow PrimaryKeyIndex/HashIndex and DiskArray to initialize themselves, rather than relying on createEmptyHashIndexFiles (which uses the InMemDiskArrayBuilder, see below).
  • Clean up the InMemDiskArrayBuilder used by InMemHashIndex to no longer be based on DiskArray, since it really just needs to be a custom std::deque with a reasonable block size. Or, at minimum, remove BaseInMemDiskArray.
  • Experiment with compacting the slots during splitting and maintaining a linked list of empty overflow slots. This will reduce disk usage and may improve performance: negative lookups (required when inserting) have to scan the entire slot and all of its overflow slots, so removing unnecessary overflow slots will speed up both lookups and inserts.


benjaminwinger commented on July 23, 2024

Progress of the Hash Index Rewrite

Baseline

Note: some of these numbers are subject to change, as I'm writing this from memory; I'll reproduce the results and update this later to be more accurate.

The benchmark I've been using is a copy of 60,000,000 consecutive integers (produced by seq 60000000) from a CSV into a table containing a single INT64 primary key column. The source data is stored in /tmp (tmpfs), and the database is written to /tmp as well, to remove disk performance as a variable.
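For reference, the setup looks roughly like this through kuzu's C++ API (a sketch: exact constructor signatures may differ between versions, and the table/path names are my own; the CSV comes from `seq 60000000 > /tmp/numbers.csv`):

```cpp
#include <memory>
#include "kuzu.hpp"

int main() {
    // Database in tmpfs, so disk performance is not a variable.
    auto database = std::make_unique<kuzu::main::Database>("/tmp/benchdb");
    auto connection = std::make_unique<kuzu::main::Connection>(database.get());
    // A single INT64 primary key column; COPY exercises the hash index build.
    connection->query("CREATE NODE TABLE nums(id INT64, PRIMARY KEY (id))");
    connection->query("COPY nums FROM '/tmp/numbers.csv'");
    return 0;
}
```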

On the current master this takes roughly 3.8 seconds (release mode) on my 6-core/12-thread Ryzen 5500 machine.

Baseline Flamegraph

(image: flamegraph-baseline)

Flamegraphs unfortunately don't display well on GitHub, and many names are truncated. Downloading the SVG file and opening it in a browser (Firefox, anyway; I'm not sure about others) will let you zoom in to different sections.

The flamegraph of the copy operation (using a RelWithDebInfo build) shows that the bulk of the work is done in the copy from CSV and in HashIndexBuilder::append (both parallel; note that the profile includes total time, so parallel tasks are over-represented in comparison to serial ones). Flushing takes almost no time (0.18%; see the note above about this being done in tmpfs).

Initial fairly naive implementation

Naive Implementation Flamegraph

(image: flamegraph)

(Edit: this is taking roughly 18 seconds in release mode)

I've started off by implementing a version which breaks down the hash index insertion into three parts.

  1. Entries are inserted into the HashIndexBuilder, as in the baseline. The HashIndexBuilder is kept at a size that will match the eventual size of the on-disk index so that the primary slots are identical.
  2. Instead of flushing the HashIndexBuilder directly to disk, the in-memory data is merged with the on-disk data:
    2.1. The on-disk index is resized to fit the new total number of entries.
    2.2. Each primary slot is updated one by one with the corresponding entries from the builder.

This really just has one optimization compared to using the existing insertIntoPersistentIndex function: it inserts entries in a single pass, writing once per slot instead of once per entry.
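As a minimal, self-contained sketch of that flow (illustrative stand-ins, not the actual kuzu classes; the point is that step 2.2 does one read and one write per slot rather than one write per entry):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// A primary slot together with its overflow chain, flattened for simplicity.
using Slot = std::vector<uint64_t>;

int main() {
    // Step 1: the in-memory index (HashIndexBuilder stand-in) is kept at the
    // same slot count as the eventual on-disk index, so slots pair up 1:1.
    std::vector<Slot> builder = {{4, 6}, {5, 7}};
    // Step 2.1: the on-disk index (stand-in) has already been resized to fit
    // the new total number of entries.
    std::vector<Slot> disk = {{0, 2}, {1, 3}};

    // Step 2.2: a single pass over the primary slots.
    for (size_t slotId = 0; slotId < disk.size(); slotId++) {
        Slot slot = disk[slotId]; // one read per slot
        slot.insert(slot.end(), builder[slotId].begin(), builder[slotId].end());
        disk[slotId] = std::move(slot); // one write per slot
    }

    for (size_t slotId = 0; slotId < disk.size(); slotId++) {
        std::cout << "slot " << slotId << ':';
        for (uint64_t key : disk[slotId]) std::cout << ' ' << key;
        std::cout << '\n';
    }
    return 0;
}
```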

Compared to the baseline, there are two significant new sections in the flamegraph:

  • Writes to disk are all done through the WAL file, so flushing the WAL file and clearing the buffer manager eviction queue are a significant non-parallel task (the tall chunk in the centre of the flamegraph).
  • Step (2) from above (HashIndex::mergeBulkInserts) is done during IndexBuilder::finalize, which is serial, and is a significant cost seen on the left side of the graph (given the over-representation of parallel tasks, I think this is probably the bulk of the work).

Optimizations

The bulk of the work in HashIndex::mergeBulkInserts appears to be writes through the buffer manager. By my calculations, the current write counts are: 33 writes per page when resizing (the header and the page are each updated once for each of the 16 elements in the page, and when the page is added the PIP is updated once), and 16 writes per page when updating slots (one per slot). This also adds to the cost of clearing the eviction queue, since an entry is added for each write.
It should only be necessary to write, on average, slightly more than once per page: the data gets written once, the PIPs get cached and written only when full (or at the end), and the header gets cached and written at the end.
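Spelled out, the per-page write counts above are:

$$\underbrace{16}_{\text{header}} + \underbrace{16}_{\text{page data}} + \underbrace{1}_{\text{PIP}} = 33 \text{ writes/page (resize)}, \qquad 16 \text{ writes/page (slot updates)},$$

against a target of roughly one write per page.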

The initial work on this is promising, even if it hasn't changed much yet:

Flamegraph after caching pages when updating

(image: flamegraph2)

The above flamegraph is from some WIP changes which cache the page from the primary slot disk array when iterating over and updating the slots, with the result that the costs of `reserve` and `appendOverflowSlot` are now the dominant parts of `HashIndex::mergeBulkInserts`.

Currently working on (in various stages of completeness; these should hopefully make the initial copy comparable to the old implementation):

  • Caching the index header used for write transactions in BaseDiskArrayInternal (#3109)
  • Caching the last incomplete PIP (when writing) in-memory in the BaseDiskArrayInternal
  • Adding a write iterator to BaseDiskArrayInternal to allow cached sequential updates to elements while only pinning/unpinning pages when moving from one page to another (including pushBacks to speed up resize). Done in my branch; a rough sketch of the idea follows this list.
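A minimal sketch of such a write iterator (illustrative stand-ins for the buffer manager and disk array, not the kuzu interfaces): consecutive element updates re-pin a page only when they cross a page boundary.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr size_t ELEMENTS_PER_PAGE = 16;
using Page = std::array<uint64_t, ELEMENTS_PER_PAGE>;

// Stand-in for the buffer manager + disk array pair.
struct PagedFile {
    std::vector<Page> pages;
    Page* pin(size_t pageIdx) { return &pages[pageIdx]; } // fetch for update
    void unpin(size_t /*pageIdx*/) { /* mark dirty, enqueue for flush, ... */ }
};

class WriteIterator {
    PagedFile& file;
    size_t pageIdx = SIZE_MAX;
    Page* page = nullptr;

public:
    explicit WriteIterator(PagedFile& f) : file{f} {}
    ~WriteIterator() {
        if (page) file.unpin(pageIdx);
    }
    uint64_t& operator[](size_t elemIdx) {
        size_t newPageIdx = elemIdx / ELEMENTS_PER_PAGE;
        if (newPageIdx != pageIdx) { // only re-pin when the page changes
            if (page) file.unpin(pageIdx);
            pageIdx = newPageIdx;
            page = file.pin(pageIdx);
        }
        return (*page)[elemIdx % ELEMENTS_PER_PAGE];
    }
};

int main() {
    PagedFile file{std::vector<Page>(4)};
    {
        WriteIterator it{file};
        for (size_t i = 0; i < 64; i++) it[i] = i; // 4 pins total, not 64
    }
    std::printf("last element: %llu\n",
                (unsigned long long)file.pages[3][ELEMENTS_PER_PAGE - 1]);
    return 0;
}
```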

TODO:

  • Bypass the WAL file when writing pages that are new to the current transaction to the disk array (done in e092969, but only enabled for the hash index, since it wastes pages when rolling back; see #3146).

  • Try running HashIndex::prepareCommit in parallel (it's necessary to synchronize the jobs before this is done so that all of the entries get added to the in-memory index, so it's hard to do this before IndexBuilder::finalize; I briefly experimented with OpenMP, but there was so much contention from the buffer manager locking that performance was significantly worse).

  • The HashIndexBuilder doesn't actually need to have the same primary slot layout as the final on-disk version. As long as the slot function is the same, we can take advantage of the fact that all entries for a given primary slot on-disk will come from the same primary slot in-memory; it's just that there may be multiple destination slots for the same source slot. If the hash value is stored alongside the keys in-memory, it should be fairly fast to filter out the entries matching the current disk slot (see the sketch after this list).

  • The HashIndexBuilder is probably not well-optimized for its new purpose. Before, it needed a layout that works well when flushed to disk, but that's no longer necessary. I'd like to experiment with using a different structure, e.g. std::vector<SlotEntry<T>> for each primary slot (the builder doesn't really need overflow slots and could also get away with variable-sized buffers for each primary slot).
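A sketch of the filtering idea from the third item above (illustrative names, not the kuzu API; for simplicity the cached hash here is just the key itself). diskSlotFor is the usual linear-hashing slot choice: hash modulo 2^level, re-hashed modulo 2^(level+1) for slots that have already been split.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

struct Entry {
    uint64_t key;
    uint64_t hash; // cached at insert time, so merging never re-computes it
};

uint64_t diskSlotFor(uint64_t hash, uint64_t level, uint64_t nextSplitSlotId) {
    uint64_t slot = hash & ((1ull << level) - 1);  // modulo 2^level
    if (slot < nextSplitSlotId)                    // this slot was split:
        slot = hash & ((1ull << (level + 1)) - 1); // modulo 2^(level+1)
    return slot;
}

int main() {
    // A 6-slot configuration: hashed modulo 4, slots 0 and 1 re-hashed
    // modulo 8. One in-memory slot can feed several disk slots; here we
    // pull out only the entries destined for disk slot 4.
    std::vector<Entry> inMemSlot = {{8, 8}, {12, 12}, {16, 16}};
    uint64_t level = 2, nextSplitSlotId = 2, targetDiskSlot = 4;
    for (const Entry& e : inMemSlot) {
        if (diskSlotFor(e.hash, level, nextSplitSlotId) == targetDiskSlot) {
            std::cout << "key " << e.key << " -> slot " << targetDiskSlot << '\n';
        }
    }
    return 0;
}
```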

I'll update this when I've gotten further optimizations fully implemented.

WAL File Replaying

It does appear that, with some optimizations, the majority of the time is now being spent clearing the WAL and/or flushing the buffer manager. In its current state, it's taking about 10 seconds (fourth item in the table below): the first ~2 seconds are spent creating the in-memory HashIndexBuilder, the next 1.5s (timed) doing PrimaryKeyIndex::prepareCommit, and the remaining ~6 seconds are presumably spent flushing the buffer manager and clearing the WAL (measured at 4.5s). I don't think this is represented very well in the flamegraphs. Presumably bypassing the WAL file will help (for the first copy, anyway).

I've also opened #3137, which reduces the cost of clearing the WAL in this benchmark to 3.2s.

Performance Summary

This will probably get a little messy since I'm listing the changes cumulatively; however, there are some things, like #3109, which are being merged first. See the commits for details.

| Changes (cumulative) | Copy time in tmpfs for 60 million INT64 from CSV | Second copy of 60 million INT64 |
| --- | --- | --- |
| Baseline (4d21128) | 3.8s | - |
| Naive resize + single-pass update (e8bb1f9) | ~17.1s | - |
| With a write iterator caching the page being updated (6b4d510) | ~11.5s | - |
| Caching the last DiskArray APPageIdx when appending (this caused an unnecessary read from the WAL file for each slot; it should be possible to cache the last PIP when writing, so this may end up being redundant) (e16b1e0) | ~10.6s | - |
| With WAL replayer FileInfo cache (5eb42af) | ~9.2s | - |
| After rebase (including #3109, though the write iterator mostly already handled that) (fed83ba) | ~8.8s | - |
| With WAL file bypassing (e092969) | ~5.0s | - |
| Caching PIP updates in the DiskArray (e45ef54) | ~4.8s | - |
| A quick test of mergeBulkInserts using OpenMP; a new intermediate operator that takes the hash indexes with the bulk inserts stored in memory and flushes them in parallel before adding the COPY TABLE record to the WAL should work as a more permanent solution (not included below) | ~4.0s | - |
| With multi-copy (9594367; slight regression, or maybe just random variation) | ~4.9s | ~14.5s |


benjaminwinger commented on July 23, 2024

Moved this out of the previous TODO list comment. I was thinking about this in the context of what was originally the last bullet point there (changing splitSlots to avoid re-hashing the same entries multiple times). It's not the highest priority yet, mostly because it will be complicated, but I think this might be reasonable to implement.

Possible implementation of single pass split and merge

When the hash index is non-empty and the copy is much larger than the current size of the index, HashIndex::splitSlots will currently rehash the same slots multiple times; it might be possible to rewrite it to do just one pass over the slots.
Further, it should be possible to combine this with the mergeBulkInserts function to do merging and splitting in a single pass.

Entries that get moved to a new slot could be moved into the InMemHashIndex. There would be roughly as many entries removed during rehashing as there are inserted from the InMemHashIndex for any given slot.
E.g. (assuming the hash function is just the modulus operator):
In Memory

| Slot 0 | Slot 1 |
| --- | --- |
| 4 | 5 |
| 6 | 7 |

On Disk (note that in this version of the merging algorithm, slots 2 and 3 get added when they are filled, not ahead of time as in the current algorithm):

| Slot 0 | Slot 1 | Slot 2 | Slot 3 |
| --- | --- | --- | --- |
| 0 | 1 |  |  |
| 2 | 3 |  |  |

When processing disk slot 0, entry 0 is left alone; entry 2 is kicked out and replaced with entry 4, and entry 2 can take entry 4's place in the in-memory index. Later, entry 2 and entry 6 will both be added to slot 2. In this case, though, it would actually make sense to process slot 0 and slot 2 simultaneously, like the splitSlots method in #3325, which would remove the need to move anything from disk into memory; in some cases, however, it will still be necessary to do so. E.g.:

In Memory

| Slot 0 | Slot 1 | Slot 2 | Slot 3 |
| --- | --- | --- | --- |
| 8 | 9 | 6 | 7 |
| 12 | 13 | 10 | 11 |
| 16 | 17 | 14 | 15 |

On Disk (note that in the desired final configuration for 6 slots, the nextSplitSlotId would be 2, and slots would be hashed modulo 4, with anything ending up in slot 0 or 1 being re-hashed modulo 8):

| Slot 0 | Slot 1 | Slot 2 | Slot 3 | Slot 4 | Slot 5 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 |  |  |  |  |
| 2 | 3 |  |  |  |  |
| 4 | 5 |  |  |  |  |

In this case, slot 0 and slot 2 could still be processed together, allowing entry 2 to be written to slot 2 while entries 8 and 16 go into slot 0; however, entry 4 needs to be moved out of slot 0 on-disk into slot 0 in-memory, as it will later need to be moved into slot 4 on-disk (along with entry 12).
Handling two slots at a time (when slots are being split) should reduce the likelihood that the in-memory index will need to be resized, as otherwise I think you end up moving more entries from disk into memory than the other way around.

Third example for the case where the on-disk index is larger and we start with a slot that doesn't need to be split:

In Memory

| Slot 0 | Slot 1 |
| --- | --- |
| 10 | 11 |
| 12 | 13 |
| 14 | 15 |

On Disk

| Slot 0 | Slot 1 | Slot 2 | Slot 3 | Slot 4 | Slot 5 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2 |  |  |  |
| 4 | 3 | 6 |  |  |  |
| 8 | 5 |  |  |  |  |
|  | 7 |  |  |  |  |
|  | 9 |  |  |  |  |

In this case, slot 0 was already split into slot 2, and slot 1 is over-full by comparison. Slot 0 gets processed first, by itself, and entry 4 gets moved into memory; entries 10 and 14 will later be stored in slot 2, and entry 4 will later be stored in slot 4.
Then slot 1 and slot 3 get processed: entries 3, 7, 11 and 15 get moved into slot 3 on disk, entries 1 and 9 stay put, and entry 5 gets moved into memory and will later go into slot 5 with entry 13.

Maybe it would be possible to choose a second slot to process along with the already-split slot 0, so that we can avoid having to increase the size of the in-memory slot 0 (since entry 4 gets moved into memory and nothing gets moved to disk when slot 0 is processed alone). Slot 2 would be the obvious choice, since slots 1 and 3 will be processed next, so I think the second slot would be roughly (1 << currentLevel) + slotId (and once slotId == nextSplitSlotId, the second slot would be a new empty slot). A rough sketch of this pairing is below.
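A minimal sketch of that pairing rule, using the first example's growth from 2 to 4 slots (illustrative names, not the kuzu implementation):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    uint64_t currentLevel = 1; // slots currently hashed modulo 2
    uint64_t numDiskSlots = 2; // physical slots before this merge
    // Each slot at the current level is processed together with the slot
    // its entries split into; split targets past the end of the physical
    // slots are allocated on demand, when they are first filled.
    for (uint64_t slotId = 0; slotId < (1ull << currentLevel); slotId++) {
        uint64_t splitTarget = (1ull << currentLevel) + slotId;
        bool targetIsNew = splitTarget >= numDiskSlots;
        std::cout << "process slot " << slotId << " together with "
                  << (targetIsNew ? "new" : "existing") << " slot "
                  << splitTarget << '\n';
    }
    return 0;
}
```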

