Updated TODO List, roughly in order of priority:
- Fix the bug related to disk array PIPs which I just encountered after adding a third copy to the 60M benchmark I've been running. (#3329)
- Optimize `InMemHashIndex` to not allocate the same number of slots as the persistent one, to reduce the memory footprint. The extra cost when merging probably makes it necessary to store the hash in the `InMemHashIndex` slots to avoid repeatedly re-computing it when dividing up in-memory slots. (#3373)
- Run `PrimaryKeyIndex::prepareCommit` in parallel.
- Simplify local storage by replacing the `unordered_map`-based local storage for small insertions/deletions with `HashIndexBuilder` (a small performance sacrifice for simplicity is acceptable here; additionally, unifying the implementations will help with small copies, given that the current merge algorithm is not well optimized for small copies and will write every single primary slot regardless of whether or not it needed to be modified).
- Allow `PrimaryKeyIndex`/`HashIndex` and `DiskArray` to initialize themselves, rather than relying on `createEmptyHashIndexFiles` (which uses the `InMemDiskArrayBuilder`; see below).
- Clean up the `InMemDiskArrayBuilder` used by `InMemHashIndex` to no longer be based on `DiskArray`, since it really just needs to be a custom `std::deque` with a reasonable block size. Or, at minimum, remove `BaseInMemDiskArray`.
- Experiment with compacting the slots during splitting and maintaining a linked list of empty overflow slots. This will reduce disk usage and may improve performance, since negative lookups (required to insert) have to scan the entire slot and all of its overflow slots; removing unnecessary overflow slots will speed up both lookups and inserts.
from kuzu.
Progress of the Hash Index Rewrite
Baseline
Note: some of these numbers are subject to change as I'm writing this from memory; I'll reproduce the results and update this later to be more accurate.
The benchmark I've been using is a copy of 60,000,000 consecutive integers from a CSV (produced by `seq 60000000`) into a node table containing a single primary key INT64 column. The data is stored in `/tmp` (tmpfs) and written to a database also in `/tmp`, to remove disk performance as a variable.
On the current master this takes roughly ~~2.2~~ 3.8 seconds (release mode) on my 6-core/12-thread Ryzen 5500 machine.
Flamegraphs unfortunately don't visualize well via GitHub here, and many names are truncated. Downloading the SVG file and opening it in a browser (in Firefox anyway; not sure about others) will let you zoom in to different sections.
The flamegraph of the copy operation (using a RelWithDebInfo build) shows that the bulk of this work is done in the copy from CSV and in `HashIndexBuilder::append` (both parallel; note that the profile includes total time, so parallel tasks are over-represented in comparison to serial ones). Flushing takes almost no time (0.18%; see note above about this being done in tmpfs).
Initial fairly naive implementation
(Edit: this is taking roughly 18 seconds in release mode)
I've started off by implementing a version which breaks down the hash index insertion into three parts.
1. Entries are inserted into the `HashIndexBuilder`, as in the baseline. The `HashIndexBuilder` is kept at a size that will match the eventual size of the on-disk index so that the primary slots are identical.
2. Instead of flushing the `HashIndexBuilder` directly to disk, the in-memory data is merged with the on-disk data:
    2.1. The on-disk index is resized to fit the new total number of entries.
    2.2. Each primary slot is updated one by one with the corresponding entries from the builder.

This really just has one optimization compared to using the existing `insertIntoPersistentIndex` function: it inserts entries in a single pass, writing once per slot instead of once per entry.
Compared to the baseline, there are two significant new sections in the flamegraph:
- Writes to disk are all done through the WAL file, so flushing the WAL file and clearing the buffer manager eviction queue are a significant non-parallel task (the tall chunk in the centre of the flamegraph).
- Step (2) from above (`HashIndex::mergeBulkInserts`) is done during `IndexBuilder::finalize`, which is serial, and is a significant cost seen on the left side of the graph (due to the over-representation of parallel tasks, I think this is probably the bulk of the work).
Optimizations
The bulk of the work in `HashIndex::mergeBulkInserts` appears to be writes through the buffer manager. By my calculations, the current numbers of writes are: 33 writes per page when resizing (the header and the page are each updated once for each of the 16 elements in the page, and the PIP is updated once when the page is added), and 16 writes per page when updating slots (one per slot). This also adds to the cost of clearing the eviction queue, since an entry is added for each write.
It should only be necessary to write, on average, slightly more than once per page: the data gets written once, the PIPs get cached and written only when full (or at the end), and the header gets cached and written once at the end.
The initial work on this is promising, even if it hasn't changed much yet:
The above flamegraph is from some WIP changes which cache the page from the primary slot disk array when iterating over and updating the slots, with the result being that the cost of `reserve` and `appendOverflowSlot` are now the dominant parts of `HashIndex::mergeBulkInserts`.

Currently working on (in various stages of completeness; these should hopefully make the initial copy comparable to the old implementation):
- Caching the index header used for write transactions in `BaseDiskArrayInternal` (#3109).
- Caching the last incomplete PIP (when writing) in-memory in `BaseDiskArrayInternal`.
- Adding a write iterator to `BaseDiskArrayInternal` to allow cached sequential updates to elements, only pinning/unpinning pages when moving from one page to another (including pushBacks, to speed up resize). (Done in my branch.)
TODO:
- Bypass the WAL file when writing pages new to the current transaction to the disk array (done in e092969, but only enabled for the hash index, since it wastes pages when rolling back; see #3146).
- Try running `HashIndex::prepareCommit` in parallel (it's necessary to synchronize the jobs before this is done so that all of the entries get added to the in-memory index, so it's hard to do this before `IndexBuilder::finalize`; I briefly experimented with OpenMP, but there was so much contention from the buffer manager locking that the performance was significantly worse).
- The `HashIndexBuilder` doesn't actually need to have the same primary slot layout as the final on-disk version. As long as the slot function is the same, we can take advantage of the fact that all entries for a given on-disk primary slot will come from the same in-memory primary slot; it's just that there may be multiple destination slots for the same source slot. If the hash value is stored alongside the keys in memory, it should be fairly fast to filter out the entries matching the current disk slot.
- The `HashIndexBuilder` is probably not well optimized for its new purpose. Before, it was necessary to have a layout that works well when flushed to disk, but that's no longer the case. I'd like to experiment with a different structure, e.g. `std::vector<SlotEntry<T>>` for each primary slot (the builder doesn't really need overflow slots and could also get away with variable-sized buffers for each primary slot).
Will be updated when I've gotten further optimizations fully implemented
WAL File Replaying
It does appear that with some optimizations the majority of the time is now being spent clearing the WAL and/or flushing the buffer manager. In its current state, the copy takes about 10 seconds (fourth item in the table below): the first ~2 seconds are spent creating the in-memory `HashIndexBuilder`, the next 1.5s (timed) doing `PrimaryKeyIndex::prepareCommit`, and the remaining ~6 seconds are presumably spent flushing the buffer manager and clearing the WAL (measured at 4.5s). I don't think this is represented very well in the flamegraphs. Presumably bypassing the WAL file will help (for the first copy, anyway).
I've also opened #3137, which reduces the cost of clearing the WAL in this benchmark to 3.2s.
Performance Summary
This will probably get a little messy since I'm listing the changes cumulatively; however, there are some things, like #3109, which are being merged first. See commits for details.
Changes (cumulative) | Copy time in tmpfs for 60 million INT64 from CSV | Second copy of 60 million INT64 |
---|---|---|
Baseline (4d21128) | 3.8s | - |
Naive resize + single-pass update (e8bb1f9) | ~17.1s | - |
With a write iterator caching the page being updated (6b4d510) | ~11.5s | - |
Caching the last DiskArray APPageIdx when appending (e16b1e0; this avoided an unnecessary read from the WAL file for each slot; it should be possible to cache the last PIP when writing, so this may end up being redundant) | ~10.6s | - |
With WAL replayer FileInfo cache (5eb42af) | ~9.2s | - |
After rebase (including #3109, though the write iterator mostly already handled that) (fed83ba) | ~8.8s | - |
With WAL file bypassing (e092969) | ~5.0s | - |
Caching PIP updates in the disk array (e45ef54) | ~4.8s | - |
A quick test of `mergeBulkInserts` using OpenMP (not included below; a new intermediate operator that takes the hash indexes with the bulk inserts stored in memory and flushes them in parallel before adding the COPY TABLE record to the WAL should work as a more permanent solution) | ~4.0s | - |
With multi-copy (9594367; slight regression, or maybe just random variation) | ~4.9s | ~14.5s |
Moved this out of the previous TODO list comment. I was thinking about this in the context of what was originally the last bullet point there (changing `splitSlots` to avoid re-hashing the same entries multiple times). It's not the highest priority yet, mostly because it will be complicated, but I think this might be reasonable to implement.
Possible implementation of single pass split and merge
When the hash index is non-empty and the copy is much larger than the current size of the index, `HashIndex::splitSlots` will currently rehash the same slots multiple times. It might be possible to rewrite it to do just one pass over the slots.
Further, it should be possible to combine this with the `mergeBulkInserts` function to do merging and splitting in a single pass.
Entries that get moved to a new slot could be moved into the `InMemHashIndex`. There would be roughly as many entries removed during rehashing as there are inserted from the `InMemHashIndex` for any given slot.
E.g. (assuming the slot is just the key modulo the number of slots):
In Memory
Slot 0 | Slot 1 |
---|---|
4 | 5 |
6 | 7 |
On Disk (note that in this version of the merging algorithm, slots 2 and 3 get added when they are filled, not ahead of time like the current algorithm)
Slot 0 | Slot 1 | Slot 2 | Slot 3 |
---|---|---|---|
0 | 1 | ||
2 | 3 |
When processing disk slot 0, entry 0 is left alone; entry 2 is kicked out and replaced with entry 4, while entry 2 can take entry 4's place in the in-memory index. Later, entry 2 and entry 6 will both be added to slot 2. In this case, though, it would actually make sense to process slot 0 and slot 2 simultaneously, like the `splitSlots` method in #3325, which would remove the need to move anything from disk into memory. In some cases it will still be necessary to do so, e.g.:
In Memory
Slot 0 | Slot 1 | Slot 2 | Slot 3 |
---|---|---|---|
8 | 9 | 6 | 7 |
12 | 13 | 10 | 11 |
16 | 17 | 14 | 15 |
On Disk (note that in the desired final configuration for 6 slots, `nextSplitSlotId` would be 2, and slots would be hashed modulo 4 with anything ending up in slot 0 or 1 being re-hashed modulo 8):
Slot 0 | Slot 1 | Slot 2 | Slot 3 | Slot 4 | Slot 5 |
---|---|---|---|---|---|
0 | 1 | ||||
2 | 3 | ||||
4 | 5 |
In this case, slot 0 and slot 2 could still be processed together, allowing entry 2 to be written to slot 2 while entries 8 and 16 go into slot 0; however, entry 4 needs to be moved out of slot 0 on-disk into slot 0 in-memory, as it will later need to be moved into slot 4 on-disk (along with entry 12).
Handling two slots at a time (when slots are being split) should reduce the likelihood that the in-memory index will need to be resized, since otherwise I think you end up moving more entries from disk into memory than the other way around.
Third example for the case where the on-disk index is larger and we start with a slot that doesn't need to be split:
In Memory
Slot 0 | Slot 1 |
---|---|
10 | 11 |
12 | 13 |
14 | 15 |
On Disk
Slot 0 | Slot 1 | Slot 2 | Slot 3 | Slot 4 | Slot 5 |
---|---|---|---|---|---|
0 | 1 | 2 | |||
4 | 3 | ||||
8 | 5 | 6 | |||
7 | |||||
9 |
In this case, Slot 0 was already split into slot 2 and slot 1 is over-full by comparison. Slot 0 gets processed first by itself and 4 gets moved into memory. 10 and 14 will later be stored in slot 2 and 4 will later be stored in slot 4.
Then slot 1 and slot 3 get processed, 3, 7, 11 and 15 get moved into slot 3 on disk, 1 and 9 stay put, and 5 gets moved into memory and will later go into slot 5 with 13.
Maybe it would be possible to choose a second slot to process along with the already-split slot 0, so that we can avoid having to increase the size of the in-memory slot 0 (since 4 gets moved into memory and nothing gets moved to disk when processing slot 0 alone). Slot 2 would be the obvious choice, since slots 1 and 3 will be processed next, so I think the second slot would be roughly `(1 << currentLevel) + slotId` (and once `slotId == nextSplitSlotId`, the second slot would be a new empty slot).