kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.

Home Page: https://kuzudb.com/

License: MIT License

Languages: C++ 79.01%, ANTLR 0.40%, C 0.85%, Python 5.84%, Shell 0.01%, Cypher 7.26%, CMake 0.99%, Makefile 0.11%, JavaScript 1.49%, Rust 2.07%, Java 1.98%
Topics: database, cypher, graph, embeddable, graph-database, graphdb, neo4j, nosql, embedded, olap

kuzu's Introduction



Kùzu

Kùzu is an embedded graph database built for query speed and scalability. Kùzu is optimized for handling complex join-heavy analytical workloads on very large databases, with the following core feature set:

  • Flexible Property Graph Data Model and Cypher query language
  • Embeddable, serverless integration into applications
  • Columnar disk-based storage
  • Columnar sparse row-based (CSR) adjacency list/join indices
  • Vectorized and factorized query processor
  • Novel and very fast join algorithms
  • Multi-core query parallelism
  • Serializable ACID transactions

Kùzu started as a research project at the University of Waterloo and is now being developed primarily by Kùzu Inc., a spinoff company from the University of Waterloo. Kùzu is available under a permissive license, so try it out and help us make it better! We welcome your feedback and feature requests.

Installation

  • Python: pip install kuzu
  • NodeJS: npm install kuzu
  • Rust: cargo add kuzu
  • Java: jar file
  • C/C++: precompiled binaries
  • CLI: precompiled binaries

To learn more about installation, see our Installation page.

Getting Started

Refer to our Getting Started page for your first example.

Build from Source

You can build from source using the instructions provided in the developer guide.

Contributing

We welcome contributions to Kùzu. If you are interested in contributing to Kùzu, please read our Contributing Guide.

License

By contributing to Kùzu, you agree that your contributions will be licensed under the MIT License.

Citing Kùzu

If you are a researcher and use Kùzu in your work, we encourage you to cite our work. You can use the following BibTeX citation:

@inproceedings{kuzu:cidr,
  author =  {Xiyang Feng and
             Guodong Jin and
             Ziyi Chen and
             Chang Liu and
             Semih Saliho\u{g}lu},
  title={K\`uzu Graph Database Management System},
  booktitle={CIDR},
  year={2023}
}
@misc{kuzu-github,
  author =  {Xiyang Feng and
             Guodong Jin and
             Ziyi Chen and
             Chang Liu and
             Semih Saliho\u{g}lu},
  title = {{K\`uzu Database Management System Source Code}},
  howpublished = {\url{https://github.com/kuzudb/kuzu}},
  month        = nov,
  year         = 2022
}

Contact Us

You can contact us at [email protected] or join our Discord community.

kuzu's People

Contributors

acquamarin, aesir777, alexander-beedie, andyfenghku, anuchak, ashleyhx, aziz-mu, benjaminwinger, g31pranjal, gaurav8297, hououou, kasunastony, lehners, manh9203, mewim, msebanc, mxwli, neeraj9, otoolemichael, prrao87, queryproc, ray6080, rfdavid, riolku, russell-liu, semihsalihoglu-uw, ted-wq-x, wenhoujx, yiyun-sj, zaddach


kuzu's Issues

Add Iterator for DataChunks to read tuples out of a DataChunks

Currently, the Sink operator only keeps track of the number of result tuples in the final output, but comparing only the number of tuples is not enough for proper tests.
To better support tests (and debugging), we need to add an iterator for DataChunks that outputs its contents one tuple (a vector of values) at a time, so that we can directly compare the contents of the result DataChunks with expected results or print out result tuples.

The returned values for one tuple are copied into a Tuple object, which holds a vector of Literal and provides a toString() function.
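
For reference, a minimal sketch of what such a Tuple could look like; the DataType and Literal definitions below are placeholders for illustration, not the actual kuzu types:

    #include <string>
    #include <vector>

    // Sketch only: DataType and Literal stand in for kuzu's internal types;
    // the real definitions will differ.
    enum class DataType { NODE_ID, INT64, BOOL, STRING };

    struct Literal {
        std::string strVal; // simplified; the real Literal would be a tagged union over DataType
        std::string toString() const { return strVal; }
    };

    class Tuple {
    public:
        explicit Tuple(const std::vector<DataType>& vectorTypes)
            : types{vectorTypes}, values(vectorTypes.size()) {}

        // Concatenate the column values, pipe-separated, for printing or test comparison.
        std::string toString() const {
            std::string result;
            for (auto& value : values) {
                result += value.toString() + "|";
            }
            return result;
        }

        std::vector<DataType> types;
        std::vector<Literal> values;
    };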

Usage example:

 auto vectorTypes = dataChunks.getVectorTypes();
 Tuple tuple(vectorTypes);
 DataChunksIterator dataChunksIterator(dataChunks);
 while (dataChunksIterator.hasNextTuple()) {
     dataChunksIterator.getNextTuple(tuple);
     cout << tuple.toString() << endl;
 }
...

 dataChunksIterator.setCurrentDataChunks(newDataChunks);

...

Simple Hash Table for Hash Join

The goal of this issue is to introduce a simple, non-parallel hash table for future integration with hash join. A parallel version will later be built on top of this implementation.

Storage Layout

The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and the directory (htDirectory).
Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves pointer space pointing to the previous tuple within the same hash slot; in this way, we implement chain-based collision handling.
For each tuple, the storage layout is: |key|payloads|prev|.
Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.
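
A rough sketch of this layout, using illustrative names and sizes (not the actual kuzu code), to show the |key|payloads|prev| tuple format and the chained directory slots:

    #include <cstdint>

    // Sketch only: illustrative constants and helpers, not kuzu's actual layout code.
    constexpr uint64_t NUM_BYTES_PER_NODE_ID = 8; // key: node offset (label handled elsewhere)
    constexpr uint64_t NUM_BYTES_PER_POINTER = 8; // trailing `prev` pointer for slot chaining

    // Each materialized tuple inside an htBlock has the layout |key|payloads|prev|.
    struct TupleLayout {
        uint64_t numBytesForPayloads; // variable-length payloads store a pointer into overflowBlocks
        uint64_t numBytesPerTuple() const {
            return NUM_BYTES_PER_NODE_ID + numBytesForPayloads + NUM_BYTES_PER_POINTER;
        }
    };

    // A directory slot points at the most recently inserted tuple hashed to it;
    // earlier tuples in the same slot are reached by following each tuple's `prev` pointer.
    struct HashSlot {
        uint8_t* lastTuple = nullptr;
    };

    // Chain-based collision handling: link the new tuple in front of the slot's current head.
    inline void insertIntoSlot(HashSlot& slot, uint8_t* tuple, uint64_t prevPtrOffset) {
        *reinterpret_cast<uint8_t**>(tuple + prevPtrOffset) = slot.lastTuple;
        slot.lastTuple = tuple;
    }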

Materialization of build side tuples

Hashing (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).

Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. This is a bidirectional query; the hash join happens on c.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b) and EXTEND(c). Thus, c might be a list chunk.
Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). The hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). Thus, b might be a flat chunk.

During materialization, there are two cases for a list chunk (see the sketch after this list):

  • if it's the key, we would duplicate tuples in build chunks;
  • else, each list vector is treated as a variable-sized value in the tuple.
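
A small sketch of the first case, using made-up value types rather than kuzu's vector classes: when the unflattened (list) vector is the key, one tuple is materialized per key, and the flat payload values are duplicated for each of them.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sketch only: FlatPayload stands in for the flat payload columns of one build tuple.
    struct FlatPayload { int64_t value; };

    // Case 1: the list vector is the join key. Each key in the list yields its own
    // materialized tuple, with the flat payload copied alongside it.
    std::vector<std::pair<int64_t, FlatPayload>> materializeListKey(
        const std::vector<int64_t>& keyList, const FlatPayload& flatPayload) {
        std::vector<std::pair<int64_t, FlatPayload>> tuples;
        tuples.reserve(keyList.size());
        for (auto key : keyList) {
            tuples.emplace_back(key, flatPayload); // duplicate the payload per key
        }
        return tuples;
    }
    // Case 2 (not shown): the list vector is a payload, so it is written once into
    // overflowBlocks as a single variable-sized value, and the tuple stores a pointer to it.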

Interface

Currently, the hash table only supports hashing on a single NodeID field, and we assume this is the main usage scenario for hash join, too.

    HashTable(MemoryManager& memoryManager, vector<PayloadInfo>& payloadInfos);
    // add a DataChunks into the hash table
    void addDataChunks(
        DataChunk& keyDataChunk, uint64_t keyVectorIdx, DataChunks& payloadDataChunks);
    // initialize and construct the hash table: allocate the HT directory
    void buildHTDirectory();

Usage example:

...
    while (right->hasNextMorsel()) {
        // fetch all data chunks from the right side, and add to ht
        right->getNextTuples();
        hashTable->addDataChunks(*rightKeyDataChunk, 0, buildDataChunks);
    }
    hashTable->buildHTDirectory();
...

Memory Management

Large intermediate memory allocation and de-allocation should be managed by a unified memory manager, which is separate from our buffer manager for now.
The memory manager limits memory usage to a specified maxMemory size.
If an allocation would exceed the limit, the memory manager should do one of the following:

  • throw an exception, which will be caught by the server so that the current query is killed gracefully.
  • evict unused memory blocks, and allow spilling to disk.

Currently, we add a simple memory manager that is only responsible for allocating memory blocks (releasing is done by the requester by destructing the block handle). If an allocation goes beyond the max memory size limit, an exception is thrown.
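
A minimal sketch of that behavior; MemoryManager, BlockHandle, and maxMemory here are illustrative names, not the actual kuzu interface:

    #include <cstdint>
    #include <memory>
    #include <stdexcept>

    // Sketch only: hands out blocks until a configured limit is reached, then throws.
    class MemoryManager {
    public:
        explicit MemoryManager(uint64_t maxMemory) : maxMemory{maxMemory} {}

        // The returned handle owns its buffer; the requester releases the memory by
        // destructing the handle. (A fuller version would also decrement usedMemory then.)
        struct BlockHandle {
            std::unique_ptr<uint8_t[]> data;
            uint64_t size = 0;
        };

        std::unique_ptr<BlockHandle> allocateBlock(uint64_t size) {
            if (usedMemory + size > maxMemory) {
                // No eviction or spilling yet: the exception is caught upstream
                // and the current query is killed gracefully.
                throw std::runtime_error("Memory limit exceeded in MemoryManager.");
            }
            usedMemory += size;
            auto handle = std::make_unique<BlockHandle>();
            handle->data = std::make_unique<uint8_t[]>(size);
            handle->size = size;
            return handle;
        }

    private:
        uint64_t maxMemory;
        uint64_t usedMemory = 0;
    };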

DO NOT USE THIS MEMORY MANAGER FOR OTHER OPERATORS FOR NOW
A more advanced one will be integrated later and described in a separate issue.

Parser module refactor

  • RelType -> RelLabel
  • Split big cpp file e.g. transformer.cpp
  • Optimize bazel build dependencies
  • Get rid of setter methods and use constructors instead
  • Get rid of getter methods by making members public (if we know the member will only be read)
  • Avoid using iterators in for loops
  • Understand whether move(string) is necessary or not

Skipped Cypher Features

This issue lists Cypher features we skipped during parser implementation.

Graph Pattern

  • Multi Edge Type
    • [e1:knows|:likes]
  • Multi Node Label
    • (a:Person:Student)

Expression

  • List Operator
    • [0, 1, 2, 3]
    • [0.. 3]
    • x IN [0, 1, 2, 3]
  • Case
    • CASE ... WHEN ... THEN ... ELSE ... END
  • Parameter
    • $param
  • List Comprehension
    • [x IN range(0, 10) WHERE x % 2 = 0 | x^3 ]
  • Pattern Comprehension
    • MATCH (a:Person {name: 'Keanu Reeves'}) RETURN [(a)-->(b) WHERE b:Movie | b.released] AS years
  • RelationshipsPattern
    • WHERE NOT (r:Recipe)-[:INCLUDES]->(excluded:Ingredient)
  • Hex Integer
  • Exponent Decimal (scientific notation)

List of Features for Phase 1

Just to keep our thoughts a bit organized, I am listing the set of features that need to be implemented before we can show the system to someone:

Phase 1 Features

  • Unstructured Properties (Storage, QP)
  • Nulls with null bits (QP)
  • Remove existing work on Updates (Storage, Frontend)
  • Remove LOAD CSV (for batch updates), CREATE, DELETE, REMOVE
  • Variable-length joins with minimum and maximum (min...max). (QP) (#525 the faster version could go to phase 2)
  • exists(property) (Frontend, QP)
  • Sorting for cyclic queries (QP) + WCO joins.
  • Aggregations (With special operators for aggregating over nodes) (QP, Frontend)
  • DISTINCT, UNION, and UNION ALL
  • LIMIT, SKIP
  • ORDER BY (ASC/DESC)
  • WITH clause
  • Optional MATCH (Frontend, QP)
  • Profiler/Explain (Frontend)
  • CLI to issue queries
  • Scalable RETURN all unstructured properties of a node/rel. #416
  • primary key #216
  • Cross Product
  • Memory-only columns/adjacency lists.
  • A robust memory or buffer manager that will allow us to not fail when an operator uses a lot of memory. Change any operator that accumulates data to use the buffer manager. Make FactorizedTable and OrderBy disk-based.
  • String Functions #538
  • Numeric Functions #539
  • List data type & List functions (#535)
  • Loader and Storage TODOs #504
  • Add a way to ingest parameters to queries.
  • Main Client Drivers? We could ask JPMorgan but I think it makes sense to have Java and Python to start with
  • Transactions Phase 1: #560
  • DDL Support: CREATE NODE/REL TABLE and COPY FROM CSV
    • DROP TABLES
    • SHOW TABLES;
    • DESCRIBE TABLE
  • Exists & Semi Hash Join
  • Optional Match & Left Hash Join
  • Multiple Join Nodes
  • Binary ASP Join
  • Multiway ASP Join (Worst Case Optimal Joins)
  • IndexScan (scan of nodes based on primary key (and refactor of hash indices))
  • Configure bmSize and numThreads from API and CLI (#773)
  • Pip installation: pip install kuzudb
  • Multi-thread execution in Python APIs (#561)
  • User Documentation: How to set up and use the system, examples of queries, supported Cypher clauses.
  • Performance optimizations: This includes a broad range of optimizations we might want to do, from expression rewriting to buffer manager optimizations to data ingest optimizations, among others we will think of. This item and the benchmarks-against-other-systems item are related: we want to ensure our scalability and performance are good enough compared to other systems, so we need to keep improving performance until we look good against them. An example is this: #583.
  • Testing of the system: This involves setting up many end-to-end tests and can also be part of setting up a benchmarking infrastructure with a cron job that keeps track of the system's daily performance. I like what was described here.
  • Usability Features #433
  • Website construction
  • Logo design

Rewrite of (WHERE a.isProperty) to (WHERE a.isProperty == TRUE)

a.isProperty needs to be rewritten as a.isProperty == TRUE.
Currently this is not applied as a filter.
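
One possible shape for the rewrite in the frontend, sketched with made-up expression classes (not the actual kuzu binder types):

    #include <memory>
    #include <utility>

    // Sketch only: minimal expression classes to illustrate rewriting
    // `WHERE a.isProperty` into `WHERE a.isProperty == TRUE`.
    enum class DataType { BOOL, INT64, STRING };
    enum class ExpressionType { PROPERTY, LITERAL_BOOL, EQUALS };

    struct Expression {
        ExpressionType type{};
        DataType dataType{};
        std::shared_ptr<Expression> left, right;
    };

    // If the WHERE predicate is a bare BOOL property, wrap it in an equality
    // comparison against TRUE so it is planned as an ordinary filter predicate.
    std::shared_ptr<Expression> rewriteWherePredicate(std::shared_ptr<Expression> expr) {
        if (expr->type == ExpressionType::PROPERTY && expr->dataType == DataType::BOOL) {
            auto trueLiteral = std::make_shared<Expression>();
            trueLiteral->type = ExpressionType::LITERAL_BOOL;
            trueLiteral->dataType = DataType::BOOL;
            auto equals = std::make_shared<Expression>();
            equals->type = ExpressionType::EQUALS;
            equals->dataType = DataType::BOOL;
            equals->left = std::move(expr);
            equals->right = std::move(trueLiteral);
            return equals;
        }
        return expr;
    }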

Update the test under queries/filtered/paths.test from:

-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent == TRUE
---- 6

to:

-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent
---- 6

And ensure it passes.

Self-loop edge

Consider the following example: MATCH (a:Person)-[r:KNOWS]->(a:Person).

The current enumerator generates the plan Scan(Person)-Extend(KNOWS edge to Person label) without understanding that fromNode and toNode should be the same node. A filter on nodeID should be appended.
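
For illustration, the appended filter could evaluate a predicate like the one below on each extended tuple (NodeID and the function name are made up, not the enumerator's actual API):

    #include <cstdint>

    // Sketch only: compare the node matched by Scan with the node produced by Extend
    // when both are bound to the same query variable `a`.
    struct NodeID {
        uint64_t label;
        uint64_t offset;
    };

    inline bool selfLoopPredicate(const NodeID& scannedA, const NodeID& extendedA) {
        return scannedA.label == extendedA.label && scannedA.offset == extendedA.offset;
    }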

Change shared_ptr to unique_ptr in enumerator

The current enumerator uses shared_ptr for LogicalOperator because multiple operators may share the same previous operator.
Consider the following example, (a)->(b)->(c): when enumerating size = 2, E(a) and E(c) will both point to S(b).

The right thing to do is to use raw pointers during enumeration. When the enumerator outputs plans, we copy each plan and use unique_ptr, as sketched below.
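
A sketch of that two-phase ownership model with made-up names (not the actual enumerator classes):

    #include <memory>
    #include <vector>

    // Sketch only. During enumeration, operators reference shared predecessors
    // through raw pointers and take no ownership.
    struct LogicalOperator {
        LogicalOperator* prevOperator = nullptr;
    };

    // An emitted plan owns a private deep copy of its operator chain.
    struct LogicalPlan {
        std::vector<std::unique_ptr<LogicalOperator>> ownedOperators;
        LogicalOperator* lastOperator = nullptr;
    };

    // When the enumerator outputs a plan, clone the raw-pointer chain bottom-up so
    // each copied operator points at the previous copy, and the plan owns all of them.
    inline LogicalPlan emitPlan(const LogicalOperator* lastOp) {
        std::vector<const LogicalOperator*> chain;
        for (auto* op = lastOp; op != nullptr; op = op->prevOperator) {
            chain.push_back(op);
        }
        LogicalPlan plan;
        LogicalOperator* prevCopy = nullptr;
        for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
            auto cloned = std::make_unique<LogicalOperator>(**it);
            cloned->prevOperator = prevCopy;
            prevCopy = cloned.get();
            plan.ownedOperators.push_back(std::move(cloned));
        }
        plan.lastOperator = prevCopy;
        return plan;
    }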

This is not a performance degradation because we will eventually replace the enumerator with an optimizer, and the optimizer outputs only a single plan. The enumerator is for internal usage, so copying multiple plans is fine.

Microbenchmark on computing hashes in the hash table build phase

During the hash table build phase, there are two options for computing hashes:

  • Compute hash values when appending input data chunks. The computation is performed directly on the input key data chunk, and the resulting hashes are stored into htBlocks as a field of each tuple.
  • Compute hash values when building the directory. The computation happens when reading the key from each tuple in htBlocks. (We don't need to store hashes as a field of tuples.)

Currently, we take the latter approach, but it is unclear whether the first approach would bring more benefit. It might be worth doing microbenchmarks.

Unnecessary Scans/Reads -> Extra I/O

Consider the following query:

MATCH (a)-e1->(b)
RETURN e1.time;

Currently, we consider two possible QVOs, [a, b] and [b, a], and therefore generate two plans:

P1: Scan(a) → Extend(b) → ScanEdgeProperty(e1.time) → Project([e1.time])

OR

P2: Scan(b) → Extend(a) → ScanEdgeProperty(e1.time) → Project([e1.time])

The extension to (b) comes from the fact that we tend to cover all of the query edges and vertices blindly regardless of what the query actually requests. The two plans should actually be:

P1: Scan(a) → ScanEdgeProperty(e1.time) → Project([e1.time])

OR

P2: Scan(b) → ScanEdgeProperty(e1.time) → Project([e1.time])

In other cases, where we don't actually need any properties at all, we should simply get the size from the Lists header and therefore not pin/unpin unnecessarily. The query:

MATCH (a)-e1->(b)
RETURN count(*);

Should generate:

P1: Scan(a) → GetListSize(e1) → GROUP BY (COUNT(*))
P2: Scan(b) → GetListSize(e1) → GROUP BY (COUNT(*))

We need to be more aware of the actual properties needed. ID is just another property we do a join on.

Cleaner API for enumerator

Provide a cleaner API for appendOperator(Plan&), which ideally takes only a plan as a parameter; the plan contains all the necessary information.

Match any label and label pruning

  • MATCH (n) is currently interpreted as n having label -1 (match everything). The right thing to do is to replace -1 with a vector containing all node labels in the system.
  • In the optimizer, we should do label pruning to handle the above case.

HashJoin Enumeration

Currently, we enumerate a HashJoin when the left and right plans each have at least 2 Extends. This works if the two Extends are many-to-many. In the case of a many-to-one (column) Extend, we should not enumerate a HashJoin (the plan still works with a HashJoin, but it can always be replaced with an Extend).

The right thing to do is to enumerate a HashJoin only when the left and right plans each have at least 2 many-to-many Extends.

Fix Server serialization

The execution of a serialized LogicalPlan in the server is no longer supported, and the interface will be fixed once we add an enumerator.

Understand copy elision

Suppose

unique_ptr<T> functionB();

void functionA() {
   setT(move(functionB()));
}

The move() disables copy elision. Understand what the consequences of this are and whether it is the right thing to do.
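
For comparison, a minimal self-contained version without the extra move(): functionB() returns a prvalue, so it can initialize setT's parameter directly, whereas wrapping it in move() forces the temporary to be materialized and then move-constructed (the same redundant-move pattern in a return statement is the classic case that suppresses copy elision).

    #include <memory>

    struct T {};

    std::unique_ptr<T> functionB() {
        return std::make_unique<T>();
    }

    void setT(std::unique_ptr<T> t) {
        // takes ownership of t
    }

    void functionA() {
        // functionB() is a prvalue, so it initializes setT's parameter directly;
        // wrapping it in std::move() only forces an extra move construction.
        setT(functionB());
    }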

Separate constants from types in types.h

Currently, in types.h, we have system-level types such as gf_string_t, nodeID_t, etc., and we also have constants such as MAX_VECTOR_SIZE, NUM_BYTES_PER_NODE_ID, etc.
It would be clearer to separate the constants from the types into a separate constants.h file as more system-level constants become available.
