kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.

Home Page: https://kuzudb.com/

License: MIT License

Languages: C++ 79.01%, ANTLR 0.40%, C 0.85%, Python 5.84%, Shell 0.01%, Cypher 7.26%, CMake 0.99%, Makefile 0.11%, JavaScript 1.49%, Rust 2.07%, Java 1.98%
Topics: database, cypher, graph, embeddable, graph-database, graphdb, neo4j, nosql, embedded, olap

kuzu's Introduction



Kùzu

Kùzu is an embedded graph database built for query speed and scalability. Kùzu is optimized for handling complex join-heavy analytical workloads on very large databases, with the following core feature set:

  • Flexible Property Graph Data Model and Cypher query language
  • Embeddable, serverless integration into applications
  • Columnar disk-based storage
  • Columnar sparse row-based (CSR) adjacency list/join indices
  • Vectorized and factorized query processor
  • Novel and very fast join algorithms
  • Multi-core query parallelism
  • Serializable ACID transactions

Kùzu started as a research project at the University of Waterloo and is now being developed primarily by Kùzu Inc., a spinoff company from the University of Waterloo. Kùzu is available under a permissive license, so try it out and help us make it better! We welcome your feedback and feature requests.

Installation

  • Python: pip install kuzu
  • NodeJS: npm install kuzu
  • Rust: cargo add kuzu
  • Java: jar file
  • C/C++: precompiled binaries
  • CLI: precompiled binaries

To learn more about installation, see our Installation page.

Getting Started

Refer to our Getting Started page for your first example.

Build from Source

You can build from source using the instructions provided in the developer guide.

Contributing

We welcome contributions to Kùzu. If you are interested in contributing to Kùzu, please read our Contributing Guide.

License

By contributing to Kùzu, you agree that your contributions will be licensed under the MIT License.

Citing Kùzu

If you are a researcher and use Kùzu in your work, we encourage you to cite our work. You can use the following BibTeX citation:

@inproceedings{kuzu:cidr,
  author =  {Xiyang Feng and
             Guodong Jin and
             Ziyi Chen and
             Chang Liu and
             Semih Saliho\u{g}lu},
  title={K\`uzu Graph Database Management System},
  booktitle={CIDR},
  year={2023}
}
@misc{kuzu-github,
  author =  {Xiyang Feng and
             Guodong Jin and
             Ziyi Chen and
             Chang Liu and
             Semih Saliho\u{g}lu},
  title = {{K\`uzu Database Management System Source Code}},
  howpublished = {\url{https://github.com/kuzudb/kuzu}},
  month        = nov,
  year         = 2022
}

Contact Us

You can contact us at [email protected] or join our Discord community.

kuzu's People

Contributors

acquamarin, aesir777, alexander-beedie, andyfenghku, anuchak, ashleyhx, aziz-mu, benjaminwinger, g31pranjal, gaurav8297, hououou, kasunastony, lehners, manh9203, mewim, msebanc, mxwli, neeraj9, otoolemichael, prrao87, queryproc, ray6080, rfdavid, riolku, russell-liu, semihsalihoglu-uw, ted-wq-x, wenhoujx, yiyun-sj, zaddach


kuzu's Issues

Add Iterator for DataChunks to read tuples out of a DataChunks

Currently, the Sink operator only keeps track of the number of result tuples in the final output, but comparing only the number of tuples is not enough for proper tests.
To better support tests (and debugging), we need to add an iterator for DataChunks that outputs its contents one tuple (a vector of values) at a time, so that we can directly compare the contents of the result DataChunks with expected results or print out result tuples.

The returned values for one tuple are copied into a Tuple object, which holds a vector of Literal and provides a toString() function.
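
For reference, a minimal sketch of what such a Tuple could look like; the DataType and Literal definitions below are placeholders for illustration, not the actual kuzu types:

    #include <string>
    #include <vector>

    // Sketch only: DataType and Literal stand in for kuzu's internal types;
    // the real definitions will differ.
    enum class DataType { NODE_ID, INT64, BOOL, STRING };

    struct Literal {
        std::string strVal; // simplified; the real Literal would be a tagged union over DataType
        std::string toString() const { return strVal; }
    };

    class Tuple {
    public:
        explicit Tuple(const std::vector<DataType>& vectorTypes)
            : types{vectorTypes}, values(vectorTypes.size()) {}

        // Concatenate the column values, pipe-separated, for printing or test comparison.
        std::string toString() const {
            std::string result;
            for (auto& value : values) {
                result += value.toString() + "|";
            }
            return result;
        }

        std::vector<DataType> types;
        std::vector<Literal> values;
    };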

Usage example:

 auto vectorTypes = dataChunks.getVectorTypes();
 Tuple tuple(vectorTypes);
 DataChunksIterator dataChunksIterator(dataChunks);
 while (dataChunksIterator.hasNextTuple()) {
     dataChunksIterator.getNextTuple(tuple);
     cout << tuple.toString() << endl;
 }
...

 dataChunksIterator.setCurrentDataChunks(newDataChunks);

...

Simple Hash Table for Hash Join

The goal of this issue is to introduce a simple, non-parallel hash table for future integration with hash join. A parallel version will later be built on top of this implementation.

Storage Layout

The storage of a simple hash table is divided into two parts: tuple blocks (htBlocks) and the directory (htDirectory).
Tuple blocks hold all materialized tuples, and the directory contains slots with pointers. Inside tuple blocks, each tuple reserves pointer space pointing to the previous tuple within the same hash slot; in this way, we implement chain-based collision handling.
For each tuple, the storage layout is: |key|payloads|prev|.
Inside the tuple payloads, we might have variable-length items, whose content will be stored separately in the overflowBlocks.
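
A rough sketch of this layout, using illustrative names and sizes (not the actual kuzu code), to show the |key|payloads|prev| tuple format and the chained directory slots:

    #include <cstdint>

    // Sketch only: illustrative constants and helpers, not kuzu's actual layout code.
    constexpr uint64_t NUM_BYTES_PER_NODE_ID = 8; // key: node offset (label handled elsewhere)
    constexpr uint64_t NUM_BYTES_PER_POINTER = 8; // trailing `prev` pointer for slot chaining

    // Each materialized tuple inside an htBlock has the layout |key|payloads|prev|.
    struct TupleLayout {
        uint64_t numBytesForPayloads; // variable-length payloads store a pointer into overflowBlocks
        uint64_t numBytesPerTuple() const {
            return NUM_BYTES_PER_NODE_ID + numBytesForPayloads + NUM_BYTES_PER_POINTER;
        }
    };

    // A directory slot points at the most recently inserted tuple hashed to it;
    // earlier tuples in the same slot are reached by following each tuple's `prev` pointer.
    struct HashSlot {
        uint8_t* lastTuple = nullptr;
    };

    // Chain-based collision handling: link the new tuple in front of the slot's current head.
    inline void insertIntoSlot(HashSlot& slot, uint8_t* tuple, uint64_t prevPtrOffset) {
        *reinterpret_cast<uint8_t**>(tuple + prevPtrOffset) = slot.lastTuple;
        slot.lastTuple = tuple;
    }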

Materialization of build side tuples

Hashing (based on the join key) can happen on both a flat vector and a list. For the join key, we only consider equality of nodeID for now (no join on property values, and no multiple join conditions).

Query example for join on a flat vector: (a)->(b)->(c)->(d)->(e) WHERE a.id=1 AND e.id=1. This is a bidirectional query; the hash join happens on c.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b) and EXTEND(c). Thus, c might be a list chunk.
Query example for join on a list: (a)->(b)->(c), (b)->(d)->(e). The hash join happens on b.id, and the build side is the subquery (a)->(b)->(c), which starts from SCAN(a), then EXTEND(b). Thus, b might be a flat chunk.

During materialization, there are two cases for a list chunk (see the sketch after this list):

  • if it's the key, we would duplicate tuples in build chunks;
  • else, each list vector is treated as a variable-sized value in the tuple.
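
A small sketch of the first case, using made-up value types rather than kuzu's vector classes: when the unflattened (list) vector is the key, one tuple is materialized per key, and the flat payload values are duplicated for each of them.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sketch only: FlatPayload stands in for the flat payload columns of one build tuple.
    struct FlatPayload { int64_t value; };

    // Case 1: the list vector is the join key. Each key in the list yields its own
    // materialized tuple, with the flat payload copied alongside it.
    std::vector<std::pair<int64_t, FlatPayload>> materializeListKey(
        const std::vector<int64_t>& keyList, const FlatPayload& flatPayload) {
        std::vector<std::pair<int64_t, FlatPayload>> tuples;
        tuples.reserve(keyList.size());
        for (auto key : keyList) {
            tuples.emplace_back(key, flatPayload); // duplicate the payload per key
        }
        return tuples;
    }
    // Case 2 (not shown): the list vector is a payload, so it is written once into
    // overflowBlocks as a single variable-sized value, and the tuple stores a pointer to it.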

Interface

Currently, the hash table only supports hashing on a single NodeID field, and we assume this is the main usage scenario for hash join, too.

    HashTable(MemoryManager& memoryManager, vector<PayloadInfo>& payloadInfos);
    // add a DataChunks into the hash table
    void addDataChunks(
        DataChunk& keyDataChunk, uint64_t keyVectorIdx, DataChunks& payloadDataChunks);
    // initialize and construct the hash table: allocate the HT directory
    void buildHTDirectory();

Usage example:

...
    while (right->hasNextMorsel()) {
        // fetch all data chunks from the right side, and add to ht
        right->getNextTuples();
        hashTable->addDataChunks(*rightKeyDataChunk, 0, buildDataChunks);
    }
    hashTable->buildHTDirectory();
...

Memory Management

Large intermediate memory allocation and de-allocation should be managed by a unified memory manager, which is separate from our buffer manager for now.
The memory manager limits memory usage to a specified maxMemory size.
If an allocation would exceed the limit, the memory manager should do one of the following:

  • throw an exception, which will be caught by the server so that the current query is killed gracefully.
  • evict unused memory blocks, and allow spilling to disk.

Currently, we add a simple memory manager that is only responsible for allocating memory blocks (releasing is done by the requester by destructing the block handle). If an allocation goes beyond the max memory size limit, an exception is thrown.
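
A minimal sketch of that behavior; MemoryManager, BlockHandle, and maxMemory here are illustrative names, not the actual kuzu interface:

    #include <cstdint>
    #include <memory>
    #include <stdexcept>

    // Sketch only: hands out blocks until a configured limit is reached, then throws.
    class MemoryManager {
    public:
        explicit MemoryManager(uint64_t maxMemory) : maxMemory{maxMemory} {}

        // The returned handle owns its buffer; the requester releases the memory by
        // destructing the handle. (A fuller version would also decrement usedMemory then.)
        struct BlockHandle {
            std::unique_ptr<uint8_t[]> data;
            uint64_t size = 0;
        };

        std::unique_ptr<BlockHandle> allocateBlock(uint64_t size) {
            if (usedMemory + size > maxMemory) {
                // No eviction or spilling yet: the exception is caught upstream
                // and the current query is killed gracefully.
                throw std::runtime_error("Memory limit exceeded in MemoryManager.");
            }
            usedMemory += size;
            auto handle = std::make_unique<BlockHandle>();
            handle->data = std::make_unique<uint8_t[]>(size);
            handle->size = size;
            return handle;
        }

    private:
        uint64_t maxMemory;
        uint64_t usedMemory = 0;
    };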

DO NOT USE THIS MEMORY MANAGER FOR OTHER OPERATORS FOR NOW
A more advanced one will be integrated later and described in a separate issue.

Parser module refactor

  • RelType -> RelLabel
  • Split big cpp file e.g. transformer.cpp
  • Optimize bazel build dependencies
  • Get rid of setter methods and use constructors instead
  • Get rid of getter methods by making members public (if we know the member will only be read)
  • Avoid using iterators in for loops
  • Understand whether move(string) is necessary or not

Skipped Cypher Features

This issue lists Cypher features we skipped during parser implementation.

Graph Pattern

  • Multi Edge Type
    • [e1:knows|:likes]
  • Multi Node Label
    • (a:Person:Student)

Expression

  • List Operator
    • [0, 1, 2, 3]
    • [0.. 3]
    • x IN [0, 1, 2, 3]
  • Case
    • CASE ... WHEN ... THEN ... ELSE ... END
  • Parameter
    • $param
  • List Comprehension
    • [x IN range(0, 10) WHERE x % 2 = 0 | x^3 ]
  • Pattern Comprehension
    • MATCH (a:Person {name: 'Keanu Reeves'}) RETURN [(a)-->(b) WHERE b:Movie | b.released] AS years
  • RelationshipsPattern
    • WHERE NOT (r:Recipe)-[:INCLUDES]->(excluded:Ingredient)
  • Hex Integer
  • Exponent Decimal (scientific notation)

List of Features for Phase 1

Just to keep our thoughts a bit organized, I am listing the set of features that need to be implemented before we can show the system to someone:

Phase 1 Features

  • Unstructured Properties (Storage, QP)
  • Nulls with null bits (QP)
  • Remove existing work on Updates (Storage, Frontend)
  • Remove LOAD CSV (for batch updates), CREATE, DELETE, REMOVE
  • Variable-length joins with minimum and maximum (min...max). (QP) (#525 the faster version could go to phase 2)
  • exists(property) (Frontend, QP)
  • Sorting for cyclic queries (QP) + WCO joins.
  • Aggregations (With special operators for aggregating over nodes) (QP, Frontend)
  • DISTINCT, UNION, and UNION ALL
  • LIMIT, SKIP
  • ORDER BY (ASC/DESC)
  • WITH clause
  • Optional MATCH (Frontend, QP)
  • Profiler/Explain (Frontend)
  • CLI to issue queries
  • Scalable RETURN all unstructured properties of a node/rel. #416
  • primary key #216
  • Cross Product
  • Memory-only columns/adjacency lists.
  • A robust memory or buffer manager that will allow us to not fail when an operator uses a lot of memory. Change any operator that accumulates data to use the buffer manager. Make FactorizedTable and OrderBy disk-based.
  • String Functions #538
  • Numeric Functions #539
  • List data type & List functions (#535)
  • Loader and Storage TODOs #504
  • Add a way to ingest parameters to queries.
  • Main Client Drivers? We could ask JPMorgan but I think it makes sense to have Java and Python to start with
  • Transactions Phase 1: #560
  • DDL Support: CREATE NODE/REL TABLE and COPY FROM CSV
    • DROP TABLES
    • SHOW TABLES;
    • DESCRIBE TABLE
  • Exists & Semi Hash Join
  • Optional Match & Left Hash Join
  • Multiple Join Nodes
  • Binary ASP Join
  • Multiway ASP Join (Worst Case Optimal Joins)
  • IndexScan (scan of nodes based on primary key (and refactor of hash indices))
  • Configure bmSize and numThreads from API and CLI (#773)
  • Pip installation: pip install kuzudb
  • Multi-thread execution in Python APIs (#561)
  • User Documentation: How to set up and use the system, examples of queries, supported Cypher clauses.
  • Performance optimizations: This includes a broad range of optimizations we might want to do, from expression rewriting to buffer manager optimizations to data ingest optimizations, among others we will think of. This item and the benchmarks-against-other-systems item are related: we want to ensure our scalability and performance are good enough compared to other systems, so we need to keep improving performance until we look good against them. An example is this: #583.
  • Testing of the system: This involves setting up many end-to-end tests and can also be part of setting up a benchmarking infrastructure with a cron job that keeps track of the system's daily performance. I like what was described here.
  • Usability Features #433
  • Website construction
  • Logo design

Rewrite of (WHERE a.isProperty) to (WHERE a.isProperty == TRUE)

a.isProperty needs to be rewritten as a.isProperty == TRUE.
Currently this is not applied as a filter.
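
One possible shape for the rewrite in the frontend, sketched with made-up expression classes (not the actual kuzu binder types):

    #include <memory>
    #include <utility>

    // Sketch only: minimal expression classes to illustrate rewriting
    // `WHERE a.isProperty` into `WHERE a.isProperty == TRUE`.
    enum class DataType { BOOL, INT64, STRING };
    enum class ExpressionType { PROPERTY, LITERAL_BOOL, EQUALS };

    struct Expression {
        ExpressionType type{};
        DataType dataType{};
        std::shared_ptr<Expression> left, right;
    };

    // If the WHERE predicate is a bare BOOL property, wrap it in an equality
    // comparison against TRUE so it is planned as an ordinary filter predicate.
    std::shared_ptr<Expression> rewriteWherePredicate(std::shared_ptr<Expression> expr) {
        if (expr->type == ExpressionType::PROPERTY && expr->dataType == DataType::BOOL) {
            auto trueLiteral = std::make_shared<Expression>();
            trueLiteral->type = ExpressionType::LITERAL_BOOL;
            trueLiteral->dataType = DataType::BOOL;
            auto equals = std::make_shared<Expression>();
            equals->type = ExpressionType::EQUALS;
            equals->dataType = DataType::BOOL;
            equals->left = std::move(expr);
            equals->right = std::move(trueLiteral);
            return equals;
        }
        return expr;
    }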

Update the test under queries/filtered/paths.test from:

-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent == TRUE
---- 6

to:

-NAME OneHopKnowsFilteredTest3
-QUERY MATCH (a:person)-[e1:knows]->(b:person) WHERE a.isStudent
---- 6

And ensure it passes.

Self-loop edge

Consider the following example: MATCH (a:Person)-[r:KNOWS]->(a:Person).

The current enumerator generates the plan Scan(Person)-Extend(KNOWS edge to Person label) without understanding that fromNode and toNode should be the same node. A filter on nodeID should be appended.
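
For illustration, the appended filter could evaluate a predicate like the one below on each extended tuple (NodeID and the function name are made up, not the enumerator's actual API):

    #include <cstdint>

    // Sketch only: compare the node matched by Scan with the node produced by Extend
    // when both are bound to the same query variable `a`.
    struct NodeID {
        uint64_t label;
        uint64_t offset;
    };

    inline bool selfLoopPredicate(const NodeID& scannedA, const NodeID& extendedA) {
        return scannedA.label == extendedA.label && scannedA.offset == extendedA.offset;
    }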

Change shared_ptr to unique_ptr in enumerator

The current enumerator uses shared_ptr for LogicalOperator because multiple operators may share the same previous operator.
Consider the following example, (a)->(b)->(c): when enumerating size = 2, E(a) and E(c) will both point to S(b).

The right thing to do is to use raw pointers during enumeration. When the enumerator outputs plans, we copy each plan and use unique_ptr, as sketched below.
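
A sketch of that two-phase ownership model with made-up names (not the actual enumerator classes):

    #include <memory>
    #include <vector>

    // Sketch only. During enumeration, operators reference shared predecessors
    // through raw pointers and take no ownership.
    struct LogicalOperator {
        LogicalOperator* prevOperator = nullptr;
    };

    // An emitted plan owns a private deep copy of its operator chain.
    struct LogicalPlan {
        std::vector<std::unique_ptr<LogicalOperator>> ownedOperators;
        LogicalOperator* lastOperator = nullptr;
    };

    // When the enumerator outputs a plan, clone the raw-pointer chain bottom-up so
    // each copied operator points at the previous copy, and the plan owns all of them.
    inline LogicalPlan emitPlan(const LogicalOperator* lastOp) {
        std::vector<const LogicalOperator*> chain;
        for (auto* op = lastOp; op != nullptr; op = op->prevOperator) {
            chain.push_back(op);
        }
        LogicalPlan plan;
        LogicalOperator* prevCopy = nullptr;
        for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
            auto cloned = std::make_unique<LogicalOperator>(**it);
            cloned->prevOperator = prevCopy;
            prevCopy = cloned.get();
            plan.ownedOperators.push_back(std::move(cloned));
        }
        plan.lastOperator = prevCopy;
        return plan;
    }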

This is not a performance degradation because we will eventually replace the enumerator with an optimizer, and the optimizer outputs only a single plan. The enumerator is for internal usage, so copying multiple plans is fine.

Microbenchmark on computing hashes in the hash table build phase

During the hash table build phase, there are two options for computing hashes:

  • Compute hash values when appending input data chunks. The computation is performed directly on the input key data chunk, and the resulting hashes are stored into htBlocks as a field of each tuple.
  • Compute hash values when building the directory. The computation happens when reading the key from each tuple in htBlocks. (We don't need to store hashes as a field of tuples.)

Currently, we take the latter approach, but it is unclear whether the first approach would bring more benefit. It might be worth doing microbenchmarks.

Unnecessary Scans/Reads -> Extra I/O

Consider the following query:

MATCH (a)-e1->(b)
RETURN e1.time;

Currently, we consider two possible QVOs, [a, b] and [b, a], and therefore generate two plans:

P1: Scan(a) → Extend(b) → ScanEdgeProperty(e1.time) → Project([e1.time])

OR

P2: Scan(b) → Extend(a) → ScanEdgeProperty(e1.time) → Project([e1.time])

The extension to (b) comes from the fact that we tend to cover all of the query edges and vertices blindly regardless of what the query actually requests. The two plans should actually be:

P1: Scan(a) → ScanEdgeProperty(e1.time) → Project([e1.time])

OR

P2: Scan(b) → ScanEdgeProperty(e1.time) → Project([e1.time])

In other cases, where we don't actually need any properties at all, we should simply get the size from the Lists header and therefore not pin/unpin unnecessarily. The query:

MATCH (a)-e1->(b)
RETURN count(*);

Should generate:

P1: Scan(a) → GetListSize(e1) → GROUP BY (COUNT(*))
P2: Scan(b) → GetListSize(e1) → GROUP BY (COUNT(*))

We need to be more aware of the actual properties needed. ID is just another property we do a join on.

Cleaner API for enumerator

Provide a cleaner API for appendOperator(Plan&), which ideally takes only a plan as a parameter; the plan contains all the necessary information.

Match any label and label pruning

  • MATCH (n) is currently interpreted as n having label -1 (match everything). The right thing to do is to replace -1 with a vector containing all node labels in the system.
  • In the optimizer, we should do label pruning to handle the above case.

HashJoin Enumeration

Currently, we enumerate a HashJoin when the left and right plans each have at least 2 Extends. This works if the two Extends are many-to-many. In the case of a many-to-one (column) Extend, we should not enumerate a HashJoin (the plan still works with a HashJoin, but it can always be replaced with an Extend).

The right thing to do is to enumerate a HashJoin only when the left and right plans each have at least 2 many-to-many Extends.

Fix Server serialization

The execution of a serialized LogicalPlan in the server is no longer supported, and the interface will be fixed once we add an enumerator.

Understand copy elision

Suppose

unique_ptr<T> functionB();

void functionA() {
   setT(move(functionB()));
}

The move() disables copy elision. Understand what the consequences of this are and whether it is the right thing to do.
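
For comparison, a minimal self-contained version without the extra move(): functionB() returns a prvalue, so it can initialize setT's parameter directly, whereas wrapping it in move() forces the temporary to be materialized and then move-constructed (the same redundant-move pattern in a return statement is the classic case that suppresses copy elision).

    #include <memory>

    struct T {};

    std::unique_ptr<T> functionB() {
        return std::make_unique<T>();
    }

    void setT(std::unique_ptr<T> t) {
        // takes ownership of t
    }

    void functionA() {
        // functionB() is a prvalue, so it initializes setT's parameter directly;
        // wrapping it in std::move() only forces an extra move construction.
        setT(functionB());
    }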

Separate constants from types in types.h

Currently, in types.h, we have system-level types such as gf_string_t, nodeID_t, etc., and we also have constants such as MAX_VECTOR_SIZE, NUM_BYTES_PER_NODE_ID, etc.
It would be clearer to separate the constants from the types into a separate constants.h file as more system-level constants become available.
