Comments (6)
Hi @maxdemarzi, I added indexes as you suggested, but for some reason, it makes the performance worse, not better, and I'm not sure why. Also, I'll clarify each of your points below:
- I understand that storing the property in lowercase can help avoid unnecessary lowercasing within the Cypher runtime, but I'm also doing the same lowercasing in Kùzu, so it's an apples to apples (and in my view, a fair) comparison
- The query run times reported are the average over many runs, not just a single "cold" run. This is part of a benchmark test suite, see the results below.
- The throughput of the queries is high because I'm not running each query just once. The queries are run as part of a `pytest-benchmark` suite with a warmup cycle of 5 runs, so the run times being measured are "hot". As the full benchmark results below show, the shorter-running queries are run as many as 50+ times for both databases, and the longest-running queries are run a minimum of 5 times. The standard deviations on these runs are not larger than the actual run time of a few milliseconds, so, because each short-running query runs many times across both DBs, I think this is a fair comparison.
$ pytest benchmark_query.py --benchmark-min-rounds=5 --benchmark-warmup-iterations=5 --benchmark-disable-gc --benchmark-sort=fullname
=========================================================== test session starts ============================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=5)
rootdir: /code/kuzudb-study/neo4j
plugins: Faker-19.2.0, anyio-3.7.1, benchmark-4.0.0
collected 9 items
benchmark_query.py ......... [100%]
--------------------------------------------------------------------------------- benchmark: 9 tests ---------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_query1 1.5733 (253.20) 1.6033 (63.31) 1.5928 (215.76) 0.0131 (7.05) 1.6004 (227.62) 0.0201 (41.80) 1;0 0.6278 (0.00) 5 1
test_benchmark_query2 0.5663 (91.13) 0.5889 (23.26) 0.5770 (78.17) 0.0095 (5.12) 0.5746 (81.72) 0.0163 (33.91) 2;0 1.7331 (0.01) 5 1
test_benchmark_query3 0.0362 (5.83) 0.0527 (2.08) 0.0394 (5.34) 0.0043 (2.33) 0.0376 (5.34) 0.0040 (8.33) 2;2 25.3731 (0.19) 19 1
test_benchmark_query4 0.0410 (6.60) 0.0566 (2.24) 0.0435 (5.89) 0.0032 (1.72) 0.0425 (6.04) 0.0016 (3.42) 2;2 23.0038 (0.17) 23 1
test_benchmark_query5 0.0062 (1.0) 0.0267 (1.05) 0.0074 (1.0) 0.0021 (1.15) 0.0070 (1.0) 0.0005 (1.0) 1;5 135.4661 (1.0) 88 1
test_benchmark_query6 0.0177 (2.84) 0.0253 (1.0) 0.0197 (2.67) 0.0019 (1.0) 0.0192 (2.73) 0.0014 (2.81) 7;5 50.6911 (0.37) 45 1
test_benchmark_query7 0.1517 (24.41) 0.1685 (6.66) 0.1556 (21.07) 0.0058 (3.11) 0.1538 (21.87) 0.0007 (1.46) 1;2 6.4286 (0.05) 7 1
test_benchmark_query8 3.1052 (499.72) 3.1835 (125.71) 3.1393 (425.27) 0.0333 (17.89) 3.1493 (447.93) 0.0535 (111.43) 2;0 0.3185 (0.00) 5 1
test_benchmark_query9 7.6747 (>1000.0) 7.7181 (304.78) 7.7004 (>1000.0) 0.0164 (8.82) 7.7041 (>1000.0) 0.0205 (42.60) 2;0 0.1299 (0.00) 5 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
======================================================= 9 passed in 97.45s (0:01:37) =======================================================
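For readers who want to see how the Mean, StdDev, and OPS columns relate to the warmup and rounds settings, here is a rough sketch of the warmup-then-measure pattern using only the standard library. It is not pytest-benchmark's actual implementation; `time_query` and `fake_query` are hypothetical names used for illustration, and `fake_query` stands in for a real database call.

```python
import statistics
import time


def time_query(run_query, warmup_iterations=5, rounds=5):
    """Time a callable after discarding a number of warmup runs."""
    # Warmup runs are executed but not timed, so caches are "hot".
    for _ in range(warmup_iterations):
        run_query()

    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)

    mean = statistics.mean(timings)
    return {
        "min": min(timings),
        "max": max(timings),
        "mean": mean,
        "stddev": statistics.stdev(timings) if rounds > 1 else 0.0,
        "ops": 1.0 / mean,  # operations per second, as in the legend: 1 / Mean
        "rounds": rounds,
    }


def fake_query():
    """Hypothetical stand-in for a database query."""
    sum(range(10_000))


stats = time_query(fake_query)
print(f"mean={stats['mean']:.6f}s stddev={stats['stddev']:.6f}s ops={stats['ops']:.1f}")
```

Comparing `stddev` against `mean` for each query is the check described above: if the spread is small relative to the run time, the measured mean is a reasonable summary of the "hot" run time.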
from kuzudb-study.
Results after adding the three indexes as per #29:
Neo4j vs. Kùzu multi-threaded
Query | Neo4j (sec) | Kùzu (sec) | Speedup factor |
---|---|---|---|
1 | 1.5928 | 0.1193300 | 13.3 |
2 | 0.5770 | 0.1259888 | 4.6 |
3 | 0.0394 | 0.0081799 | 4.8 |
4 | 0.0435 | 0.0078041 | 5.6 |
5 | 0.0074 | 0.0046616 | 1.6 |
6 | 0.0197 | 0.0127203 | 1.5 |
7 | 0.1556 | 0.0067574 | 23.0 |
8 | 3.1393 | 0.0191212 | 164.1 |
9 | 7.7004 | 0.0226162 | 340.7 |
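As a quick sanity check, the speedup factor can be recomputed as the Neo4j mean divided by the Kùzu mean. The sketch below uses the rounded values from the table, so the last digit may differ slightly from the reported factors, which were presumably computed from unrounded timings (an assumption on my part).

```python
# Mean run times in seconds, copied from the table above.
neo4j_sec = {1: 1.5928, 2: 0.5770, 3: 0.0394, 4: 0.0435, 5: 0.0074,
             6: 0.0197, 7: 0.1556, 8: 3.1393, 9: 7.7004}
kuzu_sec = {1: 0.1193300, 2: 0.1259888, 3: 0.0081799, 4: 0.0078041, 5: 0.0046616,
            6: 0.0127203, 7: 0.0067574, 8: 0.0191212, 9: 0.0226162}

# Speedup factor = how many times faster Kùzu is than Neo4j for each query.
speedups = {q: neo4j_sec[q] / kuzu_sec[q] for q in neo4j_sec}

for q, factor in speedups.items():
    print(f"Query {q}: {factor:.1f}x")
```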
Query 8, with the `age` property indexed in Neo4j, is slightly better than before (the speedup in Kùzu drops from 180x to 164x), but for query 9, where we filter on the indexed `age` property across multi-hop traversals, the index causes Neo4j's performance to regress, and Kùzu's speedup, which was ~188x before, is now 340x. I can't explain why that is.
I ran the benchmark multiple times and obtained the same results each time.
The larger points in my blog posts are still valid, I feel: with Kùzu, I don't have to "think" about what properties are being queried on beforehand, and the only property being indexed in Kùzu is the primary key (`id`), so in general, it's the faster of the two solutions across a range of query types and for a range of throughputs.
When returning ids… on Neo4j you are returning the `id` property, not the `id(node)` primary key. That may affect some query speeds.
As for lowercase vs. not: try what I suggested and index the pre-lowercased property, so that you are comparing traversal speed, not string modifications.
You can ask for the `PROFILE` in Neo4j to see where the query is taking longer.
I would also suggest using a real benchmarking tool like Gatling and running each query for at least 60 seconds after warmup.
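A minimal sketch of what asking for the profile could look like from Python follows. It assumes a session object from the official `neo4j` driver (anything with a `run()` method whose result has `consume()` would work); `profiled` and `run_profile` are hypothetical helper names, not part of the benchmark code in this study.

```python
def profiled(query: str) -> str:
    """Prefix a Cypher query with PROFILE unless it already starts with it."""
    stripped = query.lstrip()
    if stripped.upper().startswith("PROFILE"):
        return stripped
    return "PROFILE " + stripped


def run_profile(session, query: str, **params):
    """Run the query under PROFILE and return the execution plan.

    Assumes `session` behaves like a neo4j driver session: `run()` returns
    a result whose `consume()` yields a summary with a `profile` attribute.
    """
    result = session.run(profiled(query), **params)
    summary = result.consume()
    return summary.profile
```

The returned plan lists the operators the runtime used, which should make it visible whether an index seek or a label scan was chosen for a given query.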
But the goal of this study isn't to say that one tool is "better" or faster than the other. It's to apply a similar set of data preprocessing and querying techniques to answer questions about the data, without giving special treatment to Neo4j.
I've been a long-time user of Neo4j, and the constant dancing around one has to do with indexing, refactoring data models to answer specific types of queries, and such, is what makes it hard to justify running it on production workloads with large datasets. I've been experimenting with other datasets at work, and the same speed issues apply (with or without indexing), whereas Kùzu just works.
My takeaway is that Neo4j performs the same role in graph DBMS as Postgres does in RDBMS -- they're both OLTP row/record-wise stores, better at handling transactions, and for OLAP workloads, Neo4j will face the same performance issues that Postgres does, when compared against their OLAP counterparts (ClickHouse or DuckDB for RDBMS, and Kùzu for GDBMS).
I appreciate your inputs, and am glad I went through this learning exercise. Closing this for now. Thanks!
Hello 👋,
As a person who has been asked to look into this, I think this should be left open a little while longer.
@JoshInnis could you open a new issue with your findings when you have them? Thanks!