kuzudb-study's People

Contributors

acquamarin, andyfenghku, prrao87

kuzudb-study's Issues

Running queries 3 and 4 with parameters doesn't work

I'm passing a parameter to the following query in Kuzu and it doesn't seem to work as intended.

# Which are the top 5 cities in a particular country with the lowest average age in the network?
query = """
    MATCH (country:Country)
    WHERE co.country = $country
    AND EXISTS { MATCH (p:Person)-[:LivesIn]->(ci:City)-[*..2]->(country) }
    RETURN ci.city AS city, avg(p.age) AS averageAge
    ORDER BY averageAge LIMIT 5
"""

Kuzu doesn't seem to like using the WHERE xx AND EXISTS {MATCH yy} clause, despite this being valid Cypher in Kuzu as per the example notebook.

Output:

Traceback (most recent call last):
  File "/Users/prrao/code/kuzudb-study/kuzudb/query.py", line 85, in <module>
    main(CONNECTION)
  File "/Users/prrao/code/kuzudb-study/kuzudb/query.py", line 76, in main
    run_query3(conn, params=[("country", "Canada")])
  File "/Users/prrao/code/kuzudb-study/kuzudb/query.py", line 51, in run_query3
    response = conn.execute(query, parameters=params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/prrao/.pyenv/versions/3.11.2/lib/python3.11/site-packages/kuzu/connection.py", line 88, in execute
    self._connection.execute(
RuntimeError: Parser exception: Invalid input <MATCH (country:Country)
        WHERE co.country = $country
        AND EXISTS { MATCH (p:Person)-[:LivesIn]->(ci:City)-[*..>: expected rule oC_SingleQuery (line: 4, offset: 62)
"        RETURN ci.city AS city, avg(p.age) AS averageAge"

The source of the error isn't clear and needs a deeper dive. Query 4 suffers from the same issue.

Generate edge data

  • Edges for person-persons
  • Edges for person-locations
  • Edges for person-interests
  • Edges for city-state and state-country

Indexes

Can you try to run Neo4j with indexes?
For example, on Country.country, Person.age and City.city.
Create two properties, Person.lower_gender and Interest.lower_interest, that hold the value already in lowercase, and index both. Alternatively, just store the data as its toLower version and index/use that directly.
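
The lowercase-property suggestion could be applied before ingestion rather than inside the database. A minimal sketch, assuming the node data is available as a list of dicts; `lower_interest` is the property name suggested above, while `add_lowercase_column` and the sample rows are hypothetical and not part of the repo:

```python
def add_lowercase_column(rows, source, target):
    """Return copies of the rows with rows[target] set to rows[source] in lowercase."""
    return [{**row, target: row[source].lower()} for row in rows]


# Hypothetical sample rows, for illustration only
interests = [
    {"id": 1, "interest": "Photography"},
    {"id": 2, "interest": "Fine Dining"},
]
normalised = add_lowercase_column(interests, "interest", "lower_interest")
```

With the lowercase value stored up front, the query can compare against the indexed property directly and only needs to lowercase the incoming parameter.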

Also what does the throughput of these queries look like? Not just the time.

Also comparing queries that take so little time is not the most useful.
0.0086 seconds vs 0.0046 seconds is kinda pointless.
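
A rough way to report throughput alongside latency is to run each query many times and divide the run count by the total wall-clock time. The sketch below assumes each query can be wrapped in a zero-argument callable; `measure_throughput` and `fake_query` are hypothetical names, and `fake_query` is only a stand-in for a real driver call:

```python
import time


def measure_throughput(run_query, n_runs=100):
    """Run `run_query` n_runs times sequentially.

    Returns (queries per second, mean latency in seconds).
    """
    start = time.perf_counter()
    for _ in range(n_runs):
        run_query()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed, elapsed / n_runs


def fake_query():
    # Stand-in workload; replace with a call that executes the real query
    sum(range(1000))


qps, mean_latency = measure_throughput(fake_query, n_runs=200)
print(f"{qps:.1f} queries/s, mean latency {mean_latency * 1e3:.3f} ms")
```

This measures sequential, single-client throughput only. Numbers under concurrent load would need multiple clients and could look quite different.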

Queries 9 and 10 are segfaulting

Queries 9 and 10 run in Neo4j (relatively slowly), but they segfault in Kùzu. Is it something I'm doing wrong? It would help if there were more helpful error messages, but I can imagine there's something going on in the translation of Cypher logic into the C++ layer.

Neo4j

Query 9:
 
        MATCH (:Person)-[r1:FOLLOWS]->(influencer:Person)-[r2:FOLLOWS]->(:Person)
        WITH count(r1) AS numFollowers, influencer, r2
        WHERE influencer.age <= $age_upper AND numFollowers > 3000
        RETURN influencer.id AS influencerId, influencer.name AS name, count(r2) AS numFollows
        ORDER BY numFollows DESC LIMIT 5;
    

        Influencers below age 30 who follow the most people:
shape: (5, 3)
┌──────────────┬─────────────────┬────────────┐
│ influencerId ┆ name            ┆ numFollows │
│ ---          ┆ ---             ┆ ---        │
│ i64          ┆ str             ┆ i64        │
╞══════════════╪═════════════════╪════════════╡
│ 89758        ┆ Joshua Williams ┆ 40         │
│ 85914        ┆ Micheal Holt    ┆ 32         │
│ 8077         ┆ Ralph Floyd     ┆ 32         │
│ 1348         ┆ Brett Wright    ┆ 32         │
│ 70809        ┆ David Cooper    ┆ 31         │
└──────────────┴─────────────────┴────────────┘
        

Query 10:
 
        MATCH (:Person)-[r1:FOLLOWS]->(influencer:Person)-[r2:FOLLOWS]->(person:Person)
        WITH count(r1) AS numFollowers, person, influencer, r2
        WHERE influencer.age >= $age_lower AND influencer.age <= $age_upper AND numFollowers > 3000
        RETURN person.id AS personId, person.name AS name, count(r2) AS numFollowers
        ORDER BY numFollowers DESC LIMIT 5;
    

        Influencers below the age of 18-25 who can be considered 'influencers' in the network:
shape: (5, 3)
┌──────────┬────────────────────┬──────────────┐
│ personId ┆ name               ┆ numFollowers │
│ ---      ┆ ---                ┆ ---          │
│ i64      ┆ str                ┆ i64          │
╞══════════╪════════════════════╪══════════════╡
│ 88759    ┆ Kylie Chang        ┆ 5            │
│ 39355    ┆ Elizabeth Hamilton ┆ 4            │
│ 12104    ┆ Cheryl Coleman     ┆ 4            │
│ 27567    ┆ Michael Dominguez  ┆ 4            │
│ 31072    ┆ Jeanette Nolan     ┆ 4            │
└──────────┴────────────────────┴──────────────┘
        
Neo4j query script completed in 17.753386s

Kùzu

Query 9:
 
        MATCH (:Person)-[r1:Follows]->(influencer:Person)-[r2:Follows]->(:Person)
        WITH count(r1) AS numFollowers, influencer, r2
        WHERE influencer.age <= $age_upper AND numFollowers > 3000
        RETURN influencer.id AS influencerId, influencer.name AS name, count(r2) AS numFollows
        ORDER BY numFollows DESC LIMIT 5;
    
[1]    15847 bus error  python query.py

Convert CSV node/edge generation to parquet

CSV files are useful for debugging during initial data inspection, but to scale things up, it makes sense to write all the generated node/edge data to Parquet before ingesting it into a graph.

Segfault when running query 2

I'm trying to run the following query in Kuzu, which passes the result of one MATCH on to a second MATCH, and it segfaults.

# In which city does the most-followed person in the network live?
query = """
    MATCH (follower:Person)-[:Follows]->(person:Person)
    WITH person, count(follower) as followers
    ORDER BY followers DESC LIMIT 1
    MATCH (person) -[:LivesIn]-> (city:City)
    RETURN person.name AS name, followers AS numFollowers, city.city AS city
"""

Output:

[1]    15084 segmentation fault  python query.py

The same query runs in Neo4j. Not sure about the best way to run this query in Kuzu to obtain a valid result.

Queries 7 and 8 return inconsistent results between Neo4j and Kùzu

Problem

Queries 7 and 8, which count the edges between persons and locations, do not show consistent results between the two DBs. The other queries showed identical results, so it's unclear why these two queries specifically don't match.

Neo4j results

Query 7:
 
        MATCH (p:Person)-[:LIVES_IN]->(:City)-[:CITY_IN]->(s:State)
        WHERE p.age >= $age_lower AND p.age <= $age_upper AND s.country = $country
        WITH p, s
        MATCH (p)-[:HAS_INTEREST]->(i:Interest)
        WHERE tolower(i.interest) = tolower($interest)
        RETURN count(p) AS numPersons, s.state AS state, s.country AS country
        ORDER BY numPersons DESC LIMIT 1
    

            State in United States with the most users between ages 23-30 who have an interest in photography:
shape: (1, 3)
┌────────────┬────────────┬───────────────┐
│ numPersons ┆ state      ┆ country       │
│ ---        ┆ ---        ┆ ---           │
│ i64        ┆ str        ┆ str           │
╞════════════╪════════════╪═══════════════╡
│ 172        ┆ California ┆ United States │
└────────────┴────────────┴───────────────┘
            
Query 7 completed in 0.163613s

Query 8:
 
        MATCH (p1:Person)-[f:FOLLOWS]->(p2:Person)
        WHERE p1.personID > p2.personID
        RETURN count(f) as numFollowers
    
Number of second degree connections reachable in the graph:
shape: (1, 1)
┌──────────────┐
│ numFollowers │
│ ---          │
│ i64          │
╞══════════════╡
│ 1219517      │
└──────────────┘
Query 8 completed in 0.990533s

KùzuDB results

Query 7:
 
        MATCH (p:Person)-[:LivesIn]->(:City)-[:CityIn]->(s:State)
        WHERE p.age >= $age_lower AND p.age <= $age_upper AND s.country = $country
        WITH p, s
        MATCH (p)-[:HasInterest]->(i:Interest)
        WHERE lower(i.interest) = lower($interest)
        RETURN count(p.id) AS numPersons, s.state AS state, s.country AS country
        ORDER BY numPersons DESC LIMIT 1
    

            State in United States with the most users between ages 23-30 who have an interest in photography:
shape: (1, 3)
┌────────────┬────────────┬───────────────┐
│ numPersons ┆ state      ┆ country       │
│ ---        ┆ ---        ┆ ---           │
│ i64        ┆ str        ┆ str           │
╞════════════╪════════════╪═══════════════╡
│ 169        ┆ California ┆ United States │
└────────────┴────────────┴───────────────┘
            
Query 7 completed in 0.010869s

Query 8:
 
        MATCH (p1:Person)-[f:Follows]->(p2:Person)
        WHERE p1.id > p2.id
        RETURN count(f) as numFollowers
    
Number of second degree connections reachable in the graph:
shape: (1, 1)
┌──────────────┐
│ numFollowers │
│ ---          │
│ i64          │
╞══════════════╡
│ 1214477      │
└──────────────┘
Query 8 completed in 0.027979s

Summary of inconsistency:

Query   Neo4j (count)   Kùzu (count)
7       172             169
8       1219517         1214477

What could be going wrong??
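
One way to narrow this down would be to export the relevant edges from each database as (source id, target id) pairs and diff them, which would show whether one side holds duplicate or extra edges. This is a sketch under that assumption and does not identify the actual cause; `compare_edges` is a hypothetical helper and the edge lists are toy data:

```python
from collections import Counter


def compare_edges(edges_a, edges_b):
    """Compare two edge lists given as (source_id, target_id) pairs."""
    count_a, count_b = Counter(edges_a), Counter(edges_b)
    return {
        # Edges that appear more than once within a single database
        "duplicates_in_a": {e: n for e, n in count_a.items() if n > 1},
        "duplicates_in_b": {e: n for e, n in count_b.items() if n > 1},
        # Edges present in one database but missing from the other
        "only_in_a": set(count_a) - set(count_b),
        "only_in_b": set(count_b) - set(count_a),
    }


# Toy data, not taken from the real graph
neo4j_edges = [(2, 1), (3, 1), (3, 1), (4, 2)]
kuzu_edges = [(2, 1), (3, 1)]
report = compare_edges(neo4j_edges, kuzu_edges)
print(report)
```

If the duplicate sets are non-empty on one side only, the ingestion step for that database would be the first place to look. If the "only in" sets are non-empty, the source edge files or the load filters would be.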

Update text for docs

  • Remove description of queries from the neo4j and kuzudb directories -- the description of all 10 queries is already in the main page
  • Change query 8 to say "first-degree connections" as we're not doing a 2-hop query (Neo4j takes too long)
