ldbc / ldbc_finbench_docs

Specification of the LDBC Financial Benchmark

Home Page: https://ldbcouncil.org/ldbc_finbench_docs/ldbc-finbench-specification.pdf

License: Apache License 2.0

Makefile 0.19% TeX 96.23% Python 2.64% Shell 0.93%

ldbc finbench

ldbc_finbench_docs's Issues

[Audit] specify the validation rules

Three possible modes for validating the results:

run ACID tests, result validation, and the throughput test separately
run result validation along with the ACID tests
run benchmarking (throughput measurement) along with the ACID tests and result validation at the same time

There are two possible ways to validate the results:

self-validation by the driver
cross-validation
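
As a minimal sketch of the cross-validation option, assuming each system under test returns its results as a mapping from query ID to an ordered list of rows (the function name and data layout are hypothetical and not part of the driver):

```python
def cross_validate(results_a, results_b):
    """Return the sorted IDs of queries whose results differ between two systems.

    Both arguments are hypothetical dicts mapping a query ID to an ordered
    list of result rows. Row order is compared as-is, on the assumption
    that each query defines a sort order.
    """
    mismatches = []
    for query_id in sorted(set(results_a) | set(results_b)):
        # A query missing from one side yields None and counts as a mismatch.
        if results_a.get(query_id) != results_b.get(query_id):
            mismatches.append(query_id)
    return mismatches
```

Self-validation by the driver would use the same comparison, with one side replaced by a stored set of expected results.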

[Data] More data profiling of different timestamps and distinct edge types required

More data profiling is required for:

different timestamps
distinct edge types

Give function results a name

E.g., Complex-read12
result: SUM(loan.loanAmount) -> sumLoanAmount
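
The naming rule in this example could be applied mechanically. The sketch below assumes the convention generalizes to any single-argument aggregate (lower-cased function name followed by the capitalized property name); the helper name is hypothetical and the spec only confirms the one example above.

```python
import re


def result_name(expression):
    """Derive a camelCase result name from an aggregate expression.

    Assumed convention: FUNC(variable.property) -> funcProperty.
    """
    match = re.fullmatch(r"\s*(\w+)\(\s*(?:\w+\.)?(\w+)\s*\)\s*", expression)
    if match is None:
        raise ValueError(f"not a simple aggregate expression: {expression!r}")
    func, prop = match.groups()
    # Lower-case the function name and capitalize the first letter of the property.
    return func.lower() + prop[0].upper() + prop[1:]
```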

[Data] Add annotation about the data simulation scenario

From @tian-zhong

An account is owned by exactly one person or company. In reality, multiple persons or companies may own the same account, but we ignore that case.
Transfer edges represent successful transfers in the system.

To be continued.

Questions about the simple reads

In the implementation phase:

simple read 4
result: two results are needed, numEdges and sumAmount

simple read 5
result: the description includes a result for the transfer amount per day

[Driver] Read Write Query definition and weight measurement

Given how the driver will mix queries, some tasks remain:

(maybe) narrow the definition of a read-write query
find a way to measure the weight of each read-write query specifically

[Read Write Query] relation between isolation level of SUT and RW query?

Complex and ReadWrite

Complex 8
Is the distance from srcId to dstId the shortest?

Complex11
How is the final share computed? The description does not make it clear.

Read Write 2
Is the abort condition that both vertices match the fast-in and fast-out pattern?

Read Write 3
Does p1.id come from srcId, from dstId, or from both?

Write Params Format

Write
It would be better to keep one consistent format such as personId; the other format in use is Person.personId.

specify the read-write queries' description and provide diagrams

[Schema] Guarantee between Company entities?

A good point from @rickatultipa,

If Person entities can have "guarantee" relationships, Company entities should have them too, as in corporate guaranteed-loan scenarios.

[Spec Writing] optimize the generation of spec pdf file

There are existing warnings and errors when generating the PDF file from LaTeX.

Adjust the simple reads to match the output of the complex reads, per the query mix design

[Query] Add more queries that apply function expressions like min(), max(), etc.

TCR 10 should be moved to Simple Reads

Move TCR 10 to Simple Reads once the issue that the input of Simple Reads comes from the output of Complex Reads is resolved.

However, this pattern might be different in the next version based on new query profiling.

Simple Read

simple read 1:
The description says account or person; I think person should be removed.
The type of the result should also be removed.

simple read2:
The query takes only one account vertex as input, so id1 would be better renamed to id.
The pattern contains COUNT(edge1), but the result does not contain numEdge1. The two should be consistent.

simple read3:
The precision of blockRatio should be specified.

simple read6:
The type of result could be ID rather than [ID].

simple read7:
Same as read6. The type of result could be ID rather than [ID].

[Query] Merge the queries with similar patterns

[Data] Transfer failure should be considered in the benchmark, which is applied in malicious activity monitoring

From @tian-zhong

[Query] add delete query

[Data] Consider removing the timestamp attribute in `signIn` edge

[Spec Writing] section 1.3 describing the differences should be clearer

Some comments from Mingxi Wu @ TigerGraph,

A couple of comments on FinBench, the LDBC Financial Benchmark
(version 0.0.1-SNAPSHOT)

1.3 Differences between FinBench and SNB
This section seems to mix the graph schema and query characteristics when it tries to differentiate FinBench and SNB.
I would suggest separating the differences into schema shape differences and query shape differences.

schema shape differences
a. it supports multiple edges.
b. dynamic attributes to mark entities (e.g., an account is marked as blocked)
c. quantity attribute + dynamic attribute on edges (e.g., transfer edge has quantity attribute amount, and the dynamic attribute timestamp)

query shape differences
a. variable-length paths qualified by the sum of the edge quantity attributes.
b. path qualification based on aggregating the quantity attributes, either along one path or over a set of paths.
c. ....

With the above taxonomy, we can clearly differentiate the focus of each benchmark. It will also guide us in finding different benchmark metrics and choke points.

Section 1.3 in the initial draft was written for the proposal, so it describes the differences only in general terms, covering both data and workload.

Besides section 1.3, we should further rewrite section 1, the introduction.

[Query] Consider whether we need to carve out the "short queries"

[Query] Round to keep 4 decimal places about blockRatio in Read Query #5 and #6

Round to keep 4 decimal places
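
A minimal sketch of the rounding, assuming blockRatio is the number of blocked accounts divided by the total number of accounts considered (the function and parameter names are hypothetical, and the handling of an empty denominator is an assumption the spec would need to settle):

```python
def block_ratio(num_blocked, num_total):
    """Return num_blocked / num_total rounded to 4 decimal places."""
    if num_total <= 0:
        # Assumption: an empty denominator is treated as an error here.
        raise ValueError("num_total must be positive")
    return round(num_blocked / num_total, 4)
```

For example, `block_ratio(1, 3)` gives `0.3333` and `block_ratio(2, 3)` gives `0.6667`.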

Questions about the writes

In the implementation phase:

Write 4
Is the type useful in the pattern?

Write 8
What does one-off mean?

Write 9
Maybe the description and params should say deposit rather than repay, but then it would be the same as Write 8.

Write 11
The person vertex does not have an isBlock property.

[Format] The doc has some formatting or typographic errors.

Complex-read3 has a format error in params.

Please remove the special symbol.

In Complex-read8, the word ratio is set in a different font from the other words.

Complex-read9

The result does not match the description of the query result.

Complex-read12

The pattern description does not match the result.

Refine the returned result

Result structure

Consider the exact returned result structure,

list of tuple
tuple of list
nested json

E.g.:

CR 8
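
The three candidate shapes can be compared on a small result set. The column names and values below are hypothetical and only illustrate the layouts; they are not taken from CR 8.

```python
import json

# Hypothetical result rows as (dstId, ratio) pairs; illustrative values only.
rows = [(101, 0.25), (102, 0.5)]

# 1. List of tuples: one tuple per result row.
list_of_tuples = list(rows)

# 2. Tuple of lists: one list per result column.
tuple_of_lists = tuple(list(column) for column in zip(*rows))

# 3. Nested JSON: one object per row, keyed by column name.
nested_json = json.dumps([{"dstId": dst_id, "ratio": ratio} for dst_id, ratio in rows])
```

The list of tuples keeps each row together, which matches row-wise sorting and truncation; the tuple of lists groups values by column; the nested JSON carries the column names with the data.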

Query template

add the group-by to the result, following the group-by description in SNB. E.g., CR 4, 6, 8, 13
add a new row for the result sort order.

Questions about the simple-read and write queries

SimpleRead1

result: properties id todo

SimpleRead3

result: the result does not need to be a set, e.g. accounts.id([ID]) -> accountId(ID)

SimpleRead4

params: add startTime, endTime

SimpleRead6

result: the result does not need to be a set, e.g. companies.id([ID]) -> companyId(ID)

SimpleRead7

result: the result does not need to be a set, e.g. COLLECT(DISTINCT dstAccount.id)([ID]) -> dstAccountId(ID)

write3

params: add srcId, dstId

params: amount64-bit Integer has a formatting problem

write5

params: add currentTime

write6

params: add currentTime

write8

params: amt -> amount; the full spelling is more consistent with other queries

write9

params: amt -> amount; the full spelling is more consistent with other queries

write11:

params: accountId -> mediumId

write12:

params:

personId1
personId2
currentTime

write14:

pattern: it also has an apply edge

[Data] Annotate the one-to-many or many-to-many relationship between two vertex types along with the edge multiplicity

[Query] specify the ratios in Read Query #9

[Query] Add time range on the deposit relationship in Read Query #2

edge2 specifies a time period with two timestamps. For the loan node (the yellow one in the top right corner), it would be more practical to add the same time span condition, bringing the query closer to reality.

Define an abstract edge whose common attributes are inherited by derived edges, to support the per_node_limit feature

[Schema] add balance attribute to vertex Loan.

Points from @yczhang1017

There should be a balance attribute in Loan vertex since there are repays and deposits between Accounts and Loans.

Questions about the read-write queries

Read Write 1
Does edge1 include historical edges?
The params do not provide attributes for the transfer edge.

Read Write 2
It would be better to add a description of fast-in and fast-out, though it can be found in ComplexRead7.
The params do not provide attributes.

Read Write 3
The person vertex does not have an isBlock property.
The params do not provide attributes.

[Schema] Guarantee between Company entities?

Some good points from @rickatultipa,

The schema looks good in the initial draft, and a few tweaks may help expand the schema to cover broader-spectrum fintech scenarios:

If Person entities can have "guarantee" relationships, Company entities should have them too, as in corporate guaranteed-loan scenarios.

As you indicated, Loan is a special kind of Account; for loan applications, Medium signin relationships are also tracked.

Account seems to be very general in the initial draft, we might want to consider 2 special kinds of accounts -- ATMs and POSes, as these 2 are frequently encountered in all card transaction scenarios.

Lastly, we might want to clarify the specification of all 10 types of relationships, particularly the attributes assigned to each type of relationship.
Hope my inputs make sense, cheers.
Ricky

[Query] thresholds and time sequencing in read query #8

The threshold for edge1 (transfer) and the threshold for edge2 (withdraw) should not be the same.
The timing of transfer and withdrawal is not significant as long as they are within the range of the inquiry time window.
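
These two points can be sketched as a filter that takes a separate threshold per edge type and checks only that each edge falls inside the query time window, without comparing transfer and withdrawal timestamps with each other. The field names, the inclusive window bounds, and the strict amount comparison are assumptions for illustration.

```python
def filter_edges(edges, start_time, end_time, thresholds):
    """Keep edges inside the time window whose amount exceeds the threshold for their type.

    edges: list of dicts with hypothetical keys "type", "timestamp", "amount".
    thresholds: dict mapping an edge type to its own minimum amount.
    """
    return [
        edge
        for edge in edges
        if start_time <= edge["timestamp"] <= end_time
        and edge["amount"] > thresholds[edge["type"]]
    ]
```

A withdrawal that precedes a transfer is still kept as long as both lie within the window.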

[Test] An issue to test webhooks

[Spec Writing] Refine the diagrams of the queries in the formats and description

Questions about the Complex-Read queries

Complex-read 4
result: add otherAccount

Complex-read 5
question:
1. If there is a path a->b->c->d, do we also need to output a->b->c?
2. Is a path such as a->b->c->b->d valid?
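
To make the two questions concrete, the sketch below shows how each candidate rule could be checked. Which rule the specification adopts is still open; the helper names are hypothetical and paths are modeled as plain lists of vertex IDs.

```python
def is_simple_path(path):
    """True if no vertex appears more than once in the path."""
    return len(set(path)) == len(path)


def is_proper_prefix(candidate, path):
    """True if candidate is a strictly shorter leading part of path."""
    return len(candidate) < len(path) and path[:len(candidate)] == candidate


def maximal_paths(paths):
    """Drop every path that is a proper prefix of another path in the list."""
    return [p for p in paths if not any(is_proper_prefix(p, q) for q in paths)]
```

Under a simple-path rule, a->b->c->b->d would be rejected because b repeats; under a maximal-path rule, a->b->c would be dropped when a->b->c->d is also present.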

Complex-read 6
result: add mid

Complex-read 7
question: If count(e2)==0, how should the result be output?

Complex-read 8
question: There may be many ratio1/ratio2/ratio3 values; how do we distinguish them?

Complex-read 9
params: add id
question: I think there will be many pairs of edge1 and edge2; how do we choose among them?

Complex-read 10
params: add start_time, end_time, truncation_limit, truncation_order

Complex-read 11
question: I don't understand how the final share is calculated.

Complex-read 12
title: Cycle -> Chain?
params: add start_time, end_time, truncation_limit, truncation_order
result: add Loan
pattern:
1. Do we want to consider the time of the apply relationship?
2. How do we terminate if a cycle is found in the chain, e.g. p1->p2->p3->p4->p2?
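
One possible way to handle the cycle case, offered as an assumption rather than the spec's answer, is to track visited persons so that each one is expanded at most once. The adjacency-dict layout and the function name are hypothetical.

```python
from collections import deque


def reachable_guarantees(start, guarantees):
    """Return persons reachable from start along guarantee edges, in BFS order.

    guarantees: hypothetical dict mapping a person to the persons they guarantee.
    A visited set ensures traversal terminates when the chain contains a cycle.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for nxt in guarantees.get(person, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

For the chain p1->p2->p3->p4->p2, the traversal from p1 visits p2, p3, and p4 once each and then stops, since p2 is already marked as seen.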

Complex-read 13
result: add Company

[Query] A high-complexity path query in Read Query #7

To make the result unique and not too complex, consider finding only the top-N or nearest traces, and keep them sorted.
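
A minimal sketch of this idea, assuming "nearest" means fewest hops and that ties are broken lexicographically so the output is deterministic (both are assumptions; the function name is hypothetical and traces are modeled as lists of vertex IDs):

```python
import heapq


def top_n_traces(traces, n):
    """Return the n shortest traces, sorted by hop count and then lexicographically."""
    return heapq.nsmallest(n, traces, key=lambda trace: (len(trace), trace))
```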

[Query] specify the return type of paths, e.g. Read Query #13 and #7

add relevance to each query

[Raw comments] Concerning data distribution and AP workload

Comments from @rickatultipa

After reviewing the slides from the last data schema discussion meeting, @rickatultipa made some meaningful comments:

Hi Shipeng:
I looked at the draft and have a few comments:
Slide#9 on Preliminary Data Profiling Result: there are lots of isolated vertices, 21B out of 21.6B; 97% are isolated vertices, which also means only 0.656B vertices are meaningful (connected) and worth ingesting for graph analysis. When banks process graph data, they tend to purge those isolated data points, and the logic is simple: isolated data carry little value in graph-powered network/behavior analysis. I understand you may have used sample data from your group, but I'm wondering whether the data is representative of the financial industry on average.
Also, on Slide#9: Hub-vertex degree: Clearly, there are hotspot supernodes with degrees exceeding 100M. This suggests that the underlying data modeling is questionable, because this will bog down any graph system that tries to traverse such a supernode efficiently. Consider instead an alternative data modeling/schema that effectively lowers the max degree. Over 100M is a bit too extreme -- if the max degrees are in the range of 1-2M, that's probably more practical. (As far as I know, most graph systems today can NOT even traverse hotspot nodes with 10,000+ degrees with a tight latency bound, say, 100ms.)
Slide#14: Regarding TP vs. AP workload, it makes sense to formally introduce some graph algorithms as typical AP workloads, from relatively straightforward ones to really sophisticated/time-consuming ones, like PageRank, LPA, Louvain, Node2Vec, etc. And all we need to define are input parameters and expected output formats.
Let me know what you think, thanks.
Best

[Spec Writing] unify the diagram and figures of queries

Consider these aspects,

expression-oriented: describe the query from the angle of Query Language Expression
data-oriented: describe the query from the angle of the query pattern (like how many neighbors of the seed will be touched)

Questions about the complex reads

In the implementation phase:

complex read 1
How should the result be understood?
Does it have to be mediumId.size == numMedia and mediumType.size == numMedia?
We can add this to the pattern and description.

complex read 2
Does the result need to use distinct loans for sumLoanAmount and sumLoanBalance?

complex read 4
If there is no edge between src and dst, what should be done?

complex read 5
If there is more than one edge between two vertices, do we output the path only once?

complex read 6
The pattern contains 3 transfers, but the description says more than 3 transfers.
Maybe we need to choose one.

complex read 8
Does upstream mean a vertex, an edge, or an edge's amount?
Upstream in the pattern differs from the description.

complex read 9
The params are missing Account.id.

complex read 10
The params are missing startTime and endTime.

complex read 11
How is finalShare computed?

complex read 13
The description says to return companies and the sum of their transfers, but the pattern and result only have sumEdge2Amount.
Maybe we need to add Company.id or Company.name to the result.

About Sort:
It would be easier to understand if the description stated the sort direction (e.g., descending) whenever a query needs sorting.

2 Observed Transaction Vanishes

cypher: The two variables should be the same
