ldbc / ldbc_finbench_docs
Specification of the LDBC Financial Benchmark
Home Page: https://ldbcouncil.org/ldbc_finbench_docs/ldbc-finbench-specification.pdf
License: Apache License 2.0
Three possible modes for validating the result,
On how to validate the result, there are two possible approaches,
More data profiling required:
e.g., Complex-read 12
result: SUM(loan.loanAmount) -> sumLoanAmount
From @tian-zhong
To be continued.
In the implementation phase:
simple read 4
result: there need to be two results for numEdges and sumAmount
simple read 5
result: the desc includes a result for the transfer amount per day
Regarding how the driver implementation mixes queries, there are some tasks to do,
Complex 8
Is the distance from srcId to dstId the shortest?
Complex 11
How is the final share computed? The description does not state it precisely.
Read Write 2
Is the condition for aborting the transaction that both vertices match the fast-in and fast-out pattern?
Read Write 3
Does p1.id come from srcId, dstId, or both?
Write
It would be better to keep a single format such as personId; the other format in use is Person.personId.
A good point from @rickatultipa,
If Person entities can have "guarantee" relationships, Company entities also have that in corporate guaranteed loan scenarios.
There are some existing warnings and errors when generating the PDF file from LaTeX.
Move TCR 10 to simple reads after solving the problem that the input of Simple Reads comes from the output of Complex Reads.
However, this pattern might be different in the next version based on new query profiling.
simple read 1:
The desc says account or person; I think person should be removed.
The type of the result should also be removed.
simple read 2:
The query takes only one account vertex as input, so it would be better to change id1 to id.
The pattern contains COUNT(edge1), but the result doesn't contain numEdge1. It's better to be consistent.
simple read 3:
It's better to specify the precision of blockRatio.
simple read 6:
The type of the result could be ID rather than [ID].
simple read 7:
Same as read 6. The type of the result could be ID rather than [ID].
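A minimal sketch of the two result shapes discussed for simple read 6 and 7, assuming a toy edge list. The function names and the data are hypothetical and not taken from the specification.

```python
from typing import List, Tuple

# (srcAccountId, dstAccountId); toy data only, not the FinBench schema
Edge = Tuple[int, int]


def collected_result(edges: List[Edge], src: int) -> List[List[int]]:
    """[ID] shape: a single row that holds the whole list of distinct IDs."""
    ids = sorted({dst for s, dst in edges if s == src})
    return [ids]


def row_per_id_result(edges: List[Edge], src: int) -> List[int]:
    """ID shape: one row per distinct ID."""
    return sorted({dst for s, dst in edges if s == src})
```

With the ID shape, each row can be sorted and truncated like any other result row, which is harder to express when all IDs are packed into one list.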
From @tian-zhong
Some comments from Mingxi Wu @ TigerGraph,
A couple of comments on FinBench, The LDBC Financial Benchmark (version 0.0.1-SNAPSHOT)
1.3 Differences between FinBench and SNB
This section seems to mix the graph schema and query characteristics when it tries to differentiate FinBench and SNB.
I would suggest separating the differences into schema shape differences and query shape differences.
schema shape differences
a. it supports multiple edges.
b. dynamic attributes to mark entities (e.g., an account is marked as blocked)
c. quantity attribute + dynamic attribute on edges (e.g., transfer edge has quantity attribute amount, and the dynamic attribute timestamp)
query shape differences
a. variable length path that is qualified by the sum of the edge quantity attributes.
b. path qualification based on the path quantity attributes aggregation, either along one path or a set of paths.
c. ....
With the above taxonomy, we can clearly differentiate the focus of each benchmark. It will also guide us to find different benchmark metrics and choke points.
Section 1.3 in the initial draft was written to make the proposal, so it describes the differences in both data and workload only in general terms.
Besides section 1.3, we should rewrite section 1, the introduction section, further.
Round to keep 4 decimal places
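A minimal sketch of rounding a ratio to 4 decimal places. The issue does not say which rounding mode is intended; round-half-up is assumed here, and the helper name round4 is hypothetical. Python's built-in round() works on binary floats and rounds ties to even, so the decimal module is used to make the mode explicit.

```python
from decimal import Decimal, ROUND_HALF_UP


def round4(value: float) -> float:
    """Round to 4 decimal places, assuming round-half-up (not confirmed by the spec)."""
    # Going through str() avoids exposing the binary float expansion to Decimal.
    quantized = Decimal(str(value)).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
    return float(quantized)
```

Fixing the rounding mode matters for result validation, because two implementations that round ties differently would otherwise disagree in the last digit.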
In the implementation phase:
Write 4
Is the type useful in the pattern?
Write 8
What does one-off mean?
Write 9
Maybe the desc and params should be deposit, not repay, but then it would be the same as Write 8.
Write 11
The person vertex doesn't have an isBlock property.
Consider the exact returned result structure,
E.g.:
groupby to the result. Consider the groupby description in SNB, e.g., CR 4, 6, 8, 13.
SimpleRead1
result: properties id todo
SimpleRead3
result: The result does not require a set, e.g., accounts.id([ID]) -> accountId(ID)
SimpleRead4
params: add startTime,endTime
SimpleRead6
result: The result does not require a set, e.g., companies.id([ID]) -> companyId(ID)
SimpleRead7
result: The result does not require a set, e.g., COLLECT(DISTINCT dstAccount.id)([ID]) -> dstAccountId(ID)
write3
params: add srcId,dstId
params: "amount64-bit Integer" has a formatting problem
write5
params: add currentTime
write6
params: add currentTime
write8
params: amt -> amount; the full spelling is more consistent with other queries
write9
params: amt -> amount; the full spelling is more consistent with other queries
write11:
params: accountId -> mediumId
write12:
params:
write14:
pattern: it also has an apply edge
Edge2 specifies a time period with 2 timestamps. For the loan node (the yellow one in the top right corner), it would be more practical to add the same time span or period condition, which makes it closer to reality.
Points from @yczhang1017
There should be a balance attribute on the Loan vertex, since there are repays and deposits between Accounts and Loans.
Read Write 1
Does edge1 include the historical ones?
The params do not provide attributes for the transfer edge.
Read Write 2
It would be better to add a description of fast-in and fast-out, though it can be found in ComplexRead7.
The params do not provide the attributes.
Read Write 3
The person vertex doesn't have an isBlock property.
The params do not provide the attributes.
Some good points from @rickatultipa,
The schema looks good in the initial draft, and there are a few tweaks may help expand the schema to cover broader-spectrum fintech scenarios:
- If Person entities can have "guarantee" relationships, Company entities also have that in corporate guaranteed loan scenarios.
- As you indicated Loan is a special kind of Account, for loan applications, Medium signin relationships are also tracked.
- Account seems to be very general in the initial draft, we might want to consider 2 special kinds of accounts -- ATMs and POSes, as these 2 are frequently encountered in all card transactions scenarios.
- Lastly, we might want to clarify the specification of all 10 types of relationships, particularly the attributes assigned to each type of relationship.
Hope my inputs make sense, cheers.
Ricky
Complex-read 4
result: add otherAccount
Complex-read 5
question:
1. If there is a path a->b->c->d, do we also need to output a->b->c?
2. If there is a path a->b->c->b->d, is that path valid?
Complex-read 6
result: add mid
Complex-read 7
question: If count(e2)==0, how should the result be output?
Complex-read 8
question: There may be many ratio1/ratio2/ratio3 values; how do we distinguish them?
Complex-read 9
params: add id
question: I think there will be many pairs of edge1 and edge2; how do we choose among them?
Complex-read 10
params: add start_time, end_time, truncation_limit, truncation_order
Complex-read 11
question: I didn't understand how to calculate final share.
Complex-read 12
title: Cycle -> Chain?
params: add start_time, end_time, truncation_limit, truncation_order
result: add Loan
pattern:
1. Do we want to consider the time of the apply relationship?
2. How do we handle the end of the chain if a cycle is found, e.g., p1->p2->p3->p4->p2?
To make the result unique and not too complex, consider finding only the top-N or nearest traces, and keep them sorted.
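The open questions above on Complex-read 5 and 12 (prefix paths, repeated vertices, cycles in the chain, top-N traces) can be made concrete with a small sketch. It shows one possible interpretation only: a chain stops when the next vertex would repeat, only maximal chains are reported, and the nearest (shortest) traces are kept. The function names, the adjacency-dict graph format, and the tie-breaking rule are assumptions, not the specification's answer.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # vertex id -> list of successor ids (toy format)


def maximal_simple_paths(graph: Graph, start: str, max_hops: int) -> List[List[str]]:
    """Enumerate paths from start that never revisit a vertex.

    Only maximal paths are reported, so a->b->c is not listed
    separately when a->b->c->d exists.
    """
    results: List[List[str]] = []

    def dfs(path: List[str]) -> None:
        extended = False
        if len(path) - 1 < max_hops:
            for nxt in graph.get(path[-1], []):
                if nxt in path:  # stepping here would close a cycle, so stop
                    continue
                extended = True
                dfs(path + [nxt])
        if not extended and len(path) > 1:
            results.append(path)

    dfs([start])
    return results


def nearest_traces(paths: List[List[str]], n: int) -> List[List[str]]:
    """Keep the n shortest paths, ties broken by vertex ids, in sorted order."""
    return sorted(paths, key=lambda p: (len(p), p))[:n]
```

Under this interpretation the chain p1->p2->p3->p4->p2 is reported as p1->p2->p3->p4, and a path such as a->b->c->b->d is never produced because b would repeat.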
After reviewing the slides of the last data schema discussion meeting, there are some meaningful comments from @rickatultipa
Hi Shipeng:
I look at the draft and have a few comments:
Slide#9 on Preliminary Data Profiling Result: there are lots of isolated vertices, 21B out of 21.6B, 97% are isolated vertices, which also means only 0.656B vertices are meaningful (connected) and worthy to be ingested for graph analysis. When banks are processing graph data, they tend to purge those isolated data points and the logic is simple: isolated data carry little value in graph-powered network/behavior analysis. I understand you may have used sample data from your group, but I'm wondering if the data can be representative of the average financial industry.
Also, on Slide#9: Hub-vertex degree: Clearly, there are hotspot supernodes with degrees exceeding 100M. This reveals that the underpinning data modeling is questionable, because this will bog down any graph system who tries to traverse such supernode efficiently. On the other hand, consider alternative data modeling/schema that effectively lower the max degree. Over 100M is a bit too extreme -- if the max degrees are in the range of 1-2M, that's probably more practical. (As far as I know, most graph systems today can NOT even traverse hotspot nodes with 10,000+ degrees with a tight latency bound, say, 100ms.)
Slide#14: Regarding TP vs. AP workload, it makes sense to formally introduce some graph algorithms as typical AP workload, from relatively straightforward ones to really sophisticated/time-consuming ones, like PageRank, LPA, Louvain, Node2Vec, etc. And all we need to define are input parameters and expected output formats.
Let me know what you think, thanks.
Best
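The two profiling figures raised in the comments above (share of isolated vertices and hub-vertex degree) can be computed from a vertex list and an edge list. The sketch below uses toy in-memory data; the function name and the returned keys are hypothetical, and a real profile over billions of vertices would need a distributed job rather than this loop.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def degree_profile(vertices: List[str], edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of isolated vertices and the maximum degree."""
    degree: Counter = Counter()
    for src, dst in edges:
        # Count both endpoints; parallel edges each add to the degree.
        degree[src] += 1
        degree[dst] += 1
    isolated = sum(1 for v in vertices if degree[v] == 0)
    return {
        "isolated_fraction": isolated / len(vertices) if vertices else 0.0,
        "max_degree": max(degree.values(), default=0),
    }
```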
Consider these aspects,
In the implementation phase:
complex read 1
How should the result be understood?
Does it have to be mediumId.size == numMedia and mediumType.size == numMedia?
We can add it to the pattern and desc.
complex read 2
Does the result need to use distinct loans for sumLoanAmount and sumLoanBalance?
complex read 4
If there is no edge between src and dst, how should this be handled?
complex read 5
If there is more than one edge between two vertices, do we output the path only once?
complex read 6
The pattern contains 3 transfer, but the desc says more than 3 transfer.
Maybe we need to choose one.
complex read 8
Does upstream mean a vertex, an edge, or an edge's amount?
The upstream in the pattern is different from the desc.
complex read 9
The params are missing Account.id.
complex read 10
The params are missing startTime and endTime.
complex read 11
How is finalShare computed?
complex read 13
The desc says to return companies and the sum of their transfers, but the pattern and result only have sumEdge2Amount.
Maybe we need to add Company.id or Company.name to the result.
About Sort:
It will be easier to understand if we add descending order to the description when a query needs sorting.
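For the complex read 2 question above, a small sketch shows why the choice matters: when the same loan is reached through several matched paths, summing per path and summing per distinct loan give different values. The row format and the function name are assumptions for illustration; the specification would have to state which behavior is intended.

```python
from typing import List, Tuple

# One row per matched path: (loanId, loanAmount). Toy data only.
Match = Tuple[str, float]


def sum_loan_amount(matches: List[Match], distinct: bool = True) -> float:
    """Sum loan amounts either once per distinct loan or once per matched path."""
    if distinct:
        # A dict keeps a single amount per loanId, so repeated matches collapse.
        return sum(dict(matches).values())
    return sum(amount for _, amount in matches)
```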
"Type" in table 2.4 is a property. It should start with a lower case. Change "Type" to "type", or better "accountType" for clarity.
The result can be changed to the following
OtherAccountId
SUM(edge2.loanAmount)
MAX(edge2.loanAmount)
SUM(edge3.loanAmount)
MAX(edge3.loanAmount)