
Linköping GraphQL Benchmark (LinGBM)

LinGBM is a performance benchmark for GraphQL server implementations. The wiki of this repo provides an introduction to the project, the specification of the benchmark, and design artifacts. This repo contains artifacts created for the benchmark (such as GraphQL schemas, query templates, and query workloads) as well as benchmark software.

Publications related to LinGBM

lingbm's People

Contributors

chengsijin0817, daniel-dona, dependabot[bot], hartig, ljukas, rabnawazjansher


lingbm's Issues

Update LinGBM.wiki

  • GraphQL schema
  • Query templates
  • Performance metrics
  • Create sidebar
  • Restructure the directory of the Git repo

OrderFieldInput - Q8

I just want to ask how this is supposed to work.

In the following query:

offers(where: OfferWhereInput, limit: Int, order: [OrderFieldInput]): [Offer]

We can give an order argument, which is a list of OrderFieldInputs; each OrderFieldInput can have two fields to order by.

If there is just ONE OrderFieldInput, my guess would be that we sort first by orderField1, and second by orderField2 while preserving the ordering given by orderField1.

My question then is how should it work if there is a list of OrderFieldInputs? Q8 uses this input, but it says that $attrOffer1 is selected from the 10 possible values of OffersSortingField, the same for $attrOffer2. But the schema does not match here.

The query template should look like the following to use the current schema:

{
  offers(limit:$cnt, order: { orderField1: $attrOffer1, orderField2: $attrOffer2} ) {
    offerWebpage
    validFrom
    price
  }
}

Notice here that it does not take an array, but a single OrderFieldInput.

We need to decide which approach to take. I recommend the one in the example above: the order argument is always just a single OrderFieldInput, so we can sort by at most two fields.
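The single-OrderFieldInput interpretation described above can be sketched as follows. This is a minimal Python illustration, not benchmark code; the offer data and field values are made up:

```python
# Made-up offer data for illustration.
offers = [
    {"price": 20, "validFrom": "2008-05-01"},
    {"price": 10, "validFrom": "2008-03-01"},
    {"price": 10, "validFrom": "2008-01-01"},
]

def order_offers(offers, order_field_input):
    """Sort by orderField1 first; orderField2 only breaks ties among
    offers that share the same orderField1 value (the interpretation
    proposed in the issue)."""
    f1 = order_field_input["orderField1"]
    f2 = order_field_input["orderField2"]
    return sorted(offers, key=lambda o: (o[f1], o[f2]))

result = order_offers(offers, {"orderField1": "price", "orderField2": "validFrom"})
```

Sorting by a `(primary, secondary)` key tuple gives exactly the "keep the order given by orderField1" behavior: the two price-10 offers end up ordered by validFrom, but both still come before the price-20 offer.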

bring back BSBM

(#79 describes how LinGBM was initially based on BSBM but is now based on LUBM)

There are some benefits to BSBM:

  • In BSBM, the result set size doesn't increase with dataset size. In large LUBM variants, response time is dominated by the time to return the result; GraphQL results aren't streamed, so this may be a further problem
  • Some databases have been tested on very large BSBM variants
  • Other virtualization frameworks (e.g. ONTOP, Morph-GraphQL) have been tested against BSBM

And some disadvantages:

  • BSBM uses named graphs; I don't know how these could be mapped to GraphQL, and none of the GraphQL-RDF frameworks I know of supports named graphs

Neutral:

  • BSBM results are often dominated by one query that uses regexp. But LUBM QT5 does the same
  • I think BSBM has a relational variant

(Some of the info above is from @vassilmomtchev, Ontotext CTO)

Would it be a lot of effort to support both?

Typo Error

In the schema there are occurrences of int; these should be replaced with Int (GraphQL's built-in integer scalar is Int, and type names are case-sensitive).

QueryGen Errors

I've started work on a preliminary test driver. It runs all queries in the actualQueries folder synchronously.

When I send the following query:

{
  reviewSearch(field:text,
               criterion:contains,
               pattern:"elms")
  {
    title 
    label
  }
}

I get an error on the criterion field, because according to the schema it is a StringCriterion:

enum StringCriterion 
{
    CONTAINS
    START_with
    END_with
    EQUALS  
}

These values are all caps, so the querygen needs to be modified to fit the schema. I also recommend changing START_with to START_WITH, and the same for END_with.

NOTE: It's queryTemplate10.txt that needs to be edited.

Regarding this, could I be allowed to push to branches on this repo? I have a fix for this issue on my machine but I cannot push it into a branch and make a pull-request. If I could do this I can fix issues and you just have to approve them instead of spending time on fixing them yourselves 😃
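For reference, the querygen fix could be as simple as uppercasing the criterion values it emits. This is a hypothetical Python sketch of the idea, not the actual (Java) querygen code; it assumes the schema is also changed to START_WITH / END_WITH as recommended above:

```python
# Enum values as they would appear in the (fixed) schema.
SCHEMA_ENUMS = {"CONTAINS", "START_WITH", "END_WITH", "EQUALS"}

def normalize_criterion(value):
    """Uppercase a criterion value from a query template and verify it
    is a valid StringCriterion enum value."""
    upper = value.upper()
    if upper not in SCHEMA_ENUMS:
        raise ValueError(f"unknown StringCriterion: {value}")
    return upper
```

With this, the template value `contains` becomes the valid enum value `CONTAINS`, and the mixed-case `START_with` becomes `START_WITH`.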

issue in query template qt3.txt

query researchGroup_department_head_doctorDegreeFrom($researchGroupID:ID) { researchGroup(nr:$researchGroupID) { subOrgnizationOf { head { id email doctorDegreeFrom {id} } } } }

email must be replaced by emailAddress; the query should look like this:

query researchGroup_department_head_doctorDegreeFrom($researchGroupID:ID) { researchGroup(nr:$researchGroupID) { subOrgnizationOf { head { id emailAddress doctorDegreeFrom {id} } } } }

issue in query template qt4.txt

query lecturer_university_graduateStudent_professor_department($lecturerID:ID) { lecturer(nr:$lecturerID) { doctoralDegreeFrom { id undergraduateDegreeObtainedBystudent { id email advisor { id email worksFor {id} } } } } }

email should be replaced by emailAddress, and it should look like this:

query lecturer_university_graduateStudent_professor_department($lecturerID:ID) { lecturer(nr:$lecturerID) { doctoralDegreeFrom { id undergraduateDegreeObtainedBystudent { id emailAddress advisor { id emailAddress worksFor {id} } } } } }

issue Query template QT4.txt

query lecturer_university_graduateStudent_professor_department($lecturerID:ID) { lecturer(nr:$lecturerID) { doctoralDegreeFrom { id undergraduateDegreeObtainedBystudent { id email advisor { id emailAddress worksFor {id} } } } } }

There is an issue: email should be replaced with emailAddress in the above query template. Corrected, it looks like this:

query lecturer_university_graduateStudent_professor_department($lecturerID:ID) { lecturer(nr:$lecturerID) { doctoralDegreeFrom { id undergraduateDegreeObtainedBystudent { id emailAddress advisor { id emailAddress worksFor {id} } } } } }

-nm flag doesn't seem to work

I try to generate one of each queryTemplate using:

java -cp target/querygen-1.0-SNAPSHOT.jar se.liu.ida.querygen.generator -nm 1

But I still get 20 instances of each queryTemplate.

Error: Unknown type "publicationField".

In the type Query:

publicationSearch(field: publicationField!, criterion: StringCriterion!, pattern: String!): [Publication]

I think this should be replaced by:

publicationSearch(field: PublicationField!, criterion: StringCriterion!, pattern: String!): [Publication]

Issue in query Template q14.txt

query multipleFilters($departmentID:ID,$professorType:String, $interestkeyword:String) { university(nr:$universityID) { undergraduateDegreeObtainedBystudent(where: {AND:[{advisor:{age:{criterion:MORETHAN, pattern:$age}}},{advisor:{researchInterest:{criterion:CONTAINS, pattern:$interestKeyword}}}]}) { id emailAddress takeGraduateCourses {id} } } }

In the where filter, the parameter advisor: age is unknown.

I think a possible correct query structure is:

query multipleFilters( $departmentID: ID $professorType: String $interestkeyword: String ) { university(nr: $universityID) { undergraduateDegreeObtainedBystudent( where: { AND: { age: { pattern: $interestkeyword, criterion: MORETHAN } advisor: { researchInterest: { criterion: CONTAINS, pattern: $interestkeyword } } } } ) { id emailAddress takeGraduateCourses { id } } } }
That is, calling the AND filter with an object (AND: { ... }), not with a list (AND: [ ... ]).
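As an aside on the AND: [ ] versus AND: { } question: the list form can express several conditions on the same nested field (e.g. two separate conditions on advisor), which a single object cannot, because object keys must be unique. A minimal Python sketch of the list interpretation; the records and filter shapes are made up for illustration, not taken from the benchmark:

```python
def matches(record, condition):
    """A condition is a dict {field: expected}, where expected may itself
    be a nested condition dict."""
    for field, expected in condition.items():
        value = record.get(field)
        if isinstance(expected, dict):
            if not isinstance(value, dict) or not matches(value, expected):
                return False
        elif value != expected:
            return False
    return True

def where_and(records, conditions):
    """AND as a list: a record passes only if every condition matches."""
    return [r for r in records if all(matches(r, c) for c in conditions)]

# Made-up student records.
students = [
    {"id": 1, "advisor": {"age": 60, "researchInterest": "Research12"}},
    {"id": 2, "advisor": {"age": 35, "researchInterest": "Research12"}},
]

# Two conditions on the same advisor field, something AND:{} cannot express
# without merging them into one object.
result = where_and(students, [
    {"advisor": {"age": 60}},
    {"advisor": {"researchInterest": "Research12"}},
])
```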

Define reporting rules in the wiki

We need an additional page in the wiki that defines the rules for reporting benchmark results. For instance, when reporting benchmark results, it needs to be explicitly mentioned:

  • on what machine(s) the software was run (CPU, RAM, ...),
  • which versions of what software were used (incl. operating system, node.js version, etc),
  • which scale factor(s),
  • etc.

Concerns regarding the "scenario" of the approach outlined in the wiki

I love the motivation behind this effort and benchmarking suite! I think this would be very useful for the community if done in a manner useful to GraphQL authors and not just GraphQL vendors. Especially, I believe it will give the community ideas about different ways to implement a GraphQL backend. Also a fan of your work with GraphQL cost measurement!

With reference to the wiki I have some concerns about this benchmarking approach and am jotting down some thoughts here.

The focus of our benchmark will be a scenario in which data from a legacy database has to be exposed as a read-only GraphQL API with a user-specified GraphQL schema.

Systems that can be used to implement GraphQL servers but that are not designed to support this scenario out of the box can also be tested with the benchmark, for which they have to be extended with an integration component such as a schema delegation layer. In such a case, from the perspective of the benchmark, the combination of the system and the integration component is treated as a black box.

I find this approach extremely confusing because the aim is to test integration with a legacy database of a GraphQL server with a user-defined schema. There are 2 problems I have here:

  1. Won't your approach end up testing the schema delegation layer rather than the GraphQL server? Isn't this a flawed approach to testing the efficiency of a GraphQL server's implementation?
  2. If I was writing a user-defined GraphQL schema, why would that schema expose complex subquerying, filtering, traversal, limit, offset? I would expose exactly the GraphQL schema my apps need. What is the rationale behind putting a schema delegation layer in front of an "out-of-the-box" GraphQL server?

As a personal preference, if I were writing a user-defined schema and building a GraphQL server that queries a database, I would use an ORM such as SQLAlchemy, massive.js, or knex rather than another GraphQL server.

I find this approach of benchmarking a combination of a user-defined GraphQL server, going through a generic "GraphQL vendor", then going to the database, heavily biased towards a Prisma style of implementation and not applicable to anything else. Benchmarking is hard enough with just a server and a database, and adding a third component will make it harder to reason about the correctness of the benchmark.

Are there other GraphQL vendors or products that are GraphQL middleware/ORMs for authors of APIs, other than Prisma and Dreamfactory, that you intend to benchmark with this suite?

Open-source servers like Postgraphile and Hasura (I work here) and vendor solutions like AppSync are meant to be GraphQL servers that are optimised for large numbers of HTTP clients querying for simple to complex real-world queries. It seems pointless to add a GraphQL delegation layer in front of these GraphQL servers.

Recommendation 1:

If this effort is intent on benchmarking the nascent ecosystem of "GraphQL vendors" I would recommend instead:

  1. Choose a dataset on a database with a fixed schema: This is important and reflective of the real world where data is modelled on a database for the database
  2. For the same choke point, write GraphQL queries as exposed by the GraphQL API of the vendor, even if the GraphQL queries have slightly differing syntax
Recommendation 2:

Instead, if this effort is meant to benchmark the broader process of writing GraphQL servers with a hand-written schema, I would recommend:

  1. Choose a GraphQL server implementation and choose a database ORM
  2. Write different optimised backends with different languages
  3. Write the same GraphQL queries that would test similar choke points that your benchmarking suite will run against.

From a usefulness point of view, I think the latter (recommendation 2) is very useful for folks in the community building GraphQL servers! This would become a very useful benchmark for folks building GraphQL servers that have to inevitably talk to a database and that need to choose between different ORMs and approaches to processing the GraphQL query.

I think the former (recommendation 1) is a benchmark of "GraphQL vendors". These are 2 different things and should be treated differently as such.

issue in query template qt2.txt

query university_faculty_publications($universityID:ID) { university(nr:$universityID){ doctoralDegreeObtainers{ publications{title} } } }
I think doctoralDegreeObtainers must have a where clause in it.

QT13 Date-generation error

Hello.

This bug is quite funny, actually. The querygen gave me the following query:

{
  producer(nr:20) {
    products {
      offers(where:{ vendor:{ publishDate:{criterion:AFTER date:"2002-02-30"}}})
      {	
        price
        offerWebpage
        product {label comment}
      }
    }
  }
}

Can you spot the error?

Yes, the date doesn't exist; February does not have 30 days. :D Sending this query to the database makes it return an "Incorrect DATE value" error.
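A sketch of how the querygen could avoid generating impossible dates: pick the day only after looking up the actual length of the chosen month. This is illustrative Python (the querygen itself is Java), and the year range is an assumption:

```python
import calendar
import random
from datetime import date

def random_date(rng, year_min=2000, year_max=2008):
    """Generate a random valid ISO date string."""
    year = rng.randint(year_min, year_max)
    month = rng.randint(1, 12)
    # monthrange returns (weekday of first day, number of days in month),
    # so this correctly handles February and leap years.
    last_day = calendar.monthrange(year, month)[1]
    day = rng.randint(1, last_day)
    return date(year, month, day).isoformat()

rng = random.Random(42)
dates = [random_date(rng) for _ in range(1000)]
```

Because the day is drawn from 1..last_day of the specific month and year, a value like "2002-02-30" can never be produced.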

issue Query template q12.txt

query subqueryFilter1($universityID:ID, $departmentID:ID) 
{ 
  university(nr:$universityID) { 
    doctoralDegreeObtainers (where: {department: {nr:$departmentID} } ){ 
      id 
      emailAddress 
      publications {id } 
    } 
  } 
} 

I think this should be replaced by:

query subqueryFilter1($universityID:ID, $departmentID:ID) { university(nr:$universityID) { doctoralDegreeObtainers (where: {worksFor : { nr: $departmentID} } ){ id emailAddress publications {id } } } }

In doctoralDegreeObtainers, the schema has worksFor (not department) as the input parameter.

Q10 - Schema

Q10 looks like the following:

query stringMatching($textOfReviewKeyword:String)
{
  reviewSearch(field:text,
               criterion:contains,
               pattern:$textOfReviewKeyword)
  {
    title
    label
  }
}

reviewSearch returns a list of reviews according to the schema:

reviewSearch(field: ReviewFieldInput!, criterion: StringCriterion!, pattern: String!): [Review]

A Review has no field label, which the query template wants us to return. I propose removing that field, or changing it to one that exists.

Consider adding documents as a PR for feedback

Hey 👋, just stumbled on this, very cool idea and project. Maybe you could consider opening a PR with your initial assumptions / documents for easier feedback. It's hard to comment on a wiki. If not, how would you like to receive feedback?

Thanks!

Who has run LinGBM?

I don't know if you have plans to publish benchmark results; that usually requires fixed reporting rules and agreement from the vendors. Perhaps ldbcouncil.org can help you with this?

Here is a start for a list, because it's interesting for users and vendors to know who has tried it:

QueryGen - Duplicates

Hello.

When we met at the mid-term review we talked about the test methodology. One thing we mentioned was that we wanted to run throughput tests with the different query-templates.

For some query templates, only a few distinctly different queries can be generated; for example, query template 2 will only generate 22 different queries when the database is created with the regular settings.

One way to combat this would be for the querygen to be able to generate duplicates. I'm ready, or very nearly ready, to start running the real tests now. Would it be possible to include an option in the querygen that makes it generate duplicates? Or should I just copy-paste the generated queries to get more of them?
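The requested option could work by sampling from the distinct generated queries with replacement until the requested count is reached. A hypothetical Python sketch of the idea; the function name and seed handling are assumptions, not the querygen's actual (Java) code:

```python
import random

def pad_with_duplicates(distinct_queries, requested, seed=0):
    """Return exactly `requested` queries. If fewer distinct queries
    exist, pad the list with duplicates sampled uniformly at random."""
    if len(distinct_queries) >= requested:
        return distinct_queries[:requested]
    rng = random.Random(seed)
    extra = [rng.choice(distinct_queries)
             for _ in range(requested - len(distinct_queries))]
    return distinct_queries + extra

# e.g. pad the 22 distinct queries of query template 2 to a 200-query workload
workload = pad_with_duplicates([f"q{i}" for i in range(22)], 200)
```

Keeping all distinct queries first and only padding with duplicates preserves the full coverage of the template while still giving a workload large enough for throughput tests.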

Aggregation related schema and query

  • update schema: AggregateOffers, PriceAggregationOfOffers
  • update Query Template Q16 in Google doc
  • update Q16 in query template on Github repo
  • update Q16 on Github wiki

QT15 and QT16

Hello.

Short question and your thoughts around it.

Since the querygen never generates two queries with the same arguments, some of the query templates are hard to generate in large numbers.

For example, when I run the querygen with "-nm 200", I still only get 12 queries from each of these templates, since that's how many vendors there are.

I suppose one way to raise the querygen's limit is to generate a bigger database, which I think might be a good choice in my tests anyway.

But, maybe we should include an option that forces the querygen to generate the amount specified with the "-nm" flag? Or what do you think regarding this? @hartig @chengsijin0817

A sample size of 12 is not good for much, right?

different package name for querygen

The code for the query generator should be in the Java package se.liu.ida.lingbm.querygen (not se.liu.ida.querygen), and the generated jar file should be named lingbm-querygen-...jar. I will implement both changes now.

Schema - Person

I noticed that the Person type in the schema is not used in any of the query templates. So maybe we should remove it from the schema?

document RDF generation and the LUBM ontology

I assume this test targets GraphQL implementations over RDF (like the Ontotext Platform).

Product - Reviews ordering - ReviewSortingCriterion - Q9

As in #32 we have differing schema and query template here.

We can query reviews on a product with the following:

reviews(order:[ReviewSortingCriterion]): [Review]

Here it says that we can pass a list of ReviewSortingCriterion to the order argument.

The only place this is used in a query template is in Q9 where it is used in the following way:

vendor(nr:$vendorID) {
    offers(limit:50) {
      product {
        reviews( order:{field:$attrReview direction:DESC} ) {
          title
          rating1
          rating2
          rating3
        }
      }
   }
}

Here the order argument is not a list, but a single object again. So I propose that we change the schema to the following:

reviews(order:ReviewSortingCriterion): [Review]

What are your thoughts?

directories and file names of query templates

@chengsijin0817 After seeing the new directory/file structure with the query templates now, I realize that I would like this to be slightly different still.

First, I think that the individual files for the query templates and their "description" should not be in separate sub-directories. Instead, these files should all be under ./artifacts/queryTemplates/main/

Second, the files should be renamed:

  • The files with the query templates should be named QT1.txt, QT2.txt, etc.
  • The files with the variable names should be named QT1.vars, QT2.vars, etc.

/cc @ljukas

Heap Limit reached QT5

So... funny story...

This query:

{
  product(nr:405) {
    reviews {
      title
      reviewFor {
        reviews {
          title
          reviewFor {
            reviews {
              title
              reviewFor { 
                reviews {
                  text
                  title
                  reviewFor { label }
                }
              }
            }
          }
        }
      }
    }
  }
}

Because of the exponential growth of the returned object, the heap limit is reached on the server, and it crashes.

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

Is that something we want to account for in testing? Or would an alternative approach be to remove the last reviewFor level, or similar? @hartig @chengsijin0817
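A back-of-envelope calculation of why this query explodes: each reviews/reviewFor round trip multiplies the number of returned review objects by roughly the average number of reviews per product, so the result grows exponentially with nesting depth. A sketch with assumed, not measured, numbers:

```python
def leaf_objects(reviews_per_product, nesting_depth):
    """Approximate count of leaf review objects in the response, assuming
    every product has the same number of reviews (an idealization)."""
    return reviews_per_product ** nesting_depth

# The query above nests four `reviews` levels.
sizes = {r: leaf_objects(r, 4) for r in (5, 20, 50)}
# Even a modest 20 reviews per product yields 20**4 = 160,000 leaf objects;
# 50 reviews per product yields 6,250,000, easily enough to exhaust the heap
# when the whole response must be materialized at once.
```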
