spodgi's Introduction

SpOdgi

Use SPARQL, a general graph query language, to investigate genome variation graphs!

Currently it exposes any Odgi genome variation graph via SPARQL, a W3C standard query language. At the moment this is read-only: the graph cannot yet be modified using SPARQL Update.

Benefit

Any Odgi file is now a SPARQL capable database! No translation or extra storage required. Ready for use by FAIR accessors.

Help wanted

This is a hobby for me, but could be very useful for others so please join and hack on this ;)

I am especially in need of current best practices for packaging and testing Python code. There is a setup.py, but it is rough and probably needs a lot of work.

How to run

You need to have Odgi built locally and its pybind module directory added to your PYTHONPATH. If you work like me, it would be checked out in ~/git/odgi, and then you can use the env.sh script.

You also need an Odgi file. Conversion from GFA is done using odgi build -g test/t.gfa -o test/o.odgi

How to run with docker

There is a Dockerfile in docker/, which you can build with

docker build -t spodgi docker/
# or podman
podman build -t spodgi docker/

Then run interactively with

docker run -it spodgi

Running a SPARQL query on an Odgi file

./sparql_odgi.py  test/t.odgi 'ASK {<http://example.org/node/1> a <http://biohackathon.org/resource/vg#Node>}'

Finding the nodes with sequences that are longer than 5 nucleotides

./sparql_odgi.py  test/t.odgi 'PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> SELECT ?seq WHERE {?x rdf:value ?seq . FILTER(strlen(?seq) >5)}'

See more example queries in the queries directory. You can run them like this.

./sparql_odgi.py test/t.odgi "$(cat queries/selectAllSteps.rq)"

Setting up server to run SPARQL over HTTP

The following command will expose the test/t.odgi file for querying at http://127.0.0.1:5001/sparql.

./sparql_server.py test/t.odgi

If running through docker, expose the 5001 port.

docker run -p 5001:5001 -it spodgi
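
Once the server is running, a client could send it a query over HTTP. The following is a minimal sketch, assuming the endpoint accepts the standard SPARQL protocol "query" parameter on a GET request (the source does not confirm which request styles sparql_server.py supports). build_query_url is a hypothetical helper, not part of SpOdgi.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint address as given in the text above.
ENDPOINT = "http://127.0.0.1:5001/sparql"
QUERY = "ASK {<http://example.org/node/1> a <http://biohackathon.org/resource/vg#Node>}"


def build_query_url(endpoint: str, query: str) -> str:
    """Return a GET URL carrying the query in the 'query' parameter (assumed)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)
print(url)

# To actually send the request (requires the server to be running):
#   from urllib.request import urlopen
#   print(urlopen(url).read().decode())
```

Building the URL separately from sending it keeps the example usable without a running server.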

Variation Graphs as RDF/semantic graphs.

The modelling follows what is described in the vg repository, for example in its ontology directory.

Converting an Odgi into RDF

The code should support all RDF serialisations supported by RDFLib.

./odgi_to_rdf.py --syntax=ttl test/t.odgi test/t.ttl

This produces the same kind of Turtle as the vg view -t code. However, it adds more rdf:type statements and makes it easier to map from a linear genome, because each step is also a region on the linear genome, encoded using faldo:Region as it would be in the Ensembl or UniProt RDF.

How can this work?

Mapping between types/predicates and known objects

The trick is that in VG RDF there are almost one-to-one mappings between an rdf:type or a predicate and a handlegraph object type. For example, if we see vg:Node we know we are dealing with a handle, and if we see rdf:value as predicate we know it works on the node sequences. All VG and FALDO predicates, classes and literals map straightforwardly to a known set of Odgi/libhandlegraph methods and objects.

Predicate                   Object/Class
rdf:value                   Node->sequence
vg:links                    Node->Node (Edge)
vg:linksReverseToReverse    Node->Node (Edge)
vg:linksReverseToForward    Node->Node (Edge)
vg:linksForwardToReverse    Node->Node (Edge)
vg:linksForwardToForward    Node->Node (Edge)
vg:reverseOfNode            Step->Node
vg:node                     Step->Node
vg:path                     Step->Path
vg:rank                     Step->count along its Path
vg:offset                   Step->count along its Path
faldo:begin                 Step->position
faldo:end                   Step->position + Node->sequence.length
faldo:reference             Step->Path
rdf:label                   Path->name

Types                       Object/Class
vg:Node                     Node
vg:Step                     Step
faldo:Region                Step
vg:Path                     Path
faldo:ExactPosition         Step->begin/end (all are known exactly)
faldo:Position              Step->begin/end (all are known exactly, but allows easier querying)
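
The faldo:begin and faldo:end rows can be read as simple arithmetic along a path. A minimal sketch, assuming steps are numbered from 0 and the first step begins at position 0 (both conventions are assumptions here, not taken from the SpOdgi code); step_regions is a hypothetical helper working on plain node lengths.

```python
def step_regions(sequence_lengths):
    """Yield (rank, begin, end) per step, with end = begin + node sequence length."""
    begin = 0  # assumed starting coordinate of the path
    for rank, length in enumerate(sequence_lengths):
        end = begin + length
        yield rank, begin, end
        begin = end  # the next step starts where this one ends


# Three steps over nodes whose sequences are 8, 1 and 3 nucleotides long.
regions = list(step_regions([8, 1, 3]))
print(regions)
```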

SPARQL engines need one method to override

The way the SPARQL engines are built allows us to get full (if not optimal) solutions by implementing just a single method. In the RDFLib case this is called triples, which accepts a triple pattern and a Context (named graph).

For each triple pattern we generate all possible matching triples using Python generators (yield). For example, if we see a triple pattern with rdf:type as predicate, we know we need to iterate over all Odgi objects and return, for each of them, the triples where the rdf:type is set. If the predicate is not known we return an empty generator.
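
The dispatch described above can be sketched in plain Python. This is not the RDFLib Store API and not the SpOdgi implementation: ToyGraph is a stand-in for an Odgi graph, terms are plain strings, and None plays the role of an unbound variable. All names are hypothetical.

```python
RDF_TYPE = "rdf:type"
RDF_VALUE = "rdf:value"
VG_NODE = "vg:Node"


class ToyGraph:
    """Stand-in for an Odgi graph: maps node id to sequence."""

    def __init__(self, sequences):
        self.sequences = sequences


def node_iri(node_id):
    return f"http://example.org/node/{node_id}"


def type_triples(graph, subject, obj):
    """Lazily yield the rdf:type triples matching the pattern."""
    if obj not in (None, VG_NODE):
        return
    for node_id in graph.sequences:
        iri = node_iri(node_id)
        if subject in (None, iri):
            yield iri, RDF_TYPE, VG_NODE


def value_triples(graph, subject, obj):
    """Lazily yield the rdf:value triples matching the pattern."""
    for node_id, seq in graph.sequences.items():
        iri = node_iri(node_id)
        if subject in (None, iri) and obj in (None, seq):
            yield iri, RDF_VALUE, seq


def triples(graph, pattern):
    """Dispatch on the predicate; an unknown predicate yields nothing."""
    subject, predicate, obj = pattern
    if predicate in (None, RDF_TYPE):
        yield from type_triples(graph, subject, obj)
    if predicate in (None, RDF_VALUE):
        yield from value_triples(graph, subject, obj)


graph = ToyGraph({1: "CAAATAAG", 2: "A"})
matches = list(triples(graph, (None, RDF_VALUE, None)))
print(matches)
```

Because every helper is a generator, no triple is produced until the caller asks for it.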

Why VG?

VG is the first useful graph genome variation toolkit, and it has supported writing and reading RDF since 2016. This introduced the predicates and classes needed for describing and navigating through the graph genome topology.

Why FALDO?

FALDO is a way to describe a linear coordinate space, as used in UniProt, the Ensembl/EBI RDF platform and other sequence-oriented databases. Supporting FALDO makes it easier to use queries designed for the linear view on the graph genome view, allowing both kinds of views on the same data.

How to run

Currently this needs a specific branch of Odgi for more Python support (specifically equals methods on step objects). Once that is installed and built, look at env.sh to make sure the Odgi Python module is on your path.

Then you can use the setup.py to install SpOdgi.

Notes

Methods in Odgi

The code to access Odgi methods/objects is listed in Odgi src pythonmodule.cpp

RDFLib 5.x

This code is tested on the upcoming RDFLib 5.x. As 5.x will support federated queries, it is better to use it than 4.x.

Python3 Generators

We use Python generators to allow RDFLib to evaluate queries lazily.

yield from
yield

are common in the code base. These don't map nicely to the internal-iterator style of libhandlegraph, but with pybind we can make the most important methods lazy.

Avoiding fetching known data

Odgi is the storage of the genome graph. We don't add a byte of overhead to the core storage. However, SPARQL is join oriented (joins are implicit on variable reuse), and joins are normally expensive. To avoid fetching the same data from disk/Odgi multiple times for a simple pattern, we attach a reference to the Odgi object (a C++ pointer) to the associated RDFLib URIRef object.

We do this by extending URIRef with our own implementations in . This is useful because the lazy, generator-based evaluation in the RDFLib query engine means that, for normal reasonable queries, Odgi objects have a short lifetime.
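
The idea of an IRI that carries its backing object can be sketched without RDFLib. RDFLib's URIRef is a str subclass, so plain str is used as a stand-in here; NodeIriRef and its handle attribute are hypothetical names, not the classes used in SpOdgi.

```python
class NodeIriRef(str):
    """An IRI string that also remembers the graph object it was created from."""

    def __new__(cls, iri, handle):
        self = super().__new__(cls, iri)
        self.handle = handle  # stand-in for a reference to an Odgi object
        return self


handle = object()  # placeholder for a real Odgi handle
ref = NodeIriRef("http://example.org/node/1", handle)

# The object still behaves like the plain IRI string in comparisons and
# dictionary lookups, while the attached handle avoids a second lookup.
print(ref == "http://example.org/node/1", ref.handle is handle)
```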

This is also made possible because we use predictable patterns in our IRIs. For example we encode the path/step_rank/position for the faldo:Position objects in their IRIs. This means that given an IRI like this we can use the Odgi (or other libhandlegraph) indexes for efficient access.
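
A minimal sketch of decoding such an IRI back into its components. The exact IRI layout used by SpOdgi is not given here, so the pattern {base}/path/{name}/step/{rank}/position/{position} below is a hypothetical layout chosen for illustration only, as is the helper name.

```python
import re
from urllib.parse import unquote

# Hypothetical IRI layout; the real SpOdgi pattern may differ.
POSITION_IRI = re.compile(
    r"^(?P<base>.+)/path/(?P<path>[^/]+)/step/(?P<rank>\d+)/position/(?P<position>\d+)$"
)


def parse_position_iri(iri):
    """Return (path_name, step_rank, position), or None if the IRI does not match."""
    match = POSITION_IRI.match(iri)
    if match is None:
        return None
    return unquote(match["path"]), int(match["rank"]), int(match["position"])


parsed = parse_position_iri("http://example.org/path/x/step/3/position/12")
print(parsed)
```

With the components recovered from the string, the lookup can go straight to the path and step instead of scanning the whole graph.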

Odgi does not have an index for Rank/Position of Steps

This means we need to use an iterator from 0 for every step access. We can be no faster than Odgi here. Unfortunately a lot of interesting queries for visualisation are very much driven by a natural linearisation of the genome variation graph.

Other libhandlegraph implementations do have this (e.g. xg), and there are ways to index this reasonably well without much overhead in the Python code.
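
One possible way to build such an index on the Python side is a list of step start offsets that is filled in a single pass and then searched with bisect. This is a sketch under assumptions (zero-based positions, step lengths already known), not something SpOdgi is stated to do; the function names are hypothetical.

```python
from bisect import bisect_right


def build_offset_index(step_lengths):
    """One linear pass: the start offset of every step along a path."""
    offsets, total = [], 0
    for length in step_lengths:
        offsets.append(total)
        total += length
    return offsets


def rank_at_position(offsets, position):
    """Binary search for the rank of the step covering a zero-based position."""
    return bisect_right(offsets, position) - 1


offsets = build_offset_index([8, 1, 3])
print(offsets, rank_at_position(offsets, 8))
```

The index costs one integer per step and turns each later position lookup from a scan starting at 0 into a binary search.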

spodgi's People

Contributors

jervenbolleman, tetron

spodgi's Issues

Inspect filters to see if they can be translated to a better execution model.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
SELECT 
  ?seq 
WHERE {
  ?x rdf:value ?seq . 
  FILTER(strlen(?seq) >5)
}

This query materializes each sequence as a Python string instead of using the odgi.get_length(handle) method.
If we could push such filter constraints into the triples method, we would be faster because fewer intermediate objects would be generated.
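
A minimal sketch of what pushing the length filter down could look like. CountingGraph is a toy stand-in for an Odgi graph whose get_length and get_sequence take a node id rather than a handle; it only counts how many sequences are materialized and is not the Odgi API.

```python
class CountingGraph:
    """Toy graph that records how often a full sequence is materialized."""

    def __init__(self, sequences):
        self._sequences = sequences
        self.materialized = 0

    def node_ids(self):
        return list(self._sequences)

    def get_length(self, node_id):
        # Stand-in for a cheap length lookup that needs no string copy.
        return len(self._sequences[node_id])

    def get_sequence(self, node_id):
        self.materialized += 1
        return self._sequences[node_id]


def sequences_longer_than(graph, min_length):
    """Check the length first and fetch the sequence only for matching nodes."""
    for node_id in graph.node_ids():
        if graph.get_length(node_id) > min_length:
            yield node_id, graph.get_sequence(node_id)


graph = CountingGraph({1: "CAAATAAG", 2: "A", 3: "G"})
result = list(sequences_longer_than(graph, 5))
print(result, graph.materialized)
```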

Docker images

Make it easier to install and use SpOdgi as it is from master.

Validate Prefixes

Currently, prefixes are not validated.
I tried out several queries and none of them worked, until I realized that the prefix was spelled wrong.
Validating prefixes would be a huge relief for users.
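
One way such a check could work is to compare the PREFIX declarations in a query against the namespaces the store understands. This is a sketch, not an existing SpOdgi feature: the regular expression only handles simple declarations, and suspicious_prefixes is a hypothetical helper.

```python
import re

# Namespaces as they appear in the example queries of this project.
KNOWN_NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "vg": "http://biohackathon.org/resource/vg#",
    "faldo": "http://biohackathon.org/resource/faldo#",
}

PREFIX_DECL = re.compile(r"PREFIX\s+(\w*):\s*<([^>]*)>", re.IGNORECASE)


def suspicious_prefixes(query):
    """Return (prefix, declared_iri, expected_iri) for well-known prefixes bound elsewhere."""
    problems = []
    for prefix, iri in PREFIX_DECL.findall(query):
        expected = KNOWN_NAMESPACES.get(prefix)
        if expected is not None and iri != expected:
            problems.append((prefix, iri, expected))
    return problems


bad_query = "PREFIX vg:<http://biohackathon.org/resource/vg/> ASK {?s a vg:Node}"
problems = suspicious_prefixes(bad_query)
print(problems)
```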

odgi_to_rdf misleading error message when 2nd argument is missing

When odgi_to_rdf is executed without the 2nd argument, the error message is very cryptic about what is going on.

(/usr) [heumos@wave spodgi]$ time python odgi_to_rdf.py --syntax=ttl0k_R64-1-1.odgi 
Usage: odgi_to_rdf.py [OPTIONS] ODGIFILE TTL
Try "odgi_to_rdf.py --help" for help.

Error: Missing argument "TTL".

real    0m0.256s
user    0m0.231s
sys     0m0.025s

Step IRI generation

We need to figure out how to generate an IRI from a step. This requires at least two components: the name of the path, and either an ordinal offset or a rank (step number in the path).

For the first we can use odgi::get_path_name.
I don't yet know how to find the ordinal.
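
One option, sketched here under assumptions, is to derive the ordinal by counting steps while walking the path from its start. The IRI layout {base}/path/{name}/step/{rank} and the helper name are hypothetical, and a plain list stands in for the steps of an Odgi path.

```python
from urllib.parse import quote


def step_iris(base, path_name, steps):
    """Yield one IRI per step, numbered by its order along the path."""
    encoded = quote(path_name, safe="")  # keep the path name IRI-safe
    for rank, _step in enumerate(steps):
        yield f"{base}/path/{encoded}/step/{rank}"


iris = list(step_iris("http://example.org", "chr VIII", ["step-a", "step-b"]))
print(iris)
```

Counting during iteration needs no index, but it only yields the ordinal for steps reached by walking the path from the beginning.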

How to query SpOdgi to get external Ensembl annotations

First we modify the example from Ensembl.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX skos:<http://www.w3.org/2004/02/skos/core#> 
PREFIX owl:<http://www.w3.org/2002/07/owl#> 
PREFIX dc:<http://purl.org/dc/terms/> 
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#> 
PREFIX faldo:<http://biohackathon.org/resource/faldo#> 
PREFIX ensembltranscript:<http://rdf.ebi.ac.uk/resource/ensembl.transcript/> 
PREFIX sio:<http://semanticscience.org/resource/> 
PREFIX dcterms:<http://purl.org/dc/terms/>
PREFIX foaf:<http://xmlns.com/foaf/0.1/>
PREFIX obo:<http://purl.obolibrary.org/obo/>
PREFIX vg:<http://biohackathon.org/resource/vg#>

SELECT *

WHERE {
  ?target a vg:Step ;
    vg:node|vg:reverseNode ?node ;
    faldo:location ?stepLinearLocation .
  ?stepLinearLocation faldo:begin ?bp ;
    faldo:end ?ep .
  ?bp faldo:position ?stepBegin .
  ?ep faldo:position ?stepEnd  .
  BIND(<http://rdf.ebi.ac.uk/resource/ensembl/97/saccharomyces_cerevisiae/R64-1-1/VIII> AS ?ref) .
  SERVICE<https://www.ebi.ac.uk/rdf/services/sparql/>{
    SELECT DISTINCT ?transcript ?ref ?begin ?end {
      ?transcript a <http://rdf.ebi.ac.uk/terms/ensembl/protein_coding> .
      ?transcript faldo:location ?location .
      ?location faldo:begin
        [a faldo:ExactPosition ;
        faldo:position ?begin] .
      ?location faldo:end
        [a faldo:ExactPosition ;
        faldo:position ?end] .
        ?location faldo:reference ?ref .
      FILTER(?begin > ?stepBegin && ?end < ?stepEnd)
    } LIMIT 10
  }
}

Import errors when using the docker container

I built the docker container and tried using it interactively but for the scripts odgi_to_rdf.py and sparql_odgi.py I get the following import error:

root@9580c13cbb68:/spodgi# odgi_to_rdf.py
Traceback (most recent call last):
  File "/usr/bin/odgi_to_rdf.py", line 3, in <module>
    import rdflib
  File "/usr/local/lib/python3.7/dist-packages/rdflib-7.0.0-py3.7.egg/rdflib/__init__.py", line 47, in <module>
    from importlib import metadata
ImportError: cannot import name 'metadata' from 'importlib' (/usr/lib/python3.7/importlib/__init__.py)

For the sparql_server.py script I got a different ImportError:

root@9580c13cbb68:/spodgi# ./sparql_server.py
Traceback (most recent call last):
  File "./sparql_server.py", line 2, in <module>
    from flask import Flask, request, jsonify, Response, g
  File "/usr/local/lib/python3.7/dist-packages/flask-3.0.2-py3.7.egg/flask/__init__.py", line 5, in <module>
    from . import json as json
  File "/usr/local/lib/python3.7/dist-packages/flask-3.0.2-py3.7.egg/flask/json/__init__.py", line 6, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.7/dist-packages/flask-3.0.2-py3.7.egg/flask/globals.py", line 6, in <module>
    from werkzeug.local import LocalProxy
  File "/usr/local/lib/python3.7/dist-packages/werkzeug-3.0.1-py3.7.egg/werkzeug/__init__.py", line 5, in <module>
    from .serving import run_simple as run_simple
  File "/usr/local/lib/python3.7/dist-packages/werkzeug-3.0.1-py3.7.egg/werkzeug/serving.py", line 76, in <module>
    t.Union["ssl.SSLContext", t.Tuple[str, t.Optional[str]], t.Literal["adhoc"]]
AttributeError: module 'typing' has no attribute 'Literal'

Both of these issues seem to come from the fact that the docker container has Python 3.7 installed, but rdflib 7.0.0 and flask 3.0.2 require Python 3.8.
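
A small guard at the top of each script would turn these tracebacks into a clear message. This is a sketch, assuming 3.8 as the minimum version based on the observation above; python_is_supported is a hypothetical helper.

```python
import sys

MINIMUM = (3, 8)  # assumed minimum, taken from the observation above


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("Python %d.%d or newer is required" % MINIMUM, file=sys.stderr)
```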

SPARQL 1.1. Service does not work

This is a limitation of rdflib 4.2.2. You can check out and pip install my branch of rdflib 5 with pip install --pre ~/git/rdflib

Mapping an IRI based lookup to ODGI

For example, in the current vg models in RDF, all nodes have a {SOMEBASE}/node/{ID} IRI as identifier. These can be used as a hack to identify which methods to call.

Consider the SPARQL query:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX node:<http://example.org/node/>
PREFIX vg:<http://biohackathon.org/resource/vg#>

SELECT 
   ?node ?sequenceLength
WHERE {
  BIND(node:25 as ?node)
  ?node a vg:Node ; 
     rdf:value ?sequence .
  BIND(strlen(?sequence) AS ?sequenceLength)
}

Statically analysing the query AST we should be able to determine that this requires a call to odgi.get_handle as that will give us the handle for the node id.

ASK {
node:25 a vg:Node .
}

Can return true as we can look into the IRI string to see it is a node.

SELECT
?sequence
WHERE
{
node:25 rdf:value ?sequence .
}

Can be mapped to odgi.get_handle on which we can ask for the sequence string.

Then the engine can do a classic translation to sequence length by just calling the python method.
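
The IRI-based shortcut can be sketched in a few lines. A plain dict stands in for the Odgi graph and for odgi.get_handle; the function names and the regular expression are hypothetical and only follow the {SOMEBASE}/node/{ID} shape described above.

```python
import re

NODE_IRI = re.compile(r"^.+/node/(\d+)$")


def node_id_from_iri(iri):
    """Extract the numeric node id from a {SOMEBASE}/node/{ID} IRI, or None."""
    match = NODE_IRI.match(iri)
    return int(match.group(1)) if match else None


def ask_is_node(iri):
    """Answer 'ASK { <iri> a vg:Node }' by looking at the IRI string alone."""
    return node_id_from_iri(iri) is not None


def sequence_length(sequences, iri):
    """Look the node up by id and return the length of its sequence."""
    node_id = node_id_from_iri(iri)
    if node_id is None or node_id not in sequences:
        return None
    return len(sequences[node_id])


sequences = {25: "CAAATAAG"}  # toy graph: node id -> sequence
print(ask_is_node("http://example.org/node/25"))
print(sequence_length(sequences, "http://example.org/node/25"))
```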

SPARQL endpoint over http?

The current code is command line only. We would need an rdflib based sparql endpoint to make these pangenomes available on the semantic web.
