geneontology / go-shapes
Schema for Gene Ontology Causal Activity Models defined using RDF Shapes
In working on some models for defense responses in C. elegans, I'm not clear on how to model the action of the pathogen, e.g. a Gram-negative bacterium, in the pathway.
Should the pathogen be an input to some type of host receptor function or should an MF enabled by some gp of the pathogen have a causal relation to the receptor function? Or something else?
The current state of knowledge is incomplete for the pathway we're modeling, but this bears on what is considered a valid entry for 'has input' and how to represent an MF for a pathogen gp that may not have evolved for the function modeled in this process.
Right now, our schema definitions allow for the addition of relations not specified in the schema for our Shapes. Here is an example (will disappear http://noctua-dev.berkeleybop.org/editor/graph/gomodel:5e28fcb400000000 )
This is an example @vanaukenk added as a positive test for a new <AnatomicalStructureDevelopment> shape, which allows the 'results_in_development_of' relation to an AnatomicalEntity. That shape is not present in the schema being used to test the file in the picture, yet the Tissue Development node validates just fine as a Biological Process. This is because the shapes are open by default and allow new relations. (See explanation of closed versus open).
If we leave them open, it is important to update the ShapeMap as new shapes come in; otherwise models may appear to validate correctly when they should not. For example, here the Tissue Development node should be tested against the AnatomicalStructureDevelopment shape once all of the changes are complete.
Is it desirable to allow new relations, not in the spec for a shape, or do we want to close them such that a new relation would result in a schema validation error?
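To make the open-versus-closed question concrete, here is a minimal Python sketch of the difference. It is not how any ShEx library is implemented; the `validate_node` helper and the predicate sets are hypothetical, and the "shape" is reduced to a plain set of allowed predicates.

```python
from typing import List, Set


def validate_node(node_predicates: Set[str], shape_predicates: Set[str], closed: bool) -> List[str]:
    """Return a list of violations; an empty list means the node conforms.

    An open shape ignores predicates it does not mention.
    A closed shape reports every predicate it does not mention.
    """
    extras = sorted(node_predicates - shape_predicates)
    if closed:
        return [f"unexpected predicate: {p}" for p in extras]
    return []


# Hypothetical BiologicalProcess shape that knows nothing about
# results_in_development_of (assumption for illustration only).
BIOLOGICAL_PROCESS = {"rdf:type", "part_of", "occurs_in"}
tissue_development = {"rdf:type", "part_of", "results_in_development_of"}

open_result = validate_node(tissue_development, BIOLOGICAL_PROCESS, closed=False)
closed_result = validate_node(tissue_development, BIOLOGICAL_PROCESS, closed=True)
print(open_result)
print(closed_result)
```

With the open shape the extra relation passes silently, which is the behavior described above; with the closed shape it is reported as an error.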
@vanaukenk @balhoff @cmungall @pgaudet @ukemi @kltm
(lower priority / question)
do we want to do things like
<BiologicalProcess> {
a <BiologicalProcessClass> ;
...
}
<Cell> {
a <CellClass> ;
..
}
...
<CellClass> {
// constraint URI to be CL or FAO or PO
}
This is maybe a bit circular with the tagger but seems like it would be useful to be able to be explicit about what kinds of classes are expected as the targets of rdf:type
Currently happens_during is restricted to MF.
I think it makes more sense to have this at the BP level, or at least to allow it there.
The MF to MF relation:
provides_direct_input_for: @<MolecularFunction> {0,1};
should only apply to a subset of MF terms, namely MF GO:0003824 'catalytic activity' to MF GO:0003824 'catalytic activity'.
How can we denote this in the specs?
We may need to look at existing annotations and if/how this information has been added by curators, but we need to decide if we are going to mirror the location information, e.g. CC, cell, anatomical structure, organism, currently linked to MFs to BPs as well.
For the CC only version of the form, as well as the MOD imports, we will need to specify that a Gene, Complex, or Protein can be 'located in' a Cellular Component:
<Gene> @<MolecularEntity> AND {
  bl:category [GoGene:] ;
  located_in: @<CellularComponent> {0,1};
} // rdfs:comment "a gene (a piece of DNA with a purpose)"
<Complex> @<MolecularEntity> AND {
  bl:category [GoComplex:] ;
  located_in: @<CellularComponent> {0,1};
} // rdfs:comment "a protein complex"
<Protein> @<MolecularEntity> AND {
  bl:category [GoProtein:] ;
  located_in: @<CellularComponent> {0,1};
} // rdfs:comment "a protein"
Following on from discussions about what relations to allow in GO-CAMs and ShEx:
We need to specify the domain and range constraints for use of specific relations with specific GO terms, e.g. cell morphogenesis 'results in morphogenesis of' motor neuron.
This ticket will be used to draft this list for further discussion on the GO-CAM specifications call.
Looking at annotation extensions for MF annotations, curators have made 'part of' relations between MFs. These annotations have also been made in GO-CAM models.
Do we want to allow this?
If so, we need to add it to the ShEx specs.
For a given Molecular Function, do we want to check that the values of 'has input' and 'has output' are different?
Are there cases where that wouldn't be true, either biologically or because the IDs available for 'has input' and 'has output' aren't specific to, say, modified forms of a protein (in which case I'm not sure we'd want to capture this information at all).
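As a rough illustration of what such a check would look like, here is a small Python sketch that reports identifiers appearing as both 'has input' and 'has output' of one function. The helper name and the example ID lists are assumptions for illustration; only UniProtKB:P04202 is taken from elsewhere in this tracker.

```python
from typing import Iterable, Set


def shared_participants(has_input: Iterable[str], has_output: Iterable[str]) -> Set[str]:
    """Identifiers used as both an input and an output of the same function."""
    return set(has_input) & set(has_output)


# Illustrative values: the same protein ID appears on both sides because
# the identifier does not distinguish a modified form from the unmodified one.
inputs = ["CHEBI:15422", "UniProtKB:P04202"]
outputs = ["CHEBI:16761", "UniProtKB:P04202"]

overlap = shared_participants(inputs, outputs)
print(overlap)
```

A non-empty result would then be either a warning or an error, depending on how the question above is answered.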
We should remove files that are not really needed (history will remain in github)
This came up on the workbenches call today.
Do we need to specify in the ShEx how negation will be handled?
Currently we have:
<Gene> @<MolecularEntity> AND {
  bl:category [GoGene:] ;
} // rdfs:comment "a gene (a piece of DNA with a purpose)"
But some groups annotate to RNAs using RNA central IDs. We need to make sure that these are included. Are they categorized as genes?
RNAcentral URS0000183BED_9606 URS0000183BED_9606 GO:0005615 PMID:26646931 HDA C Homo sapiens microRNA 23b (MIR23B), miRNA miRNA NCBITaxon:9606 20180319 BHF-UCL UBERON:0001969
Placeholder ticket to remind us to update these specs wrt relations between MF and BP when MF becomes a BP.
The schema contains an unused shape, MacromolecularMachine. Currently it is restricting @<Protein> OR @<Complex>. Elsewhere we are frequently specifying @<Complex> OR @<InformationBiomacromolecule>. Should we:
MacromolecularMachine to restrict @<Complex> OR @<InformationBiomacromolecule> and make use of it, or
MacromolecularMachine
We need to assemble a collection of go-cam models (as ttl files) that correctly validate and another collection that fails (for predictable reasons) so that we can validate the validation code.
Anyone that has been working on developing the specification document would be a good candidate for helping to build these exemplars. Ping @lpalbou
Currently we edit in Noctua, copy the file across to this repo, and annotate it with the reason for failure. This is a bit clunky and not ideal from a software engineering / unit test perspective.
We should continue to do this but supplement the test suite with hand-authored minimal examples of passes and failures. We may want a lightweight notation for go-cams.
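One possible shape for such a harness, sketched in Python: walk a should_pass folder and a should_fail folder and report any model whose outcome contradicts the folder it sits in. The `check_suite` function and the `stand_in_validator` are hypothetical; a real run would plug in the actual ShEx validator in place of the stand-in, which here only looks for the word "evidence" in the file.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def check_suite(root: Path, validate: Callable[[Path], bool]) -> List[Tuple[str, str]]:
    """Return (file, problem) pairs for models whose result contradicts their folder."""
    problems = []
    for folder, expected in (("should_pass", True), ("should_fail", False)):
        for ttl in sorted((root / folder).glob("*.ttl")):
            if validate(ttl) != expected:
                problem = "unexpected fail" if expected else "unexpected pass"
                problems.append((f"{folder}/{ttl.name}", problem))
    return problems


def stand_in_validator(ttl: Path) -> bool:
    # Placeholder for a real validator call (assumption for illustration).
    return "evidence" in ttl.read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("should_pass", "should_fail"):
        (root / folder).mkdir()
    (root / "should_pass" / "ok.ttl").write_text("# has evidence\n")
    (root / "should_fail" / "no_ev.ttl").write_text("# nothing here\n")
    (root / "should_fail" / "misfiled.ttl").write_text("# has evidence\n")
    problems = check_suite(root, stand_in_validator)

print(problems)
```

Keeping the expected outcome in the folder name means hand-authored minimal examples and Noctua exports can live side by side without extra metadata.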
In branch:
https://github.com/geneontology/go-shapes/tree/dustine32-test-has_input
I have this gocamgen-generated test TTL file with a single assertion individual:
This model fails both java and python validators despite appearing to follow the ShEx spec:
Protein <-enabled_by- MolecularFunction -has_input-> MolecularEntity
Here's the python validator output:
File: ../test_ttl/go_cams/should_pass/WB_WBGene00000903_partial.ttl Success: False PASS: 4 FAIL: 1
FAIL: http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408 SHAPE: http://purl.obolibrary.org/obo/go/shapes/MolecularFunction REASON: Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Triples:
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type obo:GO_0005160 .
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type owl:NamedIndividual .
2 triples exceeds max {1,1}
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
Triples:
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type UniProtKB:P04202 .
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type owl:NamedIndividual .
2 triples exceeds max {1,1}
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
Testing UniProtKB:P04202 against shape http://purl.obolibrary.org/obo/go/shapes/OwlClass
No matching triples found for predicate rdf:type
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
Triples:
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type UniProtKB:P04202 .
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> rdf:type owl:NamedIndividual .
2 triples exceeds max {1,1}
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/0b95360f-8394-4a17-9687-004a234b64fa> against shape N6baf29f3024240789b58b9a33f3380f4
No matching triples found for predicate rdf:type
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Node kind mismatch have: URIRef expected: bnode
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
Triples:
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type obo:GO_0005160 .
<http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> rdf:type owl:NamedIndividual .
2 triples exceeds max {1,1}
Testing <http://model.geneontology.org/0b6b7849-258c-4445-892e-e480610c63fd/09f78549-aae3-4e12-be2f-e27aa601a408> against shape N1b900799bad94646b0f0bdc18dea6b82
No matching triples found for predicate rdf:type
Final report >> all files successful: False
What's strange here is that I get a "2 triples exceeds max {1,1}" cardinality violation for predicate rdf:type when I'm using this predicate for something that seems so fundamental to our models: "X is an Individual" and "X is of class Y".
Running the validator against the rest of the WB:WBGene00000903 model with this assertion individual removed (so that only simple GP->term assertions remain), I get a PASS result. So I think this would indicate that my general OWL syntax in these models is OK; I'm guessing it's this has_input relation that's causing problems.
@balhoff @goodb Are you able to spot anything here that I can change to get it to pass?
Thanks!
Should we add the <ProteinContainingComplex> part_of: @<AnatomicalEntity> relation to go-cam-shapes.shex?
I found this connection missing when attempting to translate this GPAD annotation with gocamgen:
MGI MGI:103013 part_of GO:0002095 MGI:MGI:3628972|PMID:16648270 ECO:0000314 20070117 MGI part_of(EMAPA:16105),part_of(CL:0000187)
Specifically, trying to create GO:0002095 part_of(CL:0000187): "caveolar macromolecular signaling complex" part_of "muscle cell"
@cmungall I may just be misunderstanding how this works, but I am running into files that cause problems for shaclex but seem to pass python, e.g. should_fail/fail_no_evidence_4.ttl.
But when I run make test in python, I don't see any output about this file.
See geneontology/pathways2GO#69 (comment)
The expansion of the shex schema appears to be severely slowing down the shex validator as currently implemented.
The Reactome models are very large compared to the vast majority of other models we are dealing with right now (thinking of the genome imports), but it's still concerningly slow...
When we apply the shapes to a go_cam model, we need to formalize what the code should be providing in response. The shex libraries provide a mapping of the RDF nodes in the model to the labels of the shapes in the provided schema. This alone seems insufficient for users. I'm thinking of a response that would require some additional logic, something that contained additional elements like:
On computing model-level validity, I'm thinking something like:
For each named individual in the model:
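A minimal Python sketch of what such a response object could look like. Everything here is an assumption for illustration: the class names, the fields, and in particular the rule that a model is conformant only if every tested individual conforms to its shape.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeResult:
    node: str          # IRI of the named individual
    shape: str         # label of the shape it was tested against
    conforms: bool
    reason: str = ""   # human-readable explanation when it fails


@dataclass
class ModelReport:
    model_id: str
    results: List[NodeResult] = field(default_factory=list)

    @property
    def conformant(self) -> bool:
        # Assumed rule: one failing individual makes the whole model fail.
        return all(r.conforms for r in self.results)

    def failures(self) -> List[NodeResult]:
        return [r for r in self.results if not r.conforms]


# Hypothetical model and node identifiers.
report = ModelReport("gomodel:example")
report.results.append(NodeResult("gomodel:example/ind1", "BiologicalProcess", True))
report.results.append(
    NodeResult("gomodel:example/ind2", "MolecularFunction", False, "enabled_by target is not a MolecularEntity")
)

print(report.conformant)
for failure in report.failures():
    print(failure.node, failure.shape, failure.reason)
```

The point is that the raw node-to-shape mapping from the ShEx libraries becomes one input to a report that also carries a model-level verdict and per-node reasons.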
We have talked about this on calls, but I'm not sure what the final decision was.
For location information, curators may not always be able to annotate the full hierarchy of CC -> CC -> Cell -> Gross Anatomical Entity -> Organism (-> denotes 'part of').
We need to decide how we'll handle cases where the curator might only be able to annotate, for example, a Cell for location.
If a curator only adds a value for a node after the first CC, will each preceding node need to be filled in with a root node and, if so, should the tools fill this in automatically?
What should the GPAD export include?
RNA
Protein complex
ext
E.g.
s/<BP>/BiologicalProcess/
Also we should decide on BASE, e.g. $OBO/go/shapes/
@ AND {
bl:category [GoAnatomicalEntity:] ;
}// rdfs:comment "a chemical entity"
should be
@ AND {
bl:category [GoAnatomicalEntity:] ;
}// rdfs:comment "an anatomical entity"
It looks like the correct label for RO:0002413 is 'directly provides input for'.
May partially replace #22
This ticket is for a simple renaming of the MolecularFunction shape to MolecularActivity, to be consistent with the paper
Note this creates a disconnect between the ontology nomenclature and the GO-CAM nomenclature
provides_direct_input_for:
For http://purl.obolibrary.org/obo/RO_0002413 we have label 'provides_direct_input_for'
But in RO and in Noctua it is 'directly provides input for (process to process)'
process to process part of the term label. Seems like something should be changed somewhere ... ?
Thanks, Pascale
Do we need to include in the shex where evidence is added and what minimally constitutes evidence for an assertion?
Idea is to build a simple bit of code that can take a .ttl go-cam file and .shex shape file and produce a validation report. I plan to implement this in Java so it could eventually be plugged back into minerva.
For the time being no need to have all of minerva present to run these tests and figure out shex. Online validators, e.g. http://shexjava.lille.inria.fr/demonstrator , are great but we need to get an idea of how to handle the validation report as well.
In the current version of the Noctua form, as well as for the MOD imports, we have 'BP only' annotations modeled as:
gp <- enabled_by <- root MF -> causally upstream of or within (or child) -> BP -> occurs_in -> Cell -> part_of -> Anatomy
We need to review this to make sure this is what we still want (I think we'll want to review the occurs_in hierarchy) and that the ShEx allows this.
Rather than OBO
e.g. s/molecular_function:/bl:MolecularActivity/
May require syncing with tagger code
The goal of this ticket is for others to make as much progress as possible on this project while I am out of email contact the rest of July, and Ben is out. Other projects may end up taking priority, but if there are cycles to work on this project here are tasks
I think we should add a top level shape map that any of the processors (Python, Java, Scala) could use to make model checking more consistent. I'm not exactly sure what shape map is being used inside the Python tester right now.
We will still need to inject subClassOf axioms before running, but then I think we could use a single shape map across applications.
For the Noctua form (and presumably for other reasons), it would be helpful to declare in the specs what constitutes an 'activity unit'.
Could we replace <MolecularFunction> below with <ActivityUnit>?
<MolecularFunction> @<TypedMolecularFunction> AND {
  enabled_by: @<MolecularEntity> {0,1};
  part_of: @<BiologicalProcess> {0,1};
  occurs_in: @<CellularComponent> {0,1};
  has_output: @<MolecularEntity> *;
  has_input: @<MolecularEntity> *;
  provides_direct_input_for: @<MolecularFunction> {0,1};
  # adding the following seems to break both the js and java shex implementations
  # for: http://noctua-dev.berkeleybop.org/download/gomodel:R-HSA-156582/owl
  directly_positively_regulates: @<MolecularFunction> *;
} // rdfs:comment "A molecular function"
To test any PRs for syntax errors. Ideally it would also test against a test suite of known passes and known failures.
Everything currently depends on having access to the subclass closure linking instance nodes in the input models to the root classes defined in the shapemap file. These must be added to the RDF prior to applying the shex patterns. Currently the scala subclass enricher assumes everything needed is in the rdf.go.org endpoint but this is not currently the case for the reactome models as they depend on an additional ontology.
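The dependency on the subclass closure can be sketched in a few lines of Python. This is not the Scala enricher; the `ancestors` and `target_shapes` helpers, the toy hierarchy, and the `EX:` identifiers are hypothetical. Only the two GO root IDs are real.

```python
from typing import Dict, Set


def ancestors(cls: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All classes reachable from cls through subClassOf, including cls itself."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def target_shapes(cls: str, parents: Dict[str, Set[str]], shape_map: Dict[str, str]) -> Set[str]:
    """Shapes whose root class appears in the subclass closure of cls."""
    return {shape_map[a] for a in ancestors(cls, parents) if a in shape_map}


# Root classes as they might appear in a shape map.
SHAPE_MAP = {"GO:0003674": "MolecularFunction", "GO:0008150": "BiologicalProcess"}

# Toy subClassOf edges (made up); EX:reactome_only_class has none because its
# ontology was never loaded, mirroring the Reactome case described above.
PARENTS = {
    "EX:kinase_activity": {"EX:catalytic_activity"},
    "EX:catalytic_activity": {"GO:0003674"},
}

known = target_shapes("EX:kinase_activity", PARENTS, SHAPE_MAP)
unknown = target_shapes("EX:reactome_only_class", PARENTS, SHAPE_MAP)
print(known)
print(unknown)
```

When the closure is missing, the instance simply matches no shape, so it is never tested rather than reported as failing.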
This is most likely an edge case. Jotting it down here to keep in mind and to consider when implementing tools for e.g. pipeline or other environments.
From discussions on 2019-07-10, there are additional relations to add to MF:
PREFIX GoBiologicalPhase: <http://purl.obolibrary.org/obo/GO_0044848>
happens_during: @<BiologicalPhase> *;
regulates: @<MolecularFunction> {0,1};
negatively_regulates: @<MolecularFunction> {0,1};
positively_regulates: @<MolecularFunction> {0,1};
directly_regulates: @<MolecularFunction> {0,1};
directly_negatively_regulates: @<MolecularFunction> {0,1};
directly_positively_regulates: @<MolecularFunction> {0,1};
causally_upstream_of_or_within: @<BiologicalProcess> {0,1};
causally_upstream_of_or_within, negative effect: @<BiologicalProcess> {0,1};
causally_upstream_of_or_within, positive effect: @<BiologicalProcess> {0,1};
causally_upstream_of: @<BiologicalProcess> {0,1};
causally_upstream_of, negative effect: @<BiologicalProcess> {0,1};
causally_upstream_of, positive effect: @<BiologicalProcess> {0,1};
Add protein-containing complex as a value for:
has_input
has_output
transports_or_maintains_localization_of
at least for now.
According to @pgaudet these should be moveable to should_pass.
I am combining tickets #26 #31 #50 into this one, new ticket.
Models fail validation, I believe, because the entity types do not match what is in the ShEx.
In the gpi 2.0 specs, we propose using SO terms to define entity types, although we know that this doesn't cover protein-containing complexes and we need a GO term for that.
I am wondering if we need to add to, or replace, ChEBI's information biomacromolecule in the ShEx with SO term(s).
See also: geneontology/go-annotation#2740
It's not clear that we should use Gene in the model (except in cases of target of transcription etc).
Although we use gene IDs in the models, the IRIs are uncommitted as to what kind of entity they represent, and in neo we just assert them to be information macromolecules. In fact a gene is not actually the object of enabled_by; it's really the product of that gene.
See also gene LocatedIn cell component - the gene is always located on the chromosome...
e.g.
<BP> {
a . *;
contributor: . *;
date: . *;
providedBy: . *;
xref: . *;
rdfs:label . *;
exactMatch: . *;
rdfs:comment . *;
bl:category [biological_process:] ;
part_of: @<BP> *;
} // rdfs:comment "A biological process"
Most of these can be inherited from an Entity base shape.
label should be {0,1}. "Normal" entities don't have individual-level labels, but these are useful for Reactome. They should never have more than one.
Not sure why xref and exactMatch are in there?
<CellularComponent> @<AnatomicalEntity> AND {
  bl:category [GoCellularComponent:] ;
  part_of: @<CellularComponent> {0,1};
  part_of: @<Cell> {0,1};
  # allow direct pass-through for single-cell organisms
  part_of: @ {0,1};
} // rdfs:comment "a cellular component"
I originally encoded these as {0,1}, but of course one MA can have the same relation R to two downstream MAs; see test3 for example.
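The effect of the two cardinalities can be shown with a small Python sketch. It is a simplified stand-in for what a ShEx validator does, counting triples per predicate only; the `cardinality_violations` helper and the MA1/MA2/MA3 node names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
# predicate -> (min, max); a max of None stands for '*'
Limits = Dict[str, Tuple[int, Optional[int]]]


def cardinality_violations(triples: Iterable[Triple], subject: str, limits: Limits) -> List[str]:
    """Predicates on `subject` whose triple count falls outside the allowed range."""
    counts = Counter(p for s, p, _ in triples if s == subject)
    violations = []
    for predicate, (low, high) in limits.items():
        n = counts.get(predicate, 0)
        if n < low or (high is not None and n > high):
            violations.append(predicate)
    return violations


# One activity with the same relation to two downstream activities.
model = [
    ("MA1", "directly_positively_regulates", "MA2"),
    ("MA1", "directly_positively_regulates", "MA3"),
]

at_most_one = cardinality_violations(model, "MA1", {"directly_positively_regulates": (0, 1)})
unbounded = cardinality_violations(model, "MA1", {"directly_positively_regulates": (0, None)})
print(at_most_one)
print(unbounded)
```

Under {0,1} the second downstream activity triggers a violation; under * the same model passes.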
From 2019-08-07 call, here are the suggested updates to the Molecular Function shape.
@balhoff @cmungall @thomaspd @pgaudet - please review
enabled_by: ( @<ProteinContainingComplex> OR @<InformationBiomacromolecule> ) {0,1};
occurs_in: ( @<AnatomicalEntity> ) {0,1};
##QUESTION ABOUT ADDING AN OR STATEMENT HERE FOR PROTEIN-CONTAINING COMPLEX, GIVEN THE DEFINITION OF ANATOMICAL ENTITY IN CARO
https://www.ebi.ac.uk/ols/ontologies/caro/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000
has_output: ( @<ChemicalEntity> OR @<ProteinContainingComplex> ) *;
has_input: ( @<ChemicalEntity> OR @<ProteinContainingComplex> ) *;
happens_during: ( @<BiologicalPhase> OR @<LifeCycleStage> OR @<PlantStructureDevelopmentStage> ) *;
causally_upstream_of_or_within: ( @<BiologicalProcess> ) *;
causally_upstream_of: ( @<BiologicalProcess> OR @<MolecularFunction> ) *;
causally_upstream_of_negative_effect: ( @<BiologicalProcess> OR @<MolecularFunction> ) *;
causally_upstream_of_positive_effect: ( @<BiologicalProcess> OR @<MolecularFunction> ) *;
Add to PREFIX:
PREFIX GoLifeCycleStage: <http://purl.obolibrary.org/obo/UBERON_0000105>
PREFIX GoPlantStructureDevelopmentStage: <http://purl.obolibrary.org/obo/PO_0009012>
##CORRECT TERM FROM PO?
I was looking at this shex.js repo
Is there a shex to json converter? This would be awesome. The main use case is for Noctua Form to chunk the model. The four main chunks are
Since the beginning, Noctua Form has used a static JSON-like structure, but it cannot handle this anymore :(
Even a manual converter would help.
My original thinking was that the shex validator would end up merged into the main Minerva code and become part of the Minerva deployment. But, it is convenient to have this repository to isolate issue tracking - and to encourage the development of shex-oriented tooling in other languages.
Thoughts @balhoff ?