biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology

Home Page: https://biolink.github.io/biolink-model/

License: Other

Makefile 0.34% Python 98.14% Shell 0.01% Jinja 0.76% HTML 0.68% CSS 0.07%

biolink monarchinitiative gene-ontology datamodel standard specification yaml json-api owl ncats-translator

biolink-model's Issues

improve docs on how variants and alleles are modeled

will need help of @mbrush

we need to be capturing everything that is here https://github.com/monarch-initiative/ingest-artifacts/blob/master/sources/ClinVar/Update1-genomic_positions/ClinvarXML_20170308.cmap

Please add biolink URI to biolink-model.yaml

Align upper level classes with identifiers.org types

See for example http://identifiers.org/registry?query=pharmgkb

pathways
gene
drug
disease

http://identifiers.org/registry?query=hgnc

family
genefamily (what is the difference between this and the above?)
symbol

Add evidence slot to the top level 'association'

At present the slots tied to the top level association type in the BLM include the following:

-  slots:
    - association type
    - subject
    - negated
    - relation
    - object
    - qualifiers
    - publications
    - provided by

Seems like a slot to capture evidence should also be included here. There is a has_evidence predicate in the blm, but it is not clear if this is meant to capture ECO codes, or actual evidence data, or both. The range of the has_evidence slot is 'evidence instance' - and I'm not sure what this is. For now probably simplest to allow the has_evidence slot to capture ECO codes or actual evidence. Thoughts @cmungall? If agreed, I can update the documentation in the BLM yaml.

Map entity and association types to NCIT

cc @dexterpratt

We have already made a start on this with the association types

clarify representation of publications on edges

clarify also: no comma-separation, lists first class

show mandated prefixes for translator

From Eric D in gdoc

id [required]: MUST be a CURIE, MUST use translator-mandated prefix <- where is the list of these

currently in yaml, need to better expose

Not all UMLS Semantic Groups are mapped

DISO (Disorders), OCCU (Occupations), LIVB (Living Beings), and ORGA (Organizations)

Disorders is an especially common one, so it should be mapped.

Fix travis

this may be a false positive: #18

Mappings for monarch to biolink model

@cmungall There are categories in Monarch (https://biolink-kb.ncats.io/types) that aren't in the biolink model, for example "variant" and "phenotype".

How should these be handled?

biological process and molecular activity

BLM contains concepts for "biological process" and "molecular activity". It also contains entries that are unions of other entries. Can we add an entry that is "biological process and molecular activity"? We have found that grouping those two concepts together is useful in building COPs. The biolink function returning "function" returns both of these types, so it would also be useful in annotating that service.

Derived file: matrix of entity types and how they connect

From the subclasses of association, produce a matrix whose rows and columns are the subject and object types, and whose cell values are the predicates that connect them
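A minimal sketch of how such a matrix could be derived, assuming each association subclass has already been reduced to a (subject type, predicate, object type) triple. The triples, the function names and the TSV layout below are illustrative assumptions, not read from biolink-model.yaml.

```python
from collections import defaultdict

# Hypothetical (subject type, predicate, object type) triples. In practice
# these would be extracted from the subclasses of association in the yaml.
ASSOCIATIONS = [
    ("gene", "expressed_in", "anatomical entity"),
    ("gene", "has_phenotype", "phenotypic feature"),
    ("gene", "contributes_to", "disease"),
    ("chemical substance", "treats", "disease"),
    ("chemical substance", "affects", "gene"),
]


def build_matrix(triples):
    """Return {subject type: {object type: sorted list of predicates}}."""
    cells = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        cells[subject][obj].add(predicate)
    return {
        subject: {obj: sorted(preds) for obj, preds in row.items()}
        for subject, row in cells.items()
    }


def to_tsv(matrix):
    """Render the matrix as TSV: subject types as rows, object types as columns."""
    objects = sorted({obj for row in matrix.values() for obj in row})
    lines = ["\t".join(["subject"] + objects)]
    for subject in sorted(matrix):
        row = [",".join(matrix[subject].get(obj, [])) for obj in objects]
        lines.append("\t".join([subject] + row))
    return "\n".join(lines)


matrix = build_matrix(ASSOCIATIONS)
print(to_tsv(matrix))
```

Empty cells in the output mark type pairs for which no association subclass exists, which is probably the most useful thing the derived file would show.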

Add 'metabolite' to biolink model

We should have "metabolite", in addition to "drug" as a child of "chemical substance".

align with bio2rdf types

UPHENO mapping for phenotypic feature

Phenotypic feature is currently mapped to UPHENO:0000001, but I think that's a mistake. UPHENO:0000001 is a property "has phenotype affecting". Probably the desired mapping is UPHENO:0001001

Questions about provided_by and is_defined_by slots

Creating ticket here to continue discussion in comments of the TKG Spec here, concerning labels of the following edge properties:

is_defined_by [required]: A CURIE/URI for the translator group that made the KG
provided_by [required]: A CURIE prefix, e.g. Pharos, MGI, Monarch. The group that curated/asserted the edge.

Comment in the gdoc was to consider names for these properties again - e.g. switching the labels so that is_defined_by describes the primary source that originally defined or asserted the claim, and provided_by describes the Translator group that provided the association to the uber graph.

A follow up comment asked if it made any sense to make these more fine-grained? We have an edge property "source" where we store the specific function in our code that produced the edge. But I could imagine putting a curie prefix on that so that is_defined_by becomes something like GAMMA:uberongraph.get_anatomy_by_cell_graph. Ditto for provided_by. Currently we have a property called URL where we put the url we retrieved the info from (for url-derived edges)

And a final comment offered: say you have a relation provided by DrugBank, which in turn obtained it from a publication. Who should be credited in the provided_by field?

Clarify predicate v relation distinction and change field name to 'edge label'

In Translator we have the notion of minimal and maximal predicates. We had previously mapped minimal to 'predicate' and maximal to 'relation' but this is confusing.

Proposal is to keep relation as the true relationship type, at arbitrary specificity, using a CURIE if available.

For the 'min predicate', use edge_label. This is analogous to node labels in Neo4J. It is a human-readable snake_case grouping for the relation.

Located in / Part of Predicates

Can we add located in and part of ?

located in: http://purl.obolibrary.org/obo/RO_0001025
part of: http://purl.obolibrary.org/obo/BFO_0000050

I'm not quite sure how these should be structured, i.e. how they are related to each other and to the other existing predicates in biolink (such as coexists with).

Add node properties for chemicals, e.g. inchikey

From @stuppie in gdoc "What about string or numerical properties (such as the molecular weight or INCHIKEY for a compound)?"

is there a standard rdf pred for this?

Clarify relationship with / map to DATS

https://github.com/biocaddie/WG3-MetadataSpecifications

Broadly speaking, these two are orthogonal and complementary. There are some linkage points, e.g

But the focus of DATS is on the intrinsic properties of the entities rather than linkages between the entities. We should write documentation that clarifies this, and map entity types where required.

Also: clarify relationship to bioschemas (#3)

Align with semantic types and association types used by reasoner groups

cc @dkoslicki

Map to MSO

https://github.com/The-Sequence-Ontology/MSO

Request for 'capable of' predicate for anatomy - GO MF/BP associations

Documenting a predicate request from Steve Ramsey (Gamma):

Can we please get a “capable_of” predicate added to the Minimum Predicate Hierarchy? This is for connecting UBERON terms to GO terms. See this page for details:
https://github.com/obophenotype/uberon/wiki/inter-anatomy-ontology-bridge-ontologies

Are the associations in the BioLink model going to distinguish experimentally validated results from ML prediction results?

Hi,

I'm wondering whether the associations in the BioLink model are going to distinguish experimentally validated results from computationally predicted results.

One example would be the 'ChemicalToGeneAssociation' (http://bioentity.io/vocab/ChemicalToGeneAssociation).

There are two available API endpoints:

They both contain information about chemical2gene associations, but the first one is from computational predicted results, while the second one is from experimentally validated results.

Is the BioLink model going to regard these two cases as the same association or not?

Thanks!

Annotate associations with Wikidata property URIs

http://tinyurl.com/yactqzzm

Integrate use cases into yaml documentation

Define standard serializations

todo: document this. Notes below

The biolink model is intended to be independent of any one serialization format or database technology. This adds an extra layer of abstraction when using it for data exchange.

There are a few orthogonal choices here for exchanging links/associations

Use the generic Association class or a subclass, e.g. G2T
JSON or RDF/graph format
Which evidence model to mix in

For RDF exchange, the reality is that there are multiple reification standards. The reference one for us is OBAN, so we could define this as core and provide RDF shapes to check this.

But for many users JSON is most convenient, so the generic association class in the json-schema is best for general exchange.

Export to ShEx

ShEx provides a mechanism for encoding closed world constraints on RDF graphs. This is a natural fit for the metamodel we are using here

For an overview, see http://book.validatingrdf.com/bookHtml010.html

Semantic Type for Reactome Complex

Hi,

This question specifically regards the BioLink semantic types. The example is a Reactome complex, e.g. R-HSA-5674003. A Reactome complex might be a combination of proteins, chemical compounds, etc. How would the BioLink model assign semantic types to biological entities like these, which could be a mixture of several different kinds of entity?

Thanks!

Annotate semantic types with Wikidata QIDs

cc @stuppie

Map to UMLS semantic types

Autogenerated schema class definitions out of order

In schema.py some classes are referenced before they are declared. The current file requires some manual rearranging.

New predicate proposals for Translator Min Predicate Set

Alignment of additional knowledge sources (beyond the original 5 reasoner KGs that informed the initial iteration of the ~40 predicate set here) has suggested ~20 additional predicates to add.

We would add these to the biolink-model.yaml file, alongside the initial set of predicates that have already been added. Here, predicates that are part of the minimal Translator set will be flagged using the 'subset' slot with the value "translator_minimal". This will allow consumers of the yaml to find the set of slots in this standard, and also enable derivation of a biolink-github.io web page that presents the hierarchy of only these predicates (like this one for blm types).
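A rough sketch of how a consumer might pull out the flagged slots. It assumes the slot definitions have already been loaded into a dict and that 'subset' holds a list of subset names; both the key layout and the sample entries are assumptions for illustration, not the actual contents of biolink-model.yaml.

```python
# Hypothetical excerpt of slot definitions, written as a Python dict so the
# example needs no YAML parser. The key names ("is_a", "subset") are assumed.
SLOTS = {
    "interacts_with": {"subset": ["translator_minimal"]},
    "affects": {"subset": ["translator_minimal"]},
    "regulates": {"is_a": "affects", "subset": ["translator_minimal"]},
    "publications": {},
    "provided by": {},
}


def slots_in_subset(slots, subset_name):
    """Return the sorted names of all slots flagged with the given subset."""
    return sorted(
        name
        for name, spec in slots.items()
        if subset_name in spec.get("subset", [])
    )


minimal = slots_in_subset(SLOTS, "translator_minimal")
print(minimal)
```

The same filter, combined with the is_a links, would be enough to render a hierarchy page restricted to the minimal predicate set.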

The hierarchy below presents the proposed new predicates (in bold) in the context of the hierarchy of predicates in the existing minimal set. Parentheticals explain the meaning and/or source requiring each new term.

interacts_with (grouping term for interaction predicates)
- directly_interacts_with
  - molecularly_interacts_with
- genetically_interacts_with (gene - gene, for BioGrid via Monarch)
coexists_with
- co-localizes_with (gene/product -gene/product, for QuickGO via Gamma)
- in_pathway_with
- in_complex_with
- in_cell_population_with
affects
- regulates
  - positively_regulates
  - negatively_regulates
- has_affected_sequence_feature
- disrupts
- treats
participates_in
- input_of
- output_of
has_participant
- has_input
- has_output
overlaps (new for Monarch - make parent of part of and has part)
- part_of
- has_part
is_homologous_to
- is_paralogous_to (Monarch)
- is_orthologous_to (Monarch)
- is_xenologous_to (Monarch)
affects_risk_for
- prevents
contributes_to
- causes
is_correlated_with
- has_biomarker
- is_biomarker_for
expressed_in
expresses (anatomy to gene, inverse of expressed_in, from HetNet via Gamma)
occurs_in (GO/QuickGO)
is_located_in (Wikidata)
is_location_of (SemMedDB)
is_model_of (for Monarch / MODs)
derives_from (for Monarch)
produces (for WD - between producing entity/agent and the product or material produced)
enables (for Monarch/GO, and WD/GO)
same_as (for WD exact match, and owl same_as in Monarch, etc)
in_taxon (Monarch, Wikidata)
has_gene_product
has_phenotype
manifestation_of
treated_by
precedes
derives_into
subclass_of

If there are questions about meaning/utility/name of any of the new proposed predicates, make comment here, or create new ticket if you anticipate prolonged debate.

Also, note that predicates for gene-disease associations are not included here - and are addressed in the ticket #52.

Why not identifiers?

My understanding is that the BLM will not have URI or curie style identifiers for its elements. Can somebody explain why? The advantage of having identifiers in my mind is that we're no longer stuck with particular labels. If we decide that "molecular activity" should be called "molecular function" to bring it in line with GO, then we can do that with impunity because the identifier would not change.

Note that I am not suggesting that the BLM has to have all BLM:00001 type identifiers. I think it would be entirely reasonable to use identifiers from other systems (like RO or SIO or whatever is appropriate for a given identity but choosing a single best identifier for each concept).

Inverse predicates (and their use in Translator KGs)

Different knowledge sources often assert associations in different directions. For example, the gene expressed_in anatomy, vs anatomy expresses gene.

On the 4-30 KG Standardization call, it was proposed that rather than enforcing such associations to always be made in one direction in KGs, such that only a single predicate is needed, we would allow assertion in either direction, and create the inverse predicates. A has_inverse slot in the biolink model will be used to indicate inverse predicates, and allow normalization to one direction when required.

The convention here will be to add the inverse_of statement in the blm only on the predicate representing the 'canonical' direction (which will need to be decided for each such pair of predicates). In this way we mark the canonical direction that is preferred for normalization.
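A minimal sketch of the normalization this convention would enable. The table maps each canonical predicate to its inverse. The choice of which direction is canonical, the node identifiers, and the function name are all assumptions made for illustration.

```python
# Canonical predicate -> its inverse. Only the canonical direction carries the
# statement, as in the proposal. Which side is canonical is assumed here.
INVERSE_OF = {
    "expressed_in": "expresses",
    "part_of": "has_part",
}

# Reverse lookup: non-canonical predicate -> canonical predicate.
TO_CANONICAL = {inverse: canonical for canonical, inverse in INVERSE_OF.items()}


def normalize(edge):
    """Rewrite an edge to the canonical direction, swapping subject and object."""
    subject, predicate, obj = edge
    canonical = TO_CANONICAL.get(predicate)
    if canonical is None:
        # Already canonical, or a predicate with no declared inverse.
        return edge
    return (obj, canonical, subject)


# Placeholder identifiers, not real CURIEs.
edges = [
    ("ANAT:1", "expresses", "GENE:1"),
    ("GENE:1", "expressed_in", "ANAT:1"),
    ("GENE:1", "treats", "DISEASE:1"),
]
normalized = {normalize(edge) for edge in edges}
print(sorted(normalized))
```

After normalization the two expression edges collapse into one, so a KG that asserts either direction can still be merged and deduplicated against one that asserts the other.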

Looking for feedback on this proposal before it gets implemented in the biolink-model.yaml file.

Wikidata predicate relations

The Semantic Medline Database and Wikidata use a set of predicates which should perhaps be added to the Biolink Model (perhaps, via the Translator predicate harmonization effort?). Here is the list of interest (Wikidata wd: curies given):

Response Body
[
{
"id": "wd:P3356",
"name": "positive diagnostic predictor",
"definition": ""
},
{
"id": "wd:P129",
"name": "physically interacts with (in molecular biology)",
"definition": ""
},
{
"id": "wd:P279",
"name": "subclass of",
"definition": ""
},
{
"id": "wd:P276",
"name": "location",
"definition": ""
},
{
"id": "wd:P1557",
"name": "manifestation of",
"definition": ""
},
{
"id": "wd:P361",
"name": "part of",
"definition": ""
},
{
"id": "wd:P156",
"name": "followed by",
"definition": ""
},
{
"id": "wd:P1056",
"name": "product",
"definition": ""
},
{
"id": "wd:P2888",
"name": "exact match",
"definition": ""
},
{
"id": "wd:P2175",
"name": "medical condition treated",
"definition": ""
},
{
"id": "wd:P2283",
"name": "uses",
"definition": ""
},
{
"id": "wd:P1542",
"name": "cause of",
"definition": ""
},
{
"id": "wd:property_id",
"name": "",
"definition": ""
},
{
"id": "kb:P2176",
"name": "drug used for treatment",
"definition": ""
},
{
"id": "wd:P703",
"name": "found in taxon",
"definition": ""
},
{
"id": "wd:P688",
"name": "encodes",
"definition": ""
},
{
"id": "wd:P684",
"name": "ortholog",
"definition": ""
},
{
"id": "wd:P682",
"name": "biological process",
"definition": ""
},
{
"id": "wd:P681",
"name": "cell component",
"definition": ""
},
{
"id": "wd:P680",
"name": "molecular function",
"definition": ""
},
{
"id": "wd:P3433",
"name": "biological variant of",
"definition": ""
},
{
"id": "wd:P31",
"name": "",
"definition": ""
},
{
"id": "wd:P2293",
"name": "genetic association",
"definition": ""
},
{
"id": "wd:P1552",
"name": "has quality",
"definition": ""
},
{
"id": "wd:P128",
"name": "regulates (molecular biology)",
"definition": ""
}
]

Derived file: list of ontologies mapped to entity types they represent

Predicates for gene-condition associations

In recent Translator Knowledge Graph (tkg) standardization calls we reviewed different approaches for creating predicates linking genes directly to disease. The proposal below would create a set of predicates for connecting genes to diseases separate from those used to connect variants to conditions. See column G in the spreadsheet here, starting at row 14, to get a sense of the Reasoner-requested predicates that informed the predicates in this proposal.

Approach 1

The proposed predicates below describe important ways genes are related to conditions, as informed by predicates used in one or more Reasoner/Translator KG. In reality most are shortcuts for the fact that a variant or a product of the gene is related to the condition in the indicated way. These 'shortcuts' are needed because many KSs and KGs don’t represent gene variants or products, and wish only to associate genes directly to a condition to which their variants or products contribute.

gene_associated_with_condition   
      gene_mutations*_contribute_to                        
            gene_mutations*_causal_for                
            gene_mutations*_affect_risk_for                
      gene_regulation_correlates_with                        
      gene_activity_contributes_to        
      gene_product_is_therapeutic_target_for             
  
*for labels, instead of 'gene_mutations' consider  'gene_alterations'?, 'gene_variants'?, 'gene_alleles'?

Requirements for many of these predicates come from Team IR and their GNBR resource - again for specifics see column G in the spreadsheet here, starting at row 14. Team Xray would likely use the generic top level predicate here to map to their 'gene_associated_with' predicate, but may also have a use case for the gene_mutation_contributes_to_condition predicate. Monarch would likely also use the gene_mutations_contribute_to predicate as their gene-disease associations are inferred across causal variants.

In addition to the predicates above, the following would be created to link things like variants or exposures to conditions. These would be used in KGs where the variant/allele is represented. Here we propose using relatively generic predicates such as 'causes' instead of 'causes_condition', which are not specific for variant as domain and condition as range. These predicates describe direct causal/correlation relationships - as it is the variant that is indeed doing the causing or correlating.

contributes_to
     causes
correlated_with
     biomarker_for
affects_risk_for

Pros of this Approach

semantically correct and clear predicates with simple mappings to KG predicates in use by reasoners.
predicates provide precise semantics for traversal and reasoning.
the gene-condition relations above could all be derivable from the variant-condition associations through inference (e.g. property chains), provided that the variants are connected to the genes they affect. This would allow interoperation of KGs that do and don’t capture this more normalized/granular pattern.
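The derivation mentioned in the last point can be sketched in a few lines. This assumes a simple one-to-one pairing between the variant-level predicates and the gene-level 'shortcut' predicates, and a lookup from each variant to the gene it affects; the pairing, the identifiers and the function name are illustrative assumptions rather than part of the proposal.

```python
# Assumed pairing of variant-level predicates with the gene-level shortcuts.
VARIANT_TO_GENE_PREDICATE = {
    "contributes_to": "gene_mutations_contribute_to",
    "causes": "gene_mutations_causal_for",
    "affects_risk_for": "gene_mutations_affect_risk_for",
}


def infer_gene_edges(variant_to_gene, variant_edges):
    """Derive gene-condition edges from variant-condition edges.

    Follows the chain gene <- variant -> condition, and skips any edge whose
    variant has no known gene or whose predicate has no gene-level shortcut.
    """
    inferred = set()
    for variant, predicate, condition in variant_edges:
        gene = variant_to_gene.get(variant)
        gene_predicate = VARIANT_TO_GENE_PREDICATE.get(predicate)
        if gene is not None and gene_predicate is not None:
            inferred.add((gene, gene_predicate, condition))
    return inferred


# Placeholder identifiers, not real CURIEs.
variant_to_gene = {"VAR:1": "GENE:1", "VAR:2": "GENE:1"}
variant_edges = [
    ("VAR:1", "causes", "COND:1"),
    ("VAR:2", "affects_risk_for", "COND:2"),
    ("VAR:3", "causes", "COND:3"),  # no gene link, so nothing is inferred
]
gene_edges = infer_gene_edges(variant_to_gene, variant_edges)
print(sorted(gene_edges))
```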

Cons of this Approach

use of separate predicates for genes and variants results in larger number of predicates (but really only three are 'duplicated' in the proposal above).

An orthogonal consideration for this approach concerns the granularity of the proposed gene-condition predicates. Within this approach of using separate predicates for gene-condition vs variant-condition relationships, perhaps we can merge some predicates where the distinction between them is not required at the level of the minimal spec.

Add UMLS Semantic Groups and (some) Wikidata Concept types to the Biolink model

Need to add the UMLS Semantic Groups plus some Wikidata entity concepts as entries and mappings to the Biolink model

CURIE identifiers for biolink model entries

Although the name: fields of the model are both unique and human friendly, there are still likely to be some programmatic instances in which a more concisely specified globally unique CURIE will be more convenient and efficient than multi-word, space delimited, variable length (and sometimes long) Biolink Model ontology term names. For example, when encoding Biolink Model specific semantic data types and predicates as input parameters to API services or as the field values of JSON outputs, CURIEs are probably more concise but easily looked up by programs using an indexed Biolink model read into memory.

annotate with bioschema URIs

Derived file: list of relations used in association hierarchy

For alignment with https://docs.google.com/spreadsheets/d/1zXitcR1QjHyh6WocukgshSR7IoAVg7MJQG-HNh96Jec/edit#gid=3366698

Decide on variant datamodel

we currently map blm sequence variant to http://purl.obolibrary.org/obo/GENO_0000002

this represents an actual state, i.e. allele

need to decide on slots the variant class will have, cc @deepakunni3 @julesjacobsen @mbrush

SPDI https://www.ncbi.nlm.nih.gov/variation/notation/
VMC
also string literals of HGVS
...

See also the jsonld here:
https://reg.genome.network/allele/CA024716

inconsistencies introduced by mappings should be surfaced when running validation on the model [was: Curies are mapped onto multiple categories]

SIO:010004 maps to both molecular entity and chemical substance
SIO:010450 maps to both transcript and RNA product
SIO:010046 maps to anatomical entity, macromolecular complex, and gross anatomical structure
GENO:0000512 maps to both allele and sequence variant
WD:Q4936952 maps to both anatomical entity and gross anatomical structure

Mappings should be functions: each curie should map onto a single category. We could set up Travis CI to check these sorts of things on every pull request.
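A sketch of the kind of check that could run in CI, assuming the mappings are available as a category-to-CURIEs dict. The sample data reuses some of the conflicts listed above plus one non-conflicting entry; the dict layout and function name are assumptions.

```python
from collections import defaultdict

# Category -> mapped CURIEs. Sample data taken from the conflicts listed in
# this issue, plus one mapping that is not shared with any other category.
MAPPINGS = {
    "molecular entity": ["SIO:010004"],
    "chemical substance": ["SIO:010004"],
    "transcript": ["SIO:010450"],
    "RNA product": ["SIO:010450"],
    "allele": ["GENO:0000512"],
    "sequence variant": ["GENO:0000512"],
    "phenotypic feature": ["UPHENO:0001001"],
}


def find_conflicts(mappings):
    """Return {curie: sorted categories} for every CURIE mapped more than once."""
    categories_by_curie = defaultdict(set)
    for category, curies in mappings.items():
        for curie in curies:
            categories_by_curie[curie].add(category)
    return {
        curie: sorted(categories)
        for curie, categories in categories_by_curie.items()
        if len(categories) > 1
    }


conflicts = find_conflicts(MAPPINGS)
for curie, categories in sorted(conflicts.items()):
    print(f"{curie} maps to multiple categories: {', '.join(categories)}")
```

A CI job would presumably fail the build whenever the returned dict is non-empty.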

literature co-occurrence slot

We'd like to store edges in a knowledge graph indicating that two entities are both mentioned in a particular article (or articles). Appearing together in the same abstract is interesting, but I don't think it's really appropriate to say that the two entities are "associated" or "interacting", so I'd like a slot that is used specifically for literature co-occurrence.

"Associated with" predicate

I know "associated with" is overly vague, but is there a predicate for this? Should we make one?

Proposed label changes for a few Translator min predicates

Proposing a few minor changes to labels used for min predicates - mainly to maintain internal consistency of naming principles and assure clear distinction between terms as new predicates get added. And in some cases simply shorten/simplify labels where feasible.

directly_interacts_with -> physically_interacts_with (to better distinguish it from its new sibling 'genetically_interacts_with')
has_affected_sequence_feature -> affects_sequence_feature (a bit cleaner, with a shorter label), or just eliminate this and use the more generic 'affects' predicate for these edges

Questions for G2P schema - relationship type, nested schemas

In the G2P schema the range for relationship is RelationshipType. However, RelationshipType does not contain any fields/slots, where I would expect at least id and label. Or am I misinterpreting the spec?

Should certain fields contain nested schemas, e.g.

publications = fields.Nested(PublicationSchema, many=True)

instead of

publications = fields.Str()

Does provider need fields? https://biolink.github.io/biolink-model/docs/Provider.html

Annotate types with identifiers that can be used

align with Broad reasoner types and paths

https://github.com/broadinstitute/reasoner/blob/master/reasoner/KnowledgeMap.py
