monarch-initiative / geno-ontology

Repository for representing genotypes and their association with phenotypes

geno-ontology's Introduction

GENO-ontology

This repository holds the GENO ontology, which can be found in the OWL file here.

A review of the modeling and use cases supported by GENO can be found in the slide deck here.

A detailed example of how GENO can be used to represent a complex Drosophila genotype is described in the document here.

Contribution and release guide

Overview

GENO is an OWL2 ontology that represents the levels of genetic variation specified in genotypes, to support genotype-to-phenotype (G2P) data aggregation and analysis across diverse research communities and sources. The core of the ontology is a graph decomposing a genotype into smaller components of variation, from a complete genotype specifying sequence variation across an entire genome, down to specific allelic variants and sequence alterations (Figure 1). Structuring genotype instance data according to this model supports a primary use case of GENO to enable integrated analysis of G2P data where phenotype annotations are made at different levels of granularity in this genotype partonomy. GENO also enables description of various attributes of genotypes and genetic variants. These attributes include zygosity, genomic position, expression, dominance, and functional dependencies or consequences of a given variant.

Figure 1: Decomposition of a Genotype. (A) Top level breakdown into reference and variant components. (B) Further decomposition of the genomic variation complement into its more fundamental parts. Class labels are in blue, and exemplar instances of each class are shown in green, for a zebrafish genotype which contains a homozygous ti282 variant of the fgf8 gene, and a heterozygous hu745 variant of the apc gene. Schematics graphically illustrate extent of genomic DNA represented at each level in the partonomy.
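
The roll-up that this partonomy enables can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `collect_phenotypes` helper, the level names, and the phenotype strings are all hypothetical and are not taken from geno.owl; only the fgf8/ti282 and apc/hu745 labels come from the Figure 1 caption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of the genotype partonomy (level names are illustrative)."""
    label: str
    kind: str
    parts: list[Node] = field(default_factory=list)
    phenotypes: list[str] = field(default_factory=list)


def collect_phenotypes(node: Node) -> set[str]:
    """Gather phenotype annotations made on a node or on any of its parts."""
    found = set(node.phenotypes)
    for part in node.parts:
        found |= collect_phenotypes(part)
    return found


# Hypothetical instance data loosely following the zebrafish example in Figure 1.
# The phenotype strings are placeholders, not real annotations.
ti282 = Node("ti282", "sequence alteration", phenotypes=["placeholder phenotype A"])
fgf8_allele = Node("fgf8 ti282 allele", "variant allele", parts=[ti282])
apc_allele = Node("apc hu745 allele", "variant allele",
                  phenotypes=["placeholder phenotype B"])
complement = Node("fgf8 ti282/ti282; apc hu745/+", "genomic variation complement",
                  parts=[fgf8_allele, apc_allele])
genotype = Node("example zebrafish genotype", "genotype", parts=[complement])

# Annotations made at different levels are visible from the top of the partonomy.
print(sorted(collect_phenotypes(genotype)))
```

Annotations attached at the sequence-alteration level and at the allele level both surface when querying the whole genotype, which is the integration the model is meant to support.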

In addition to heritable variation in genomic sequence specified by traditional genotypes, GENO also represents transient variation in gene expression, as seen in experiments where genes are targeted by knockdown reagents or overexpressed by DNA constructs at the time a phenotype is assessed. This variation in gene expression is represented in terms of the targeted genes themselves, to parallel representation of sequence variation and facilitate integrated description and analysis of data about any genetic contribution to a measured phenotype.

Finally, GENO also supports modelling of G2P associations, focusing on the interplay between genotype, phenotype, and environment. GENO describes provenance and experimental evidence for these associations using the Scientific Evidence and Provenance Information Ontology (SEPIO) model.

GENO is orthogonal to but has contact points with a number of existing community ontologies, including the Sequence Ontology (SO), the Human Phenotype Ontology (HPO), the Feature Annotation Location Description Ontology (FALDO), and the Variation Ontology (VariO). We will work with developers of these models to align representations and re-use common terms where possible. Further documentation for GENO is under development, but a high level overview of its core model can be found in the slide deck here, and a summary of the use cases for which it is being developed can be found here.

GENO is an open source ontology, implemented in OWL2 under a Creative Commons 4.0 BY license.

geno-ontology's Issues

epigenetic genotype (epitype?)

At some point we'll need to consider a person's epigenetic factors related to disease susceptibility, etc. This could include methylation patterns, histone modifications, etc.

Should this be another subclass of "genotype" (GENO:0000536) and/or part of the "intrinsic genotype" (GENO:0000000) ? There are some relevant classes in SO related to these kinds of modifications:
http://sequenceontology.org/browser/current_svn/term/SO:0001720. We should think about how we might annotate / combine these together to create this part of an organism's genotype. Or for a different view, perhaps the epigenetic type isn't even part of the genotype, but rather is outside of an organism's genotype (epitype, anyone?)...considering that epigenetic markers may differ based on cell-type.

For some background reading:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3034103/
http://en.wikipedia.org/wiki/Epigenetics

Perhaps we can consider using some cancer variations and changes in their methylation pattern as a use case, as described here: http://en.wikipedia.org/wiki/Epigenetics#DNA_repair_epigenetics_in_cancer

geno0000846

https://github.com/monarch-initiative/GENO-ontology/blob/develop/src/ontology/geno.owl#L144

missing underscore after geno
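
A check like the following could catch this class of typo before release. It is a rough sketch, assuming GENO local IDs always take the form `GENO_` plus seven digits (which matches the IDs quoted elsewhere in this tracker); the function name and the sample XML are made up for illustration.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"
# Assumed shape of a well-formed GENO local ID.
WELL_FORMED = re.compile(r"^GENO_\d{7}$")
# Anything under the OBO base that looks like a GENO ID, well-formed or not.
LOOKS_LIKE_GENO = re.compile(re.escape(OBO_BASE) + r"(geno[_\-]?\d+)", re.IGNORECASE)


def malformed_geno_ids(text: str) -> list[str]:
    """Return GENO-like local IDs that do not match GENO_ plus seven digits."""
    bad = []
    for match in LOOKS_LIKE_GENO.finditer(text):
        local_id = match.group(1)
        if not WELL_FORMED.match(local_id):
            bad.append(local_id)
    return bad


# Hypothetical excerpt: one well-formed IRI and one with the underscore missing.
sample = (
    '<owl:Class rdf:about="http://purl.obolibrary.org/obo/GENO_0000536"/>\n'
    '<owl:Class rdf:about="http://purl.obolibrary.org/obo/GENO0000846"/>\n'
)
print(malformed_geno_ids(sample))
```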

Extend to model complex transgenics (such as CRISPR) and overexpression

Per R24 here

Add classes of genetic sex

These are needed for GA4GH meta-data. I pointed them here to GENO as an alternative to creating enums. If there is a better home that's fine, but let's try and avoid them using an enum

Alternative label on http://www.geneontology.org/formats/oboInOwl#hasDbXref in geno.owl

Geno sets the rdfs:label of http://www.geneontology.org/formats/oboInOwl#hasDbXref as hasDbXref when other sources have it as database_cross_reference.

Ideally this should all be set centrally with imports. In the absence of that, can geno.owl be changed to conform?

ontology_name	IRI	label	short_form
vfb	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
fbdv	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
fbcv	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
ro	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
so	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
geno	http://www.geneontology.org/formats/oboInOwl#hasDbXref	has_dbxref
fbbi	http://www.geneontology.org/formats/oboInOwl#hasDbXref	database_cross_reference	hasDbXref
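
For what it's worth, the disagreement in the table can be detected mechanically. The sketch below groups the rows above by IRI and reports any IRI carrying more than one label; the row data is copied from the table, while `conflicting_labels` is a hypothetical helper and not part of any existing tool.

```python
from collections import defaultdict

HAS_DBXREF = "http://www.geneontology.org/formats/oboInOwl#hasDbXref"

# (ontology_name, IRI, label) rows copied from the table above.
ROWS = [
    ("vfb", HAS_DBXREF, "database_cross_reference"),
    ("fbdv", HAS_DBXREF, "database_cross_reference"),
    ("fbcv", HAS_DBXREF, "database_cross_reference"),
    ("ro", HAS_DBXREF, "database_cross_reference"),
    ("so", HAS_DBXREF, "database_cross_reference"),
    ("geno", HAS_DBXREF, "has_dbxref"),
    ("fbbi", HAS_DBXREF, "database_cross_reference"),
]


def conflicting_labels(rows):
    """Map each IRI that has more than one label to {label: [ontologies]}."""
    labels_by_iri = defaultdict(lambda: defaultdict(list))
    for ontology, iri, label in rows:
        labels_by_iri[iri][label].append(ontology)
    return {
        iri: dict(labels)
        for iri, labels in labels_by_iri.items()
        if len(labels) > 1
    }


for iri, labels in conflicting_labels(ROWS).items():
    print(iri)
    for label, ontologies in labels.items():
        print(f"  {label}: {', '.join(ontologies)}")
```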

need more chr stains

Mouse lists the following staining patterns:

gneg
gpos100
gpos33
gpos66
gpos75

so, we need to add gpos33 and gpos66

Align with DDGENMOD

https://github.com/dictyBase/migration-data/tree/master/ontologies

cc @pfey03

stalk/short arm and acen terms

Previously 'GENO:0000628' was "short_chromosome_arm", and it continues to be used this way in Monochrom.py to indicate the 'p' arm of a chromosome.

Now GENO:0000628 is "stalk" and is used as such in UCSCBands.py, which is more specific than p/short arm.
A decision is needed: either change Monochrom to use 'stalk' in place of "short_chromosome_arm", or provide an alternative short arm term.

There is also the giemsa staining type "acen" in UCSCBands which possibly indicates "Acrocentric Chromosomes" which I would appreciate a suggested term for.

(there exists http://purl.obolibrary.org/obo/FMA_84705)

Mode of Inheritance classes

Ticket #2 proposes a hierarchy of mode of inheritance terms for GENO that covers the diversity of concepts in this area needed for data ingest and interoperability. Here we discuss the ontological nature and definitions of these concepts so that we may create a clear and useful model of them for the larger community. The central question is how to frame them - as processes of inheritance? As qualities of inheritance processes? As qualities or dispositions of the trait/phenotype itself?

Some terminological clarification first. We define the following:

Trait = a characteristic that varies across organisms in a species or population (e.g. 'flower color')
Phenotype = the specific values of a trait (red, pink, white, mottled flowers)
Phenotype inheritance = the process by which a specific phenotype (e.g. pink flowers) is passed from one generation to the next based on the genetic interactions between alleles expressed at the responsible locus or loci (and contributions from the environment).

Approaches

Framing inheritance classes as processes (i.e. subtypes of a root 'phenotype inheritance process') seems a relatively simple approach and avoids delving into the historically contentious realm of dependent continuants. It would require some relation of participation between a given trait/phenotype/disease and an inheritance class. For example:

'baldness' participates_in or is output_of some 'X-linked inheritance'
diabetes participates_in or is output_of some 'polygenic inheritance'

Framing these classes as qualities of a trait or disease would mean some type of bearer_of relation between a given trait/phenotype/disease and an inheritance class. For example:

'baldness' has_quality some 'X-linked inheritance'
diabetes has_quality some 'polygenic inheritance'

Framing these classes as qualities of an inheritance process would make the link from a given trait/phenotype/disease and an inheritance class trickier to define - as the term represents a quality that inheres not in the phenotype but in the process through which the phenotype is inherited.

Use OBO obsoletion policy for retired IRIs

Reported by @TomConlin monarch-initiative/dipper#379

geno-developer has:

   <rdf:Description rdf:about="http://purl.obolibrary.org/obo/GENO_0000382">
        <rdfs:comment>formerly used http://purl.obolibrary.org/obo/GENO_0000532</rdfs:comment>
    </rdf:Description>

I recommend using the obsoletion plugin and not burying IRIs in comments

To repair this manually, add back the IRI, deprecate it, and add a 'term replaced by' axiom pointing to
http://purl.obolibrary.org/obo/GENO_0000382 (has_variant_part)
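
The manual repair could look roughly like the output of the sketch below. It assumes the usual OBO conventions of `owl:deprecated` set to true plus the IAO 'term replaced by' annotation (IAO:0100001); the `obsolete_stub` helper is hypothetical, and the stub deliberately omits the entity type declaration and the "obsolete" label that a complete obsoletion would normally also carry.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "obo": "http://purl.obolibrary.org/obo/",
}
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)


def obsolete_stub(retired_iri: str, replacement_iri: str) -> ET.Element:
    """Build a minimal RDF/XML stub marking retired_iri as replaced."""
    rdf, owl, obo = (NAMESPACES[p] for p in ("rdf", "owl", "obo"))
    root = ET.Element(f"{{{rdf}}}RDF")
    term = ET.SubElement(root, f"{{{rdf}}}Description",
                         {f"{{{rdf}}}about": retired_iri})
    # owl:deprecated "true"^^xsd:boolean
    deprecated = ET.SubElement(
        term, f"{{{owl}}}deprecated",
        {f"{{{rdf}}}datatype": "http://www.w3.org/2001/XMLSchema#boolean"},
    )
    deprecated.text = "true"
    # IAO:0100001 is 'term replaced by'.
    ET.SubElement(term, f"{{{obo}}}IAO_0100001",
                  {f"{{{rdf}}}resource": replacement_iri})
    return root


stub = obsolete_stub(
    "http://purl.obolibrary.org/obo/GENO_0000532",
    "http://purl.obolibrary.org/obo/GENO_0000382",
)
print(ET.tostring(stub, encoding="unicode"))
```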

Document how gene interactions are modeled

A good question from @kyook (see the slides in Dropbox for background)

I don’t understand where gene interactions are accounted for.  We separate genetic background - like in your example of the daf-2;fog-4 mutant, in this case, as displayed fog-2 would be considered a background mutation, one that needs to be there to assess the phenotype, but in slide 4 why is fgf8 homozygosity not considered the genetic background?  In each experiment, the phenotype needs to be assessed based on a control genotype. In this case, the control genotype would be fgf8 wouldn’t it?

SO term requests

Central Ticket for documenting all term requests for SO.

chromosome band staining intensity qualities presently implemented in GENO as placeholders. And then mireot to replace temp GENO classes. Also need the object property for linking a band to its staining intensity.
new chromosome parts (chromosomal region, chromosome arm, long chromosome arm, short chromosome arm). And then mireot to replace temp geno classes.

Collecting use cases/requirements for genotype data representation and use

Seems that there have been discussions/ideas happening about genotype/variant representation - with use cases/requirements coming from Monarch and other efforts (e.g. JAX, MGI, MPD, AGR, etc).

I'd like to collect info on these efforts and any requirements/thoughts about genotype data representation, querying, operations, and rendering in applications, etc. - ideally in advance of the Feb 28 'Genotype Representation' session at the Monarch All Hands.

@cmungall @mellybelly @pnrobinson @sbello, others - please jot some thoughts down if you have a chance. Thanks!

Mechanism of Pathogenicity terms

Consider adding hierarchy of terms describing mechanism of pathogenicity, based on requirements coming initially from ClinGen and GenCC working groups. Terms from GenCC work to harmonize across Gene curation efforts include things like:

Loss of function
Dominant Negative
All missense/in-frame
Cis-reg or promoter region
5’ or 3’UTR
Activating
Increased Gene Dosage
Duplication
Transcript ablation
Splice Acceptor, Donor
Stop gained/lost
Initiator codon/start lost

Questionable IRIs

There are a few IRIs in geno.owl that use invalid URIs:

<!-- http://urigen-plugin/ontology-iri -->
    <owl:AnnotationProperty rdf:about="http://urigen-plugin/ontology-iri">
        <rdfs:label xml:lang="en">ontology-iri</rdfs:label>
    </owl:AnnotationProperty>

    <!-- http://urigen-plugin/server-url -->
    <owl:AnnotationProperty rdf:about="http://urigen-plugin/server-url">
        <rdfs:label xml:lang="en">server-url</rdfs:label>
    </owl:AnnotationProperty>

and

<!-- http://property/refactor-include -->
    <owl:AnnotationProperty rdf:about="http://property/refactor-include"/>

<!-- http://property/refactor-include-subs -->
    <owl:AnnotationProperty rdf:about="http://property/refactor-include-subs"/>

Is this intentional? Should these be resolvable IRIs?

/cc @nlwashington
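
A crude heuristic for spotting IRIs like these is to flag any whose host contains no dot, since such a host cannot be a registered domain. The sketch below is only that heuristic (it would also flag `localhost`, for instance); `is_suspicious` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_suspicious(iri: str) -> bool:
    """Flag IRIs whose host has no dot and so cannot be a registered domain."""
    host = urlparse(iri).hostname or ""
    return "." not in host


candidates = [
    "http://urigen-plugin/ontology-iri",
    "http://urigen-plugin/server-url",
    "http://property/refactor-include",
    "http://property/refactor-include-subs",
    "http://purl.obolibrary.org/obo/GENO_0000382",
]
for iri in candidates:
    if is_suspicious(iri):
        print("suspicious:", iri)
```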

Add slides or link to slides in this repo

We have good slides in our Dropbox. We should either put them on SlideShare or git add them here.

Map GENO to HPO inheritance/dominance

Transferred from: NIF-11647
Original Reporter: Nicole Washington

i notice that HPO has some dominance/inheritance terms. these should be reconciled some way with GENO.

for example:
http://www.human-phenotype-ontology.org/hpoweb/showterm?id=HP:0000006

Representation of chromosomal boundary regions

Transferred from: NIF-11899
Original Reporter: Melissa Haendel

This landmark paper describes the identification of chromosomal boundary regions. http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pmc/articles/PMC3356448/

references other papers describing other types of chromosomal boundaries:

A and B compartments http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pmc/articles/PMC2858594/

Lamina-Associated Domains (LADs)
http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pubmed/18463634
http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pubmed/20513434

replication time zones
http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pmc/articles/PMC2813472/
http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pmc/articles/PMC2877573/

Large Organized Chromatin K9-modification (LOCK) domains
http://www-ncbi-nlm-nih-gov.liboff.ohsu.edu/pmc/articles/PMC2632725/

The GENO/SO should have a representation of these different types of chromosomal boundaries, and we should further identify data sources to ingest to support chromosomal level correlations of genotype to phenotype and display on the genome browser.

Clarification on "gene allele" http://purl.obolibrary.org/obo/GENO_0000014

The Protein Ontology has been tasked with taking over the protein-related terms from MRO (Major Histocompatibility Complex (MHC) restriction ontology). The MRO terms are often defined with respect to a locus, which can include multiple syntenic genes. For example, HLA-A, HLA-B, and HLA-C are all genes located at the HLA locus (human); H2-Q1 through H2-Q15 are all genes located at the H2-Q locus (mouse). A search for the term "locus" (or related) brought me to your term "gene allele" (synonym: "gene locus").

I was pleased to see the term defined as I would expect for a locus--that is (paraphrasing) with respect to position. However, the placement in the GENO hierarchy (viewed after reasoning) is odd, as it appears to be tied (indirectly) to sequence (since alleles are, ultimately, sequence-based). The hierarchy, in contrast to the definition, would make this term equivalent in meaning to "an allele of a gene" (as opposed to, say, an allele of a nucleotide, to use an example from the parent term "allele") and not a locus, per se. It thus appears that the logical definitions (equivalencies) are in conflict with the text definition.

The simple fix, taking the shortest route (so to speak), would be to revise the text definition to reflect the logical one, maintaining its position in the hierarchy. Then, a new term "locus" would be minted using the current text definition of gene allele (however, see notes below).

I would expect to use this new locus term along the lines of the following:

encoded_by found_within
(the relations are placeholders, but I trust their intent is understood)

Here I'm using locus in the sense of the text definition. It's possible to make child terms specifying things like gene locus (the place where a specific gene is typically found), etc.

Note 1: It isn't clear to me what "canonical allele" is. (1) By label I would say it's the same as reference allele, (2) by definition it seems more like the aforementioned locus I requested, and (3) by comment it would seem its child terms represent the bearers of the features given under sequence feature. If (2) is the case, I would not expect allele to be subclassed here. I'm hoping that (3) is correct, as it would mean this term gives clarity to the issue plaguing SO, which conflates information content entities with material entities. The comment within "sequence feature" lends credence to the notion that "canonical allele" is the "biological sequence" mentioned, and thus (3) pertains.

Note 2: Based on other considerations (such as explanations given in child terms), it would seem "gene allele" is serving as three entities: an allele of a gene, the locus of a gene, and a locus in general.

Typo in the human taxon name

https://github.com/monarch-initiative/GENO-ontology/blob/develop/src/ontology/geno.owl#L5342

It should be a capital "H".

See https://github.com/monarch-initiative/monarch-app/issues/1224

Add 'diplotype' class

See definition here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4118015/

Needed for ClinVar annotations -- which will soon be made to diplotypes.

geno-developer imports itself

    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/geno-developer.owl">
        <owl:imports rdf:resource="http://purl.obolibrary.org/obo/geno-developer.owl"/>
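
A self-import of this kind can be detected with a short script. The sketch below assumes the ontology is serialized as RDF/XML; `self_imports` is a hypothetical helper, and the sample wraps the two lines quoted above in the minimal closing tags needed to parse them.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
OWL = "{http://www.w3.org/2002/07/owl#}"


def self_imports(rdf_xml: str) -> list[str]:
    """Return ontology IRIs that list themselves in owl:imports."""
    root = ET.fromstring(rdf_xml)
    hits = []
    for ontology in root.iter(f"{OWL}Ontology"):
        own_iri = ontology.get(f"{RDF}about")
        if own_iri is None:
            continue
        for imported in ontology.findall(f"{OWL}imports"):
            if imported.get(f"{RDF}resource") == own_iri:
                hits.append(own_iri)
    return hits


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/geno-developer.owl">
        <owl:imports rdf:resource="http://purl.obolibrary.org/obo/geno-developer.owl"/>
    </owl:Ontology>
</rdf:RDF>"""

print(self_imports(SAMPLE))
```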

grouping term for population, strain, background, fish, genotype, model?

Data sources use a variety of modeling elements to describe the collection of genetic elements that together produce a particular phenotype or contribute to disease progression.

We'd like one abstract grouping class to collect these various definitions so that we can harmonize the "bag of things" at the organism level, that can be connected to phenotype/disease.

This bag can include:
alleles (and variants, though this relationship should be investigated separately), including point mutations, insertions, deletions, deficiencies, rearrangements, etc.
transgenes (inserted in the genome via constructs)
morpholinos, crisprs, talens, rnai

This bag is sometimes called (depending on various factors as determined by the data source):
population
strain
background
fish
genotype
model

From admittedly naive browsing of GENO in OLS, is the closest mutual parent of these terms "continuant" in GENO?

We could alternatively handle this in biolink-model via an enumeration of GENO terms.
biolink/biolink-model#1003

RO compatibility issue

On our path to upgrading GENO to ODK, the following two issues came up:

GO_0003674 (molecular function) is hardcoded in GENO as function while RO asserts it to be a process
GENO makes a lot of use of the 'has role' relationship which formally applies to independent continuants. However, most if not all of the entities in GENO are declared to be "generically dependent continuants"; this causes about 40 classes to be unsatisfiable.

How do you want to proceed here?

I assume we just remove the GENO assertion; in the end, we need to get rid of all non-native axioms from GENO in any case (non-native: axioms that belong to another ontology).
The quick fix is to create a new object property or, even better, find a more suitable one in RO that reflects the geno use case. Example:

genomic background = 'genomic genotype' and ('has role' some reference)

@dosumis @cmungall Any suggestions for a better relation in this case?

Integration issues with SciGraph

(Not sure if this ticket belongs here or in SciGraph, but putting here for now)

When using the latest version of GENO we lose most of the genotype neo4j labels on our genotype nodes.

Genotypes

Production Query

Dev Query

field	count
genotypes	44892 (-656827)

This is resolved by using an older version of GENO - https://raw.githubusercontent.com/monarch-initiative/GENO-ontology/ce4caf5978824c25ffa15e0b0460369661e7ad4c/src/ontology/geno.owl

Our mapping to the genotype neo4j label is GENO:0000536.

haplotype

We need to add the concept of haplotypes.

Def from the interwebs:
A haplotype is a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.

clearly related, but not equivalent, to variant single locus complement.

required to properly express clinvar variants, and other data sources.

sex chromosomes in non-mammals

GENO currently only models the sex chromosome zygosity for mammals (with X and Y), but it should also accommodate the models for birds/reptiles which is Z and W.

here's some info:
http://en.wikipedia.org/wiki/ZW_sex-determination_system

i think that for now, we can leave out the monotremes from the model, but be aware that they have a blended method for their sex-determination system (and bunches of sex chromosomes).

Terms to describe transmission/inheritance of variance (from DECIPHER)

Pasting text from 1-19-18 email thread between Daniel Perrett, Melissa Haendel, and Peter Robinson:

Dear everybody
Daniel Perrett from the DECIPHER team is asking about terms to describe the transmission (inheritance) of variants. I think this might be best done with GENO.
I agree, Matt has been collecting similar requests from ClinGen and other clinical annotation groups.

Regarding where to put things, generally my feeling is that if it has to do with:
variation/inheritance/genotype and is applicable across species => GENO
human phenotype => HPO
sequence features in any species => SO

One can always import what you need into any given context from these and other ontologies.

Daniel, Matt is working on a manuscript for GENO and we would welcome suggestions! The tracker is here.

GENO currently has some inheritance terms that describe clinical inheritance (that in my opinion do not belong in GENO, they belong in HPO).
=> we can move them as needed, but if it is applicable in other species then probably needs to be in GENO so can be accessible for veterinary applications.

In contrast, GENO does not seem to currently have terms to describe transmission of variants -- this would be really useful for MME and other scenarios. Could we start a conversation about this?
Indeed! yes we need these, others asking for them too.

Cheers,
Melissa

The following text is by Daniel.
-Peter

In DECIPHER, the terms we accept for inheritance on a variant are:

De novo constitutive
De novo mosaic
Paternally inherited, constitutive in father
Paternally inherited, mosaic in father
Maternally inherited, constitutive in mother
Maternally inherited, mosaic in mother
Biparental
Imbalance arising from a balanced parental rearrangement
Mitochondrial Homoplasmy
Mitochondrial Heteroplasmy
Unknown

We no longer collect the following terms, but they exist in the database

CNV (= common in population)
Inherited from normal parent
Inherited from parent with similar phenotype to child
Inherited from parent with unknown phenotype

These are a single field, so a patient only gets one of the above options.

Sequence ontology has the following terms:

de_novo_variant
germline_variant
maternal_variant
paternal_variant
pedigree_specific_variant
population_specific_variant
somatic_variant

My current tentative thinking is that it would be useful to add the following terms:

Germline variation (i.e. catch-all 'this is inherited' term as opposed to the de novo/somatic mutation term)
Maternally inherited variation
Paternally inherited variation
Biparentally inherited variation
Imbalance arising from a balanced parental rearrangement

And possibly also the following, matching the SO terms.

Pedigree-specific variation
Population-specific variation

But there are lots more we could add, e.g. Mitochondrial somatic variation

However, I do not know a lot about how others use HPO for inheritance. I thought it would be useful to ensure this is following a consistent and useful overall approach before requesting lots of terms in the issue tracker (happy to use that if you prefer, though).

I hope all is well with you

Daniel

new inheritance terms

We would like to request the following terms be added as children of inheritance pattern (GENO:0000141)

chromosomal inheritance
def: An inheritance pattern wherein the trait is determined by inheritance of extra, missing, or re-arranged chromosomes possibly together with environmental factors.

chromosomal deletion inheritance
def: An inheritance pattern wherein the trait is determined by inheritance of missing sections of one or more chromosomes, encompassing either 0 or multiple genes, possibly together with environmental factors.
trying to distinguish this from monogenic inheritance
child of : chromosomal inheritance

chromosomal duplication inheritance
def: An inheritance pattern wherein the trait is determined by inheritance of duplicated sections of one or more chromosomes, encompassing either 0 or multiple genes, possibly together with environmental factors.
child of : chromosomal inheritance

chromosomal rearrangement inheritance
def: An inheritance pattern wherein the trait is determined by inheritance of translocation or inversion of sections of one or more chromosomes, possibly together with environmental factors.
child of : chromosomal inheritance

Implement classes describing phenotype/disease penetrance

Needed for ClinGen variant interpretation data modeling.

HPO already has a few classes in this area - complete, incomplete, age-dependent.

susceptable_to relationship

Initially reported on RO tracker: https://code.google.com/p/obo-relations/issues/detail?id=31
Original Reporter: Nicole Washington

Such a relation is needed in GENO as well, so we should work to get this into the RO and use in GENO

Initial post:
we need a relationship that aids in describing the susceptibility of a disease due to variation(s) of genomic regions.

for example:
http://omim.org/entry/607339

this entry describes variants in the genomic region of human chr16pter-p13 as contributing to the susceptibility of coronary heart disease.

this kind of relationship is necessary for relationships created through statistical associations.

Comment from: peternro

There are two different issues here.

Susceptibility:

contributes_to_susceptibility_for

e.g.,
if there is a GWAS hit rs1234 for disease X, then

rs1234 contributes_to_susceptibility_for 'disease X'

single (small) locus vs. genomic region
rs1234 is something like chr7:12345654G>A
that is, well defined position. This may be causative or it may be in linkage disequilibrium with an unknown causative variant.

On the other hand, a genomic region such as in the above OMIM entry is 800,000 bp long, was identified by linkage analysis ("statistical associations"), and the actual variant is simply somewhere in this region. Thus, future research is needed to find the actual variants that are causative. The actual nature of linkage is different for Mendelian disease (long distances, family specific) and common disease (shorter distances, population specific).
OMIM does not really model this stuff well.
If we want to do this for Monarch, I would suggest that we do about 10 examples together in detail, these are things that confuse most MDs and biologists and you need to keep staring at them for a long time.

By the way, susceptibility and the locus-region issue are completely orthogonal to one another!

Comment from: nicole.l.washington

yep, i only intended this to be an issue concerning the susceptibility relationship, which i perceive to be between any kind of genomic feature (which could be a specific base pair, or a large arm of a chromosome).

the difference between direct causation vs one (or more) unknown variants in a region with this relationship could either be captured within the relationship, or it could be part of the evidence code.

reference and alternate base properties

We need some object and/or data properties for listing reference and alternate nucleotide (and/or amino acid) sequences. I could not find any other ontologies that have these.

so to say something like:

my_sequence_alteration has_alternate_nucleotide_sequence 'A'

but the question would be, where would the has_reference_nucleotide go? i don't think we'd want to attach it to the variant.

also, if i want to indicate the downstream impact of the nucleotide change, how would i do that?

my_sequence_alteration results_in_alternate_amino_acid_sequence 'P'

? and/or should the nucleotide/amino acids be classes rather than literals?
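
To make the question concrete, here is one possible shape for the data, sketched in Python rather than OWL. It keeps the reference and alternate sequences together on a position-bound record instead of on the variant alone, which is only one of the options raised above and not a modeling decision. The class, the parser, and the accepted notation (a simple `chrN:posREF>ALT` substitution) are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSubstitution:
    """A substitution with literal reference and alternate sequences."""
    chromosome: str
    position: int
    reference: str   # would correspond to a has_reference_nucleotide-style value
    alternate: str   # would correspond to has_alternate_nucleotide_sequence


# Assumed notation, e.g. "chr7:12345654G>A"; not a full HGVS parser.
PATTERN = re.compile(r"^(chr[0-9A-Za-z]+):(\d+)([ACGT]+)>([ACGT]+)$")


def parse_substitution(text: str) -> SimpleSubstitution:
    """Parse the assumed chrN:posREF>ALT notation into a record."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported notation: {text!r}")
    chromosome, position, reference, alternate = match.groups()
    return SimpleSubstitution(chromosome, int(position), reference, alternate)


variant = parse_substitution("chr7:12345654G>A")
print(variant)
```

Whether the literals should instead be classes, as asked above, is left open; the sketch uses plain strings only because that is the simplest thing to show.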

Collect set of GENO competency questions

A rich and diverse set of CQs related to genotype representation and G2P associations would be useful for a number of purposes:

inform model development
document utility of the GENO model and Monarch data
drive identification of new data sources required to address CQs
inform new Monarch application requirements (w.r.t. information to display, derive, or analyze)

CQs can come from Monarch team, but ideally would also be contributed by a broad and diverse community of stakeholders.

Include OBO and JSON as released files

Currently, GENO apparently only includes OWL for release (OBOF page). To reach a wider audience (tagging AGR/Alliance's @sierra-moxon), it would be nice to have OBO and JSON available as well.

Use of ROBOT is the current workaround.

Tagging @cmungall

Do not use raw github URLs in imports

geno.owl includes

        <owl:imports rdf:resource="https://raw.githubusercontent.com/monarch-initiative/GENO-ontology/develop/src/ontology/imports/oboInOwl.owl"/>

In general I recommend a standard ODK setup
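
Until the setup is changed, imports of this kind can be flagged with a small check. The sketch assumes RDF/XML input; `raw_github_imports` is a hypothetical helper, and the second import in the sample is invented to show a PURL-based import passing the check.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

OWL_IMPORTS = "{http://www.w3.org/2002/07/owl#}imports"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"


def raw_github_imports(rdf_xml: str) -> list[str]:
    """Return owl:imports targets served from raw.githubusercontent.com."""
    root = ET.fromstring(rdf_xml)
    hits = []
    for imported in root.iter(OWL_IMPORTS):
        target = imported.get(RDF_RESOURCE, "")
        if urlparse(target).hostname == "raw.githubusercontent.com":
            hits.append(target)
    return hits


# First import is the one quoted above; the second is a made-up PURL import.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:Ontology rdf:about="http://purl.obolibrary.org/obo/geno.owl">
        <owl:imports rdf:resource="https://raw.githubusercontent.com/monarch-initiative/GENO-ontology/develop/src/ontology/imports/oboInOwl.owl"/>
        <owl:imports rdf:resource="http://purl.obolibrary.org/obo/example_import.owl"/>
    </owl:Ontology>
</rdf:RDF>"""

for target in raw_github_imports(SAMPLE):
    print("raw GitHub import:", target)
```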

Typo in label for GENO:0000147

'autosomal dominant iniheritance' (GENO:0000147) contains an extra i in inheritance. Can this be corrected? Thank you!

Representation of copy number variation

Need a way to represent variation due to non-canonical number of copies in the genome.

Not alleles- because these can exist at any location.

See the VMC CNV model being developed with GA4GH, and the ClinGen representation of copy number variation.

Useful for things like representing the subject of this variant interpretation (HER2 gene amplification)

Add allele properties

Transferred from: NIF-11599
Original Reporter: Nicole Washington

We need a section of GENO that records the properties of an "allele" on phenotype/gene expression. These "alleles" could be genetic changes or externally applied changes (like morpholinos/RNAi, transient transfection).

The kinds of things that need to be included here are:

1. up/down (over/under) expression. For example, as resulting from Transgenic/Actin promoter (up expression) or RNAi (down expression).
2. hypermorph (genetic changes that result in increased phenotype).
3. hypomorph (genetic changes that result in decreased phenotype).
4. environmentally-dependent (resulting from drugs, temperature, etc.). Changes in phenotype are only observed when an allele is exposed to a particular environment.
4b. temperature-specific. Different phenotypes are observed when this allele is exposed to a different temperature.

An allele can have >1 of the above properties. For example, I believe that a single allele might be hypomorphic at one temp, and hypermorphic at another.
There may need to be specific subtypes for each of these, but i think these are the general layout of the classes.

Representing uniparental disomy

The notion of uniparental disomy can be framed in different ways, to describe different aspects of biology:

a type of sequence alteration: SO, VariO
a phenotype: MP
a disease: MONDO, ORDO, NCIt
a mode of inheritance: HP
a cellular quality (ploidy): PATO
an allele origin: ClinVar
a pathological process: OMIT

We need to decide which of these perspectives is relevant to capture in GENO, and define/re-use appropriately.

Consider replacing some terms with PCO classes

I am working on a new term called taxon, which appears to be the same concept as http://purl.obolibrary.org/obo/GENO_0000113 (taxonomic group) (see PopulationAndCommunityOntology/pco#88). The label could just as well be taxonomic unit. This will be the parent to another new term for operational taxonomic unit (PopulationAndCommunityOntology/pco#9). Because these terms have scope well beyond defining genotypes, I think they belong in PCO, which is more general than GENO.

PCO also has a class for population of organisms (http://purl.obolibrary.org/obo/PCO_0000001), which OBI plans to import to replace their population term (see obi-ontology/obi#632). Eventually this term will move into COB, but I suggest replacing your OBI import with the PCO term, and replacing human population (http://purl.obolibrary.org/obo/GENO_0000111) with http://purl.obolibrary.org/obo/PCO_0000027 (collection of humans).

Strain or breed (http://purl.obolibrary.org/obo/GENO_0000112) also probably belongs in PCO, since it is defined as a collection of organisms.

genetic disease polygenic vs. monogenic

We need some classification of inheritance patterns; this should relate existing classes. Not sure which ontology or where this goes, but something like this:

inheritance pattern
--polygenic
--monogenic
----dominant
----recessive
etc.
--no inheritance pattern (not sure what to call this but we need to be able to classify diseases where there is no known genetic inheritance pattern).

@nlwashington @pnrobinson @mbrush
thoughts?
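
A small Python sketch of the outline in this issue as a child-to-parent map, to show how a classification like this could be queried. The labels are copied from the outline; `PARENT`, `ancestors`, and `is_a` are hypothetical names, and nothing here corresponds to existing ontology identifiers.

```python
from typing import Dict, List, Optional

# Child -> parent, mirroring the outline in this issue. Labels are placeholders.
PARENT: Dict[str, Optional[str]] = {
    "inheritance pattern": None,
    "polygenic": "inheritance pattern",
    "monogenic": "inheritance pattern",
    "dominant": "monogenic",
    "recessive": "monogenic",
    "no inheritance pattern": "inheritance pattern",
}


def ancestors(term: str) -> List[str]:
    """Walk up the hierarchy from `term` to the root, nearest parent first."""
    chain: List[str] = []
    parent = PARENT[term]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def is_a(term: str, candidate: str) -> bool:
    """True if `term` is `candidate` or sits anywhere below it."""
    return term == candidate or candidate in ancestors(term)
```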

term request: mosaic inheritance and/or genotype

I'm interested in modelling this ClinVar record:
https://www.ncbi.nlm.nih.gov/clinvar/RCV000087646/

I assume mosaicism is more accurately described as a mode of inheritance rather than a genotype, but having some object and/or datatype properties specific to mosaic inheritance could be useful, for example, percentage of one genotype vs another.

germline vs somatic variants

we will need to disambiguate between variants (and perhaps any geno partonomy) that are germline vs somatic.

how should these be represented? should there be classes for "germline variant" "somatic variant", or perhaps there is some kind of property about the origin of the variation "germline origin". if it's a "somatic origin", i could see perhaps linking the origin to a tissue or cell type, if known.

the use case for this is particularly for cancers, where major changes to the genome (chromosomal rearrangements) happen in only a subset of the cells but not the germline. also, it's possible that we might want to reason differently based on the origin, and in interfaces filtering/sorting on the origin is important.

we hadn't really needed this before as with mice and fish they were generally somatic, but in humans we definitely need it.
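
A hedged Python sketch of the "property about the origin" option discussed in this issue, assuming origin is a single value per variant and that a tissue is only recorded for somatic origin. `Origin`, `Variant`, and `filter_by_origin` are hypothetical names, not GENO classes or properties.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class Origin(Enum):
    GERMLINE = "germline"
    SOMATIC = "somatic"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Variant:
    label: str
    origin: Origin = Origin.UNKNOWN
    tissue: Optional[str] = None  # tissue or cell type of a somatic variant, if known

    def __post_init__(self) -> None:
        # Assumption of this sketch: a tissue only makes sense for somatic origin.
        if self.tissue is not None and self.origin is not Origin.SOMATIC:
            raise ValueError("tissue is only recorded for somatic variants")


def filter_by_origin(variants: Iterable[Variant], origin: Origin) -> List[Variant]:
    """Keep only variants with the requested origin (the filtering use case)."""
    return [v for v in variants if v.origin is origin]
```

Modeling origin as a property rather than as separate "germline variant" and "somatic variant" classes keeps the filtering use case simple; whether that is the better choice for the ontology is the open question of this issue.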

IMPReSS assays and phenotypes for KOMP2

Duplicated from: NIF-11832
Original Reporter: Melissa Haendel

This modeling is related to the association model which will likely live in GENO for now, so duplicating from the NIF JIRA into the GENO tracker, but it may ultimately live elsewhere.

We are going to need to be able to represent all aspects of KOMP2 phenotyping.
https://www.mousephenotype.org/impress/
(look at the different pipelines)

some of these have already been mapped to MP terms, but many aspects include staging at which an assay was performed, information about sex, etc.

First steps involve examining what is missing from our ontology structures, enhancing our ontologies, and pulling the data as it comes in.

karyotype

add karyotypes as a kind of (or is a part of) genotype?
there are also probably subtypes, like:

[Karyotype](http://en.wikipedia.org/wiki/Karyotype)
  Visible Karyotype (such as by FISH)
  [Virtual Karyotype](http://en.wikipedia.org/wiki/Virtual_karyotype)
    Array-based karyotype
      SNP array karyotype
      arrayCGH karyotype
    Sequencing-based karyotype

some issues:
*not sure you want to include methods (arrayCGH, etc) in the classes here (maybe too granular)? but these do convey some level of resolution that might otherwise be missed.
*sequence-based karyotype is kind-of a new technique. i found only this related patent here:
http://www.google.com/patents/US20050221341

for example, we get karyotypes from Coriell. being able to represent these in our model will be hugely important for cancer use cases. for example, "get all cell lines where karyotype information is available".

somewhat related to #11

Need more heredity in GENO

Transferred from: NIF-11673
Original Reporter: Nicole Washington

We need to expand the genetic heritability subclasses to include:

X-, Y-, Z-linked (allosomal)
autosomal
unknown genetic heritability

these should probably be linked to the phenotypic dominance terms. these will get used when the genetic basis of inheritance is known, but the dominance is unknown.

review various genomic rearrangement notation

i have been pointed to the documentation for chromosomal rearrangements:
http://onlinelibrary.wiley.com/doi/10.1002/0471142905.hga04cs17/pdf

this has a lot of good information about karyotypes, and all the kinds of things that can happen. here are some examples of real karyotypes in the Coriell data to model:

*46;XX;46;XX;der(10)(10qter>10p15::11q13>11qter) [13]/46;XX[37];46;XX{82}/46;XX;der(10)t(10;11)(p15;q13) {14}/47;XXX{4}
*46;XX;inv(8)(p11.2q22)[8]/46;XX;dup(14)(q32.2q31)[6]/46;XX;t(6;10)(p22;q26)[2]/46;XX[34]
*45;X/46;XX;46;XX
*45;X{4}/46;XY{92}/47;XY;+18{4};46;XY/47;XY;+18
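
As a first step toward modeling strings like these, here is a rough Python sketch that splits a karyotype into clones and pulls out a trailing cell count. It assumes that "/" separates clones and that a count, when present, is a trailing number in square brackets or braces. `Clone` and `split_clones` are hypothetical names, and this is not a full parser for the notation.

```python
import re
from typing import List, NamedTuple, Optional

# Trailing cell count in square brackets or braces, e.g. "[8]" or "{14}".
_COUNT = re.compile(r"\s*[\[{](\d+)[\]}]\s*$")


class Clone(NamedTuple):
    description: str
    cell_count: Optional[int]


def split_clones(karyotype: str) -> List[Clone]:
    """Split on "/" and separate each clone's description from its count."""
    clones = []
    for part in karyotype.split("/"):
        match = _COUNT.search(part)
        if match:
            clones.append(Clone(part[:match.start()].strip(), int(match.group(1))))
        else:
            clones.append(Clone(part.strip(), None))
    return clones


clones = split_clones(
    "46;XX;inv(8)(p11.2q22)[8]/46;XX;dup(14)(q32.2q31)[6]"
    "/46;XX;t(6;10)(p22;q26)[2]/46;XX[34]"
)
```

Counts that appear in the middle of a clone, as in the first example above, are left inside the description by this sketch and would need separate handling.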

promoters driving expression

we need to have some examples of how to model promoters of one gene driving the expression of another.

for example, there is a zfin construct Tg(zp3:fsta,myl7:EGFP) where the promoter of zp3 is driving the expression of fsta.

i am not sure how to capture this with GENO modeling. we have the identifiers for all of them.

my guess is that i might need to make a node that is a genomic feature for "promoter of zp3", which doesn't exist. maybe that would look something like:

:_promoter_of_zp3 a SO:0000167 ;
    RO:regulates ZFIN:ZDB-GENE-991129-7 .

but, then how would we say that this promoter also regulates the expression of fsta? it clearly only does so because of the construct, and isn't a "wildtype" property. also, the gene that is being expressed is a wildtype gene, but it is just that it might be in a different time/place/abundance...so its sequence isn't variant, rather its expression is variant.

does the construct itself become an "expression variant", with some expression-altered locus?

:_zp3:fsta a GENO:0000485 ;    # expression-altered locus
    GENO:has_expression-variant_part  ZFIN:ZDB-GENE-990714-11 .    # fsta

but i am not sure with this model how to reference anything about the promoter of zp3.

but we can also say:

:_zp3:fsta a GENO:0000485 ;
    GENO:is_expression_variant_of ZFIN:ZDB-GENE-990714-11 .

so, i'm not sure which is right, or if it's complete. help @mbrush !

Species specific classes for genome, chromosomes, and their parts

We are generating an ontology of reference chromosomes and their parts (arms, regions, bands) from the UCSC chromosome band data (see dipper issue #42). It is expected that this ontology will grow to include bands from the genomes of other species. To organize these in the ontology, we may want to create species-specific grouping classes at each of these levels (e.g. 'human genome', 'human chromosome', 'human chromosome arm', 'human chromosomal region', 'human chromosome band'). These will live in GENO, under their respective parent SO classes.
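
A tiny Python sketch of how the grouping-class labels could be generated per species, assuming the five levels named in this issue and a simple "species + level" label pattern. `LEVELS` and `grouping_labels` are hypothetical names; identifiers and SO parent classes are deliberately left out.

```python
from typing import List

# Levels named in the issue text, ordered from whole to part.
LEVELS: List[str] = [
    "genome",
    "chromosome",
    "chromosome arm",
    "chromosomal region",
    "chromosome band",
]


def grouping_labels(species: str) -> List[str]:
    """Build one grouping-class label per level for the given species."""
    return [f"{species} {level}" for level in LEVELS]
```

Calling `grouping_labels("human")` yields the five example labels given in the issue; the same call with another species name would give the corresponding set for that genome.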

Add transgenic insertion types to genotype model/SO

Transferred from: NIF-11935
Original Reporter: Melissa Haendel

Not necessarily exactly the same, but see types from MGI:
spontaneous
chemically induced (all)
chemically induced (ENU)
radiation induced
transposon induced
QTL
transgenic (all)
transgenic (random, expressed)
transgenic (random, gene disruption)
transgenic (Cre/Flp)
transgenic (Transposase)
gene trapped
targeted (all)
targeted (Floxed/Frt)
targeted (Reporter)
targeted (knock-out)
targeted (knock-in)

One of the main features we need to support here is conditional expression and/or tissue specific expression and representation of the transgenic insertion itself relative to the host genotype.