
Comments (31)

haussler commented on August 14, 2024

Thanks so much Cassie! I would vote for an enumerated data type for the names of the human chromosomes with values chr1, chr2, ..., chr22, chrX, chrY, chrM. The latter is for the mitochondrion. However, I am totally open to any nomenclature, just feel something should be standardized.


diekhans commented on August 14, 2024

Hi David,

Enumerated types are very problematic for any value where the domain is not fixed. For instance, we want this to be usable for non-human genomes as well. Instead, values should be validated against a controlled vocabulary, and the controlled vocabulary should be specified (probably as a URL).

One needs to think of enumerated types as something that would get compiled into code.

Chromosome symbolic names are especially problematic. We don't agree on them: chr1 vs. 1, chrM vs. MT. The RefSeq accession is probably the most accurate symbol.

Being able to accurately map between different symbolic names for the same sequence is important functionality. It should also be robust against naive mappings. For instance, chrM in UCSC hg19 is NC_001807.4, while in GRCh37-lite it is NC_012920.1. A handful of genes, 6 bases different in length, a lot of headaches.

Because mutation-calling tools use whole BAMs as the unit of work, special composite genome assemblies are defined. For instance, TCGA has BAMs that are GRCh37-lite+HPV. It would be great if the GA APIs removed the need for making composite BAMs and instead tools could work against a collection of BAMs.

Mark



lh3 commented on August 14, 2024

GAReference::name is a display name. It is unstable and at times ambiguous. GAReference::id is unique, but the comment explicitly said that it varies with implementations. My preference is to use GAReference::sourceAccession. It is universally unique and stable.


delagoya commented on August 14, 2024

+1 for GAReference::sourceAccession


pgrosu commented on August 14, 2024

+1 on @lh3 suggestion with GAReference::sourceAccession - less is definitely more in this case :)


calbach commented on August 14, 2024

Hi all. I work with @dglazer and @cassiedoll.

I just wanted to verify that this change doesn't imply that GAReference.sourceAccession becomes a required field.

My 2 cents is that GAReference.id would be a more correct choice here, as a client could then look up all secondary information including name and sourceAccession (which might be null) if desired, but I don't have as much context as you all as to the goals for these particular methods.


pgrosu commented on August 14, 2024

Welcome aboard @calbach and let's get you up to speed :) So in computer science I agree that an ID is the best for identifying cached data or other uniquely designated forms of data.

The thing is that in bioinformatics, the field sort of agreed on accession numbers for sequences (which can be of different forms). Below I provided the simplest link I could find which describes this:

http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter3.html

So the reason we would want to use sourceAccession, for instance, is that reads are aligned against it, on the assumption that the DNA sample one started from is a form of the reference. There is a lot more to it, but I'm not sure how deep to go given your background - don't hesitate to ask questions :)

If an ID really is necessary, one can derive it from sourceAccession by many means, such as an md5sum, but that's just an implementation decision, which could be specific to the organization or to optimizing data I/O based on how the data is stored.
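As a minimal sketch (in Python, purely for illustration - the helper name is made up), deriving such an implementation-internal id from the accession could be as simple as:

import hashlib

def reference_id_from_accession(source_accession):
    # Hash the accession string to get a deterministic internal id.
    # This is only an implementation choice; any stable hash would do.
    return hashlib.md5(source_accession.encode("ascii")).hexdigest()

print(reference_id_from_accession("NC_012920.1"))  # e.g. the GRCh37-lite chrM accession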

Hope it helps,
Paul


cassiedoll commented on August 14, 2024

@pgrosu - @calbach has plenty of bio experience, so he knows what accessions are :)


cassiedoll commented on August 14, 2024

I agree with @calbach that using our GAReference.id field could be easier for other API integrations.

Let's say matchmaker was given a sourceAccession to query over - how would they programmatically look up, say, the length of the sequence that accession represents? Is there some other API they could query to get the data?

If they were instead given a fully qualified GAReference.id (example: http://trace.ncbi.nlm.nih.gov/Traces/gg/references/{id}) - their code would be able to query that URL and get the length field as we have defined it in our API (as well as many other relevant fields).

If the INSDC does in fact have an API then that's fantastic - it could be used instead (and maybe we should add a link to it in our field comment).
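To make that concrete, here is a minimal sketch of such a lookup (Python, with a placeholder reference id, and assuming the endpoint returns the GAReference JSON as defined in the schema):

import json
import urllib.request

# Hypothetical fully qualified GAReference.id following the pattern above;
# the trailing reference id is a placeholder, not a real identifier.
url = "http://trace.ncbi.nlm.nih.gov/Traces/gg/references/some-reference-id"
with urllib.request.urlopen(url) as resp:
    reference = json.load(resp)

# length is one of the fields defined on GAReference; others (name, accession,
# md5, and so on) would come back in the same response.
print(reference["length"])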


cassiedoll commented on August 14, 2024

(also note that because NCBI and EBI are implementing the GA4GH API, another way to think of the references methods is as a standard programmatic wrapper around all the great existing accession work)


pgrosu commented on August 14, 2024

Thanks @cassiedoll - glad to have another fellow biologist on board :) For instance for GenBank, one way is to use R like this, but there are many ways (with BioPerl, BioPython, and many others):

library("seqinr")
choosebank("genbank")
getLength(query("AC=U49845"))
[1] 5028

Another is to query NCBI directly and count the bases of the full sequence:

$ curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=U49845" | sed 1d | tr -d "\n" | wc -m

5028

I guess there are many options...as they say "All roads lead to Rome" :)


lh3 commented on August 14, 2024

Accession numbers are stable and unique across the world. This is a crucial feature that binds different bio databases. I am not sure about your intention on GAReference::id. Are you electing it to be another globally stable identifier? I thought ID is specific and internal to an implementation. I am also not sure why getting length by sourceAccession is more difficult than by id. If we can query length by id, couldn't we naturally query by accession as well given that both id and sourceAccession are fields in GAReference?


calbach commented on August 14, 2024

Thanks @pgrosu, this is useful. I have a background in biology and I'm familiar with accessions, but I don't think I'd call myself a biologist; I'm primarily a CS guy and there are gaps in my biology knowledge.

The crux of the issue I'm raising is whether or not GAReference.sourceAccession is a required field. I see several possible issues with making this required:

  1. Let's say I create a new reference, which I've yet to submit to the INSDC. Is it not possible to represent this as a GAReference? Would I not be able to align reads to it, for instance, until it becomes accessioned?
  2. Imagine a user attempts to import an aligned BAM file (for example) into my GA4GH repository. From my reading of the spec, the SAM headers do not supply accession numbers. Does my implementation reject this data? Do I require the user to explicitly supply these accession numbers? Do I do a lookup against genbank or some other database to acquire the correct sourceAccession, before accepting this data into my system?

I'm not challenging whether or not they are useful, but data is messy and incomplete so it may not be reasonable to require them in all cases.

@lh3 I agree that the accession is likely more immediately useful to a biologist than the GAReference::id. However, this would preclude you from returning a Reference which lacked an accession. Maybe that's desired behavior for these methods; as I said before I have little context on them. It also feels inconsistent for an API to return a foreign key without also including its primary key; presumably most methods will operate exclusively or best on the primary key.

In any case, the difference between the two approaches seems to boil down to 1 vs 2 API calls to acquire the sourceAccession.

Are you electing it to be another globally stable identifier?

No, this should only be guaranteed to be unique within the repository.

I am also not sure why getting length by sourceAccession is more difficult than by id. If we can query length by id, couldn't we naturally query by accession as well given that both id and sourceAccession are fields in GAReference?

Yes, these are both possible and look to be supported in the v0.5 reference APIs. I imagine there will also be methods which more naturally operate on a single ID though (GetReference) and in these cases I would expect that ID to be GAReference::id.

As food for thought, another pattern which can be used in these situations (where you have a primary and foreign key corresponding to a single entity) is to wrap the IDs in a structure, for instance:
ReferenceKey { id string, accession string }


lh3 commented on August 14, 2024

The following is the submission process I am imagining. For alignments, the submitters have to specify a pointer to the exact reference genome used in alignment. This pointer is ideally the md5sum of each sequence, which is optionally available in the SAM header. Once the database sees a new md5sum, it will ask the submitter either to deposit the new sequence in NCBI, or to map the new sequence to an existing sequence derived from the same sourceAccession. There may be other options. For example, the read store may consider keeping the {md5sum, sequence} pair locally, or using EBI's md5 service used by CRAM tools. In all cases, the reference genome must be available before the submission of alignments or variants. An alignment/VCF without the corresponding reference genome is useless most of the time.

It should also be noted that accessions are generated by many biological databases, not limited to NCBI. A read store may give submitted novel sequences its own accession numbers, though we should avoid this if possible. Having reference sequences managed by one centralized database (i.e. RefSeq) is a lot better.

PS: when the SAM does not have md5sum, the read store can make a guess based on the name and sequence lengths and throw a warning. For human genomes, the guess can be very accurate. If a guess cannot be made, the submitters need to specify the genome manually. Alternatively, we may require md5sum in the submitted BAM. EBI, for example, imposes more restrictions on submitted BAMs, which they call archive BAMs.
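For what it's worth, the per-sequence checksum carried in the SAM header (the M5 tag) is just the md5 of the bases with whitespace stripped and everything uppercased, so a read store can recompute and compare it. A minimal sketch in Python (the FASTA parsing is deliberately simplistic and only for illustration):

import hashlib

def sequence_md5(seq):
    # SAM-style M5: strip whitespace, uppercase the bases, then md5 the result.
    cleaned = "".join(seq.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()

def fasta_md5s(path):
    # Yield (name, md5) for each record of a small FASTA file.
    name, chunks = None, []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if name is not None:
                    yield name, sequence_md5("".join(chunks))
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line.strip())
    if name is not None:
        yield name, sequence_md5("".join(chunks))

These values can then be matched against the M5 tags in submitted BAM headers, or against EBI's md5 lookup service, to decide whether a reference is already known.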


pgrosu commented on August 14, 2024

@calbach, that's totally fine and from my perspective the more diversity and points of view the better - we already have large overlaps across domains in this group which makes it fun :)

So @lh3 covered a lot of the bases here and I agree with his submission process vision.

So delving back into some genetics, the thing about some of these stretches of DNA is that they can be repeated, which can be important in disease identification. Some bacteria, as you may know, contain multiple copies of 16S rRNA, but that's another story. Thus knowing your reference sequence becomes that much more important.

Since the reads in a BAM file provide their locations with respect to a reference, it is essential to know where you are with respect to that sequence. Extracting portions from the SEQ column of the reads and then trying to determine, based on a genome, what the reference might have been could be a tricky path for the adventurous. The long-term goal is to perform large-scale analysis in the future as more data become available and centralized. At that stage, analysts will assume that a reference sequence exists and, if necessary, determine precisely its origin by sourceAccession.

Picture this scenario: Google has all this data and information stored on the cloud. One day you decide to do some analysis and open a Genome Browser accessing it. There you select several regions of the human genome, filter by a specific cancer subtype, and sort the variants by cancer stage. The last thing you want to worry about is that all this analysis is contaminated upstream by BAM files against questionable reference sequences. Basically everything has to be verified and confirmed.

Now regarding each of the points you raised:

  1. So based on the schema you can align to it, and the sourceAccession should be populated with a temporary accession number denoting that an accession number is pending - now it looks like null is a viable value. Later this should be updated after you get the appropriate accession number. The idea is that these datasets will be global for clinical use among other things, and thus everyone should be able to understand each other based on the same sourceAccession information.
  2. As @lh3 mentioned, your analysis will make the most sense with a reference sequence. So now there are two worlds one can venture into, plus my personal vision:

A. Users submit BAM files and their own reference sequence files. Verification here would not be necessary, but this makes the system not enterprise-friendly for researchers to collaborate on a unified platform.

B. Researchers select from this system a pre-populated GAReference::id or GAReferenceSet::id. If this is not available, then the researcher would need to generate the GAReferenceSet or submit one or more GAReference. Once approved then the BAM file(s) that were generated against these reference sequence(s) can be uploaded.

Now here is my vision of how I would like to see it :)

C. Researchers submit the reads from an experiment that were generated by an instrument (i.e. FastQ files). Then a new GAExperiment will have to also be populated with the appropriate experimental information. Then the researcher selects a GAReference::id or GAReferenceSet::id and assigns it to a BAM-generating pipeline. A VCF-generating pipeline will exist for variants where the set of GAReadAlignment and a GAReference can be selected for processing.

Basically the more automated the process, the fewer the chances that ambiguity in shared datasets will arise.

Thus in my view GAReference::sourceAccession should be a required field :)


calbach commented on August 14, 2024

Thanks, this gives me a much better idea of how you are envisioning the implementation and what kind of validation would be involved. I suppose I'm in the camp of requiring MD5s (or an explicit ReferenceSet::id) for data import, making a best effort to populate sourceAccession, but preferring null to a temporary accession when a sourceAccession is not available. I don't think all implementations should be in the business of assigning these IDs, as you say; I think this would dilute the value of the accession if we assigned these for every scratch GAReferenceSet a user created. I think it may be reasonable for some implementations to validate that this field is set, while others recommend it. Certainly I would intend to make a best effort to populate the field, the EBI MD5 service seems promising; however if these attempts yield no results, I would prefer not to block the user from continuing an import. So long as I have MD5s I can assert equivalence which means I may eventually acquire the proper accession or bases.

Going with a more extreme example, it seems to me it should be possible to write a small hermetic (no network calls) adapter library which reads in a BAM file or Picard SAMRecords and emits GAReadAlignment records along with a single GAReferenceSet. Setting sourceAccession to null here seems preferable to generating a meaningless temporary accession.
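A rough sketch of what the reference side of such an adapter could look like, assuming a reasonably recent pysam (the output dicts only mimic the GA4GH records; they are not the exact schema):

import pysam

def references_from_bam(path):
    # Build GAReference-like records from the BAM's @SQ header lines.
    # sourceAccession stays None because SAM headers do not carry accessions.
    with pysam.AlignmentFile(path, "rb") as bam:
        for sq in bam.header.to_dict().get("SQ", []):
            yield {
                "name": sq["SN"],
                "length": sq["LN"],
                "md5checksum": sq.get("M5"),   # optional M5 tag, if present
                "sourceAccession": None,
            }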

Anyways, with respect to the original issue, I think I still prefer GAReference::id to be returned if for no other reason than it would be inconsistent to return a foreign key without also returning the primary key. I do not feel strongly on these particular methods though.


pgrosu commented on August 14, 2024

@calbach, that sounds fine, but having maintained large-enterprise analysis systems and trained researchers on how to use them, what would prevent researchers from loading their own BAM files and always assigning the same GAReference::id or GAReferenceSet::id? Basically then many will just use the system as a cloud storage platform without anyone really making use of the BAM files, except for the researchers who loaded them.

I do not want to prevent regular users from loading data, but if it is not properly annotated then down the line it could prevent an important analysis from getting off the ground or completed, because a required experiment existed but was assumed not to have been generated due to cryptic annotation. You have to remember that some of these samples are not easy to get hold of, and thus some of these BAM files are precious. You cannot make significant discoveries without meeting a certain threshold in the design of experiments (DOE).


ekg commented on August 14, 2024

Commit ids have worked very well for git. It uses SHA-1 checksums. This, plus the size of the data and a date, would be very nearly world-unique without any central authority needing to organize things. What more is needed, and what is wrong with a simplistic model like this?
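A sketch of what such a content-derived identifier could look like (Python; the exact recipe - what goes into the digest and in what order - is of course up for debate):

import hashlib

def content_id(sequence, date_added):
    # git-style decentralized id: SHA-1 over the cleaned bases, their length,
    # and the date, so no central authority has to hand out identifiers.
    cleaned = "".join(sequence.split()).upper()
    payload = "%s\t%d\t%s" % (hashlib.sha1(cleaned.encode("ascii")).hexdigest(),
                              len(cleaned), date_added)
    return hashlib.sha1(payload.encode("ascii")).hexdigest()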


pgrosu commented on August 14, 2024

@ekg, you are right that for just storing by unique ids, that would be enough, but too simple is not always better. The bigger picture I'm trying to paint is that this API would be there to facilitate the storage component, which would provide the minimal and necessary information for properly enabling the analysis of clinical NGS data down the line. If I look at the goals of this part of the project at the following GA4GH link:

http://genomicsandhealth.org/our-work/working-groups/data-working-group

Reads: To provide APIs to interoperably store, process, explore and share DNA sequence reads across multiple organizations and on multiple platforms

Based on this, we cover the store and share portions well, but to properly process and explore, it is best to have more stringency with respect to the quality of information of what one is storing.

Researchers are usually comfortable with the bare minimum of annotation, in order to get to the analysis steps as quickly as possible. Maybe later there will be larger frameworks with more checks and balances along the way, but it is good to revisit the end goals as we go. All I am saying is that it is easier to add structure to the data at the beginning, through the API model, than to start adding it after the system is populated with a lot of data. Getting people to add more information after the data is already loaded is always much harder than getting it during the initial loading process.


lh3 commented on August 14, 2024

@ekg SAM has unfortunately adopted md5sum, which is much worse than sha1 in terms of uniqueness. Switching to sha1 may involve efforts from multiple teams (well, that is not the end of the world). Nonetheless, md5+length should be unique enough for reference sequences. I like your idea.

@calbach the bottom line is that when you submit a BAM, you must make sure the reference sequences are present either in NCBI/EBI/etc or in the read store. This is a very reasonable request that frees us from a lot of complications.


calbach commented on August 14, 2024

@ekg I assume the proposal is in reference to the original issue, namely the way in which the reference is described on beacon/matchmaker; I still like GAReference::id best, but I'd vote for MD5/SHA1 over accession (#116 also makes accession a less appealing choice here).

@lh3 Sounds reasonable.


lh3 commented on August 14, 2024

To get GAReference::id, we need to query either by sourceAccession or by md5sum. Why not directly use one/both of these two fields in beacon and matchmaker? I much prefer not to expose an implementation-specific internal ID.


lh3 commented on August 14, 2024

As to beacon and matchmaker, I think they should have a field referenceSetMd5 that links to a GAReferenceSet. The contig/chromosome name field in beacon and matchmaker appears in the referenceSet. This way, we don't need to standardize names. We can use a simple name like "chr2" without ambiguity. It is all determined by the GAReferenceSet.
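As a toy illustration of that idea (the field names here are made up, not the actual Beacon schema), a query could look like:

# Hypothetical beacon-style query: the plain name "chr2" is unambiguous
# because referenceSetMd5 pins down which GAReferenceSet it belongs to.
query = {
    "referenceSetMd5": "0123456789abcdef0123456789abcdef",  # placeholder checksum
    "referenceName": "chr2",
    "position": 1234567,                                     # placeholder coordinate
    "allele": "T",
}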


calbach commented on August 14, 2024

Oh, I just realized why I wasn't understanding your position. I was looking at the GAMatch as the response, and not as a request. In the case of the request, I think searching by md5 makes perfect sense. Sorry, this was my mistake.

For the matchmaker response, it would be nice if we just embedded the whole GAReference in the returned GAMatches as this will contain the name, md5, accession, and ID. I slightly prefer to refer to other objects by ID over nesting so that you can take advantage of caching and because nesting can become heavyweight in a listed result, but this is a valid alternative. For instance, Heroku recommends this in their API design guide: https://github.com/interagent/http-api-design#nest-foreign-key-relations. I see that GAMatch is used in both the request and the response, so I suppose the API could just allow searching by certain fields as populated on the GAMatch::reference, or else the request and response could diverge thereby making it clear which fields can be searched on.

The above assumes that GAReference::sequence is not populated, or ideally is moved behind a separate API endpoint (separate pull request), which I'm guessing is the direction the API is moving in, so that you can do substring queries, for instance.


lh3 commented on August 14, 2024

I was looking at Beacon because it is simpler. In beacon, chr only appears in the query, so I thought this thread was about the query. Sorry.

GAGene in the matchmaker works both as a response and as a request, as you said. It has chr and assembly fields. If we disallow two GAReferenceSets from having the same GAReferenceSet::assemblyId (this assemblyId can be null, but if not null, ideally it should be globally unique), GAGene::chr+assembly is sufficient. If we cannot guarantee the uniqueness of assemblyId in practice, I would prefer to add a referenceSetMd5 field to disambiguate the assembly.

In the response, it does not hurt to include the internal ID. However, it is more important to give a stable and globally unique accession, MD5 and/or common chromosome names, for various reasons.


pgrosu commented on August 14, 2024

@lh3, my feeling is that if we do not want multiple GAReferenceSets containing the same assemblyId, then these will have to be quite large - which might not always be preferred. If everything is determined by the GAReferenceSet, then some of these should be smaller. With that, I agree that GAGene::chr + GAGene::assembly should be sufficient for uniqueness, but with Matchmaker we want to get the most complete read information (i.e. reference, gene, etc.) for those rare samples. Based on the current schema, the only record that matches that is GAReadAlignment, since it references GAReadGroup by readGroupId, which in turn references GAReferenceSet by referenceSetId. Thus I would prefer to add to GAGene in Matchmaker an array of GAReadAlignment::id, or we can expand the schemas - including Beacon - to make them compatible with less overhead.

@calbach, regarding your earlier point, you probably would not want to return the whole GAReference, since it's too much overhead - a small pointer by ID is better for that. This is especially true for GAMatchResponse, since it's an array<GAMatch>.


cassiedoll commented on August 14, 2024

Alright, so I tried to gather all these threads and pull them into some kind of consensus:

  • In the backend of a matchmaker API, we would recommend they disambiguate between references with the md5sum.

    Unlike the other fields, every reference should have only one md5sum, which makes it the best for "matching" from a programmatic perspective.

  • In the response from the matchmaker API, as long as that md5sum is returned, an API would then be free to return whatever other fields they wanted (name, accession, assembly, etc.).

    We could recommend that they either embed an entire GAReference object, or return an id field that can be fetched for more info, but it would be up to that API to decide what works best for them.

  • In the request to the matchmaker API, we need to translate whatever user input is present into that md5sum for the backend. For this piece, it seems like sourceAccession, assemblyId, and a fully qualified GAReference.id could all work.

    To be most friendly to users, we would recommend that an API accept any of those, as long as the md5sum is also a valid input and is the field used in the backend.

What do you think?
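A minimal sketch of that request-side translation (Python; the store helper methods are invented purely for illustration, not part of any schema):

def resolve_md5(store, reference):
    # Accept whichever handle the client supplied and translate it to the
    # md5 checksum that the backend actually matches on.
    if reference.get("md5checksum"):
        return reference["md5checksum"]
    if reference.get("sourceAccession"):
        return store.md5_by_accession(reference["sourceAccession"])
    if reference.get("assemblyId"):
        return store.md5_by_assembly(reference["assemblyId"])
    if reference.get("id"):
        return store.md5_by_id(reference["id"])
    raise ValueError("request must identify the reference in some way")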


pgrosu commented on August 14, 2024

Backend: I agree.

Response: Definitely an id over an embedded object is preferred, just to normalize the overhead.

Request: Here the requests for matches can be varied. It can be sample name(s), gene variant(s), or sequence(s). We need to think more carefully about this, since we want to be able to say: "Get me all other cancer variants within these chromosomal regions that are stage 1-3, and similar to this sequence." Ideally one can feed in a sequence and ask to find the matches, sorted by matching score. It can get complex if people want more features (i.e. specific pipeline processing with unique parameters).

Also, could we add all the Matchmaker schemas to this GitHub repository, so everyone has more context and can properly contribute?


calbach commented on August 14, 2024

@cassiedoll sounds good. This would need to be well documented in the matchmaker API as to which fields are supported for searching on the request.


skeenan commented on August 14, 2024

Where are we on this issue? Can we close it?


delagoya commented on August 14, 2024

Closing in favor of #167.

