ga4gh / seqcol-spec

GA4GH Refget specifications docs

Home Page: https://ga4gh.github.io/refget

seqcol-spec's Introduction

Seqcol Docs

This is the repository for the Seqcol specification. These docs are written using mkdocs and hosted on readthedocs.

Building locally

Switch to the branch you want to build and use either mkdocs build or mkdocs serve to render the docs.

Contributing

Edit the files in the docs subdirectory. Please submit a PR to the appropriate branch for the version you are editing.

seqcol-spec's Issues

Will the API offer an alias to digest conversion endpoint?

One of the use cases brought up was this: what if a user wants to get the sequence collection checksum(s) from the name of the collection (e.g. grch38)?

We determined that Sequence collections should be congruent with the approach taken by refget in terms of allowing human-readable alias-based queries.

In this issue: samtools/hts-specs/issues/329 it seems clear that refget was not intended to do this.

@andrewyatz says:

I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation.

Secondly refget is not built to support sequence retrieval using an alias. Imagine the following URL /sequence/alias/chr1 and how impossible this is to resolve without additional metadata. Refget is trying to resolve this situation by using checksums so supporting alias lookup feels like it's going against refget's ethos.

That hopefully puts clear water between aliases e.g. chr1 and alternative methods of generating the checksum identifiers. We never intended to query the server by alias.

In light of this, I'd propose the seqcol spec specifically not provide endpoints that operate on human-readable aliases.

On the other hand, 'chr1' is a much more universal identifier than something like 'hg38', so perhaps there is some value in returning a list of identifiers that include "hg38" under "aliases".

Identifier vs digest in the specs

In the specification, the words "identifier" vs "digest" are used interchangeably - I have at least not noticed any patterns. I believe we decided to only talk about digests for now and wait with talking about identifiers until we (or someone else) solved the namespacing. That might have just been my opinion though, and not something we agreed on...

What characters should we use for delimiters?

We'd like to solicit community feedback on the delimiters to be codified in the official seqcol algorithm. You can read the draft specification here.

Brief explanation of what the delimiters will be used for

The seqcol spec requires 2 different delimiters:

ATTRIBUTE_DELIMITER is used to delimit the attributes in the string we digest. For example, if | were the delimiter, the string to digest could be chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816, where | delimits the name, length, and refget digest fields. It is used to delimit attributes within a single item.
ITEM_DELIMITER is used to delimit the individual items (sequences) in the collection. Since a collection contains more than 1 item, which we will concatenate, we need a delimiter. So, if / were the item delimiter, for a collection with 2 sequences, we'd have: chr1|248956422|2648ae1bacce4ec4b6cf337dcae37816/chr2|242193529|4bb4f82880a14111eb7327169ffb729b.
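
A minimal sketch of how the two delimiters would combine, using the | and / placeholders from the examples above. The function names `build_digest_string` and `digest` are hypothetical, and MD5 is only an assumption standing in for whichever hash the draft spec adopts.

```python
import hashlib

ATTRIBUTE_DELIMITER = "|"  # placeholder from the example above, not a decided value
ITEM_DELIMITER = "/"       # placeholder from the example above, not a decided value


def build_digest_string(items):
    """Join (name, length, refget_digest) tuples into the string to be digested."""
    return ITEM_DELIMITER.join(
        ATTRIBUTE_DELIMITER.join(str(attr) for attr in item) for item in items
    )


def digest(string):
    # Assumption: MD5 stands in for the hash function chosen by the spec.
    return hashlib.md5(string.encode("utf-8")).hexdigest()


items = [
    ("chr1", 248956422, "2648ae1bacce4ec4b6cf337dcae37816"),
    ("chr2", 242193529, "4bb4f82880a14111eb7327169ffb729b"),
]
to_digest = build_digest_string(items)
print(to_digest)
print(digest(to_digest))
```

Swapping the two constants is all it takes to try out either proposal below.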

Proposal 1: human-readable whitespace characters

One proposal argues that we should use human-readable characters:

Attribute delimiter: \t (tab)
Item delimiter: \n (unix newline).

The nice thing about this proposal is that you could just print this string and it would be visually appealing. Downsides include that the newline character is not cross-platform, and that some text editors may replace tabs with spaces.

Proposal 2: use ASCII non-printing character (NPC) separator delimiters

Attribute delimiter: UNIT SEPARATOR Position 31 (U+001F)
Item delimiter: RECORD SEPARATOR Position 30 (U+001E)

The arguments for proposal 2 are nicely laid out on the Wikipedia page for delimiter collision:

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions.

We can restrict the name field from officially allowing whitespace characters, but this won't necessarily prevent people from trying that, meaning we'll require more checks. The non-printing ASCII characters were explicitly created for this purpose and avoid a lot of the potential problems of using visible characters. We could design simple printing functions that would replace them for visualization purposes.
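
To make proposal 2 concrete, here is a small sketch of serializing with the two non-printing separators and of the kind of "simple printing function" mentioned above. The function names are hypothetical, and the name check is just one possible extra safeguard, not something the draft requires.

```python
UNIT_SEPARATOR = "\x1f"    # U+001F, proposed attribute delimiter
RECORD_SEPARATOR = "\x1e"  # U+001E, proposed item delimiter


def serialize(items):
    """Join attribute tuples using the non-printing separators."""
    return RECORD_SEPARATOR.join(
        UNIT_SEPARATOR.join(str(attr) for attr in item) for item in items
    )


def to_printable(serialized):
    """Replace the non-printing delimiters with tab and newline, for display only."""
    return serialized.replace(UNIT_SEPARATOR, "\t").replace(RECORD_SEPARATOR, "\n")


def validate_name(name):
    """Reject names containing either delimiter (a hypothetical extra check)."""
    return UNIT_SEPARATOR not in name and RECORD_SEPARATOR not in name


raw = serialize([("chr1", 248956422), ("chr2", 242193529)])
print(to_printable(raw))
```

The digest would always be computed on `raw`; `to_printable` exists only so that a human can inspect the string.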

Other options

If you like the general idea behind proposal 1 or 2, but prefer different specific delimiters (like a space, or something else), then please chime in. Here's a list of non-printable ASCII characters.

Documentation request: seqcol without sequences

Can you provide an example of the canonical seqcol object representation when sequences are not provided, as in F4?

Do you leave off the sequences entry, digest the empty string, or do something else?

@rob-p

Approval for array-based data structure and multi-tiered digests for sequence collections

I would like to solicit feedback from community members on the latest iteration of the digest algorithm for sequence collection identifiers. After lots of discussion (see #8, #1), here is the latest proposal.

It's an array of arrays, and here I'm showing just 3 arrays, but this approach works for any number of arrays, and is backwards compatible with sequence collections that lack certain array definitions.

The retrieval works like this

A simple server could allow only recursion=1, but we agreed that recursion=0 is very useful and should probably be a required part of the specification, while recursion=2 should probably be disabled. Given that the =0 and =1 layers are possible, this also enables retrieval of components, which is independently valuable:

Regardless of what elements end up in a sequence collection, we're in a position to approve this as the digest algorithm. Feedback welcome!

Should we specify a search function?

Various times we've discussed what we refer to as the search function. It's been raised in discussion and also in issues, e.g.:

Brief description of the search function

Given a sequence collection, find other sequence collections that are compatible with it. "Compatible" can have a variety of meanings here... it could mean looking for subset relationships, collections with same content but in different order, collections with same sequences but with different names, same lengths and names but different sequences, etc.

Now that we've come to agreement on the comparison function, we could think about the search function. The search function seems very useful, but it also seems time consuming. In a naive database that is just storing the objects, to calculate this would basically require running the compare function across all other collections in the database.

I suppose we could do something like pre-compute the comparison function for all pairs of collections, and then a search function might be more possible. Or, perhaps there's another way this could be implemented.

At the moment I'm not sure this should be within scope for sequence collections, at least not at this point. This seems like a separate service that could be built on top of the collection and comparison endpoints by computing lots of comparisons and then structuring the results into some kind of smart data structure so that a given search query wouldn't take too long to compute. Very useful, yes, but also probably an extension to seqcol.

Thoughts?

Sequence collection, ordered? or unordered?

According to the current definition, a "Sequence Collection" is "a set of reference sequences". However, I believe that in order to be useful, this should be changed to an "ordered set of reference sequences", as a change in the order of the collection has implications downstream. For example, a sorted BAM file will not be considered valid if one changes the order of its sequence dictionary. More generally, without a specified order, it will be up to the implementations to decide how to serialize any data that uses a collection.

Comparison function API specification

We earlier discussed the compatibility function in #7, which was focused on my original idea of the flag system. We've moved away from that approach to a more verbose return result. We've also moved toward having a single endpoint that accepts a POST, which can provide multiple options for sequence collection expansion levels. We will track the new approach in this issue.

How will the comparison function API be specified?

We must define 3 components:

endpoint name
Structure of POST content provided by user
Structure of value returned by server.

1. Endpoint name

Proposal: /compare, and it's a POST endpoint.

2. Structure of POST content provided by user

The user posts an object with two properties, seqcolA and seqcolB:

{ "seqcolA": ...,
  "seqcolB": ... }

The values of either entry must be in one of 3 forms: level 0: a string digest; level 1: an object with named string digests; or level 2: an object with named arrays.

Level 0 digests

This example shows a comparison of 2 digests:

{ 
  "seqcolA": "a6748aa0f6a1e165f871dbed5e54ba62",
  "seqcolB": "3b379221b4d6ea26da26cec571e5911c" 
}

Level 1 digests

{ 
  "seqcolA": { 
    "lengths":"4925cdbd780a71e332d13145141863c1", 
    "names":"ce04be1226e56f48da55b6c130d45b94", 
    "sequences":"3b379221b4d6ea26da26cec571e5911c"
  }, 
  "seqcolB": {
    "lengths":"4925cdbd780a71e332d13145141863c1",
    "names":"ce04be1226e56f48da55b6c130d45b94",
    "sequences":"3b379221b4d6ea26da26cec571e5911c",
    "topologies":"1be4f869c6f754a9b0a379f5c2aa4ff9"
  }
}

Level 2 objects

{ 
  "seqcolA": {
    "lengths": [
      "1216",
      "970",
      "1788"
    ],
    "names": [
      "A",
      "B",
      "C"
    ],
    "sequences": [
      "76f9f3315fa4b831e93c36cd88196480",
      "d5171e863a3d8f832f0559235987b1e5",
      "b9b1baaa7abf206f6b70cf31654172db"
    ]
  }, 
  "seqcolB": {
    "lengths": [
      "1216",
      "970",
      "1788"
    ],
    "names": [
      "A",
      "B",
      "C"
    ],
    "sequences": [
      "76f9f3315fa4b831e93c36cd88196480",
      "d5171e863a3d8f832f0559235987b1e5",
      "b9b1baaa7abf206f6b70cf31654172db"
    ],
    "topologies": [
      "linear",
      "linear",
      "linear"
    ]
  }
}

Any of these may also be mix-and-matched, so you can compare a level 0 digest to a level 2 object.
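
Since the three forms can be mixed, a server would first need to work out which level each posted value is. The sketch below shows one way to do that by inspecting the JSON types; `representation_level` is a hypothetical helper, not part of the proposal.

```python
def representation_level(value):
    """Return 0, 1, or 2 for a value posted as seqcolA or seqcolB."""
    if isinstance(value, str):
        return 0  # a single string digest
    if isinstance(value, dict) and value:
        if all(isinstance(v, str) for v in value.values()):
            return 1  # an object with named string digests
        if all(isinstance(v, list) for v in value.values()):
            return 2  # an object with named arrays
    raise ValueError("value is not a level 0, 1, or 2 representation")


payload = {
    "seqcolA": "a6748aa0f6a1e165f871dbed5e54ba62",
    "seqcolB": {"lengths": ["1216", "970", "1788"], "names": ["A", "B", "C"]},
}
levels = {key: representation_level(value) for key, value in payload.items()}
print(levels)
```

A level 0 or level 1 input would then have to be expanded by database lookup before any element-wise comparison can run.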

3. Structure of value returned by server

What does the return value from the API look like? Current proposal example:

{
  "lengths": {
    "any-elements-shared": true,
    "all-a-in-b": true,
    "all-b-in-a": true,
    "order-match": true
  },
  "names": {
    "any-elements-shared": true,
    "all-a-in-b": true,
    "all-b-in-a": true,
    "order-match": true
  },
  "sequences": {
    "any-elements-shared": false,
    "all-a-in-b": false,
    "all-b-in-a": false,
    "order-match": false
  }
}
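
A rough sketch of how a server might compute a result of this shape from two level 2 representations. The issue does not define the four flags precisely, so the definitions below are assumptions: the first three are set relations, and "order-match" is taken to mean the two arrays are identical element by element. The sequences in `seqcol_b` are made-up placeholders.

```python
def compare_arrays(a, b):
    """Compare two arrays and report the four flags from the proposal."""
    set_a, set_b = set(a), set(b)
    return {
        "any-elements-shared": bool(set_a & set_b),
        "all-a-in-b": set_a <= set_b,
        "all-b-in-a": set_b <= set_a,
        # Assumption: "order-match" means identical arrays, element by element.
        "order-match": list(a) == list(b),
    }


def compare(seqcol_a, seqcol_b):
    """Compare two level 2 representations on the arrays they both define."""
    shared = sorted(set(seqcol_a) & set(seqcol_b))
    return {name: compare_arrays(seqcol_a[name], seqcol_b[name]) for name in shared}


seqcol_a = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    "sequences": ["76f9f331", "d5171e86", "b9b1baaa"],
}
seqcol_b = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    "sequences": ["00000001", "00000002", "00000003"],  # placeholder digests
}
result = compare(seqcol_a, seqcol_b)
print(result)
```

With these inputs the lengths and names flags are all true and the sequences flags are all false, matching the shape of the example above.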

Should there be endpoints to test existence in database?

As I was implementing the /collection and /comparison endpoints we discussed, I thought of a few other possible uses. I would like to know if people think these should be part of the spec.

Overview

Given a POST request the service could notify if this collection is present in the database, and at what level.

Level 1 input

If input is a level 1 representation, INPUT looks like this:

{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c"
}

And the response looks like:

{
  "exists": {
    "0": "true",
    "1": {
      "lengths": "true",
      "names": "true",
      "sequences": "true"
    }
  }
}

Level 2 input

To do this for a level 2 representation, input is:

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}

Response:

{
  "digests": {
    "0": "a6748aa0f6a1e165f871dbed5e54ba62",
    "1": {
      "lengths": "4925cdbd780a71e332d13145141863c1",
      "names": "ce04be1226e56f48da55b6c130d45b94",
      "sequences": "3b379221b4d6ea26da26cec571e5911c"
    }
  },
  "exists": {
    "0": "true",
    "1": {
      "lengths": "true",
      "names": "true",
      "sequences": "true"
    }
  }
}

For this to work with level 2 inputs, we have to compute the digests on the server. That means this endpoint could actually be used as a digest computing service for local collections. Thoughts?
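
As a sketch of the server-side work this implies for level 2 input: compute the level 1 and level 0 digests, then look each one up. Everything here is illustrative. The digest rule (comma-join plus MD5) is an assumption rather than the seqcol algorithm, the two lookup sets stand in for a real database, and the sketch returns JSON booleans instead of the "true" strings shown above.

```python
import hashlib


def digest_array(values):
    # Assumption: comma-join and MD5; the real spec defines its own rule.
    return hashlib.md5(",".join(values).encode("utf-8")).hexdigest()


def check_existence(level2, known_array_digests, known_collection_digests):
    """Compute digests for a level 2 input and report which ones are already stored."""
    level1 = {name: digest_array(values) for name, values in level2.items()}
    level0 = digest_array([level1[name] for name in sorted(level1)])
    return {
        "digests": {"0": level0, "1": level1},
        "exists": {
            "0": level0 in known_collection_digests,
            "1": {name: d in known_array_digests for name, d in level1.items()},
        },
    }


level2 = {"lengths": ["1216", "970", "1788"], "names": ["A", "B", "C"]}
stored_arrays = {digest_array(["A", "B", "C"])}  # toy database: only names is known
response = check_existence(level2, stored_arrays, set())
print(response["exists"])
```

Because the digests come back in the response, the same call doubles as the digest-computing service mentioned above.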

This is starting to get into the idea of the search function, but this is actually much simpler than the search function we envisioned.

Should we prefix the digests that we return from seqcol?

In issue #37 we raised the point of "to prefix or not to prefix", but there were really 2 issues being discussed there:

Should seqcol prefix digests that become part of the seqcol representation that is digested?
Should seqcol prefix the seqcol digests it generates? And should it require these for queries?

Issue #37 discusses the first issue, which we decided and posted an ADR for (the upshot is we don't add anything specifically, but if an external protocol like refget specifies that such and such prefix is actually part of the identifier, then clearly we just take that at face value).

This issue is meant to track the 2nd point: Should seqcol prefix the seqcol digests it generates? And should it require these for queries?

What do we want to accept in the API? With or without prefixes?
What does the server serve, i.e. the output provided to the user? Do we have to say that these strings have to be prefixed with something? When we return things, do we include these prefixes? Or do we make it user-controlled through query parameters or something?

My current thinking is that the answer should be No.

I think we should never add or expect prefixes. They are for entities that surround the spec, not for the spec itself.

Comparison function does not maintain row-wise dependencies when reporting on order

There is one problem with the current solution for the comparison function that I believe we have not properly considered. It might be that we are OK with the current functionality, but I think that it should be a conscious decision, and we should report this as a known issue.

The issue is best explained with a simple contrived example. Given the following sequence collection A:

names	lengths	sequences
chr1	12345	96f04ea2c
chr2	23456	00330e995
chr3	34567	572853213

Let's compare this with sequence collection A', where we shuffle the rows, e.g.:

names	lengths	sequences
chr3	34567	572853213
chr1	12345	96f04ea2c
chr2	23456	00330e995

The comparison function would return the following:

{
  "digests": {
    "a": "b57173a40",
    "b": "1ab89fe61"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false
    }
  }
}

So let's say we instead shuffle the names and sequences array independently, but let the lengths array follow the sequences to keep the internal consistency, such as in the following sequence collection A'':

names	lengths	sequences
chr2	23456	00330e995
chr1	34567	572853213
chr3	12345	96f04ea2c

Then comparing any two of the three collections A, A', and A'' would give the same results from the comparison function (except the digests, of course). Such a result would typically be interpreted as "they have the same sequences, the only difference is their order". But they most definitely are not the same sequences, as the names refer to different sequences in A'' compared to the other two.

The reason behind this is simply that the comparison function considers each array individually, which is again due to the fact that we are structuring the sequence collections array-wise instead of item-wise (or column-wise instead of row-wise, if you want).
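
The example can be reproduced in a few lines. The sketch below recomputes the array-wise part of the comparison output for A vs A' and A vs A'', then adds a row-wise check for contrast. The row-wise check is only an illustration of the missing information, not the suggestion alluded to in this issue, and both function names are hypothetical.

```python
A = {
    "names": ["chr1", "chr2", "chr3"],
    "lengths": [12345, 23456, 34567],
    "sequences": ["96f04ea2c", "00330e995", "572853213"],
}
A_PRIME = {  # rows of A shuffled together
    "names": ["chr3", "chr1", "chr2"],
    "lengths": [34567, 12345, 23456],
    "sequences": ["572853213", "96f04ea2c", "00330e995"],
}
A_DOUBLE_PRIME = {  # names shuffled independently of lengths and sequences
    "names": ["chr2", "chr1", "chr3"],
    "lengths": [23456, 34567, 12345],
    "sequences": ["00330e995", "572853213", "96f04ea2c"],
}


def arraywise_summary(a, b):
    """Array-by-array comparison, mirroring the 'elements' block above."""
    shared = sorted(set(a) & set(b))
    return {
        "a-and-b": {k: len(set(a[k]) & set(b[k])) for k in shared},
        "a-and-b-same-order": {k: a[k] == b[k] for k in shared},
    }


def rows(seqcol):
    """Item-wise view: the set of (name, length, sequence) rows."""
    return set(zip(seqcol["names"], seqcol["lengths"], seqcol["sequences"]))


print(arraywise_summary(A, A_PRIME) == arraywise_summary(A, A_DOUBLE_PRIME))  # True
print(rows(A) == rows(A_PRIME))         # True
print(rows(A) == rows(A_DOUBLE_PRIME))  # False
```

The array-wise summaries are indistinguishable, while the row sets reveal that A'' pairs names with different sequences.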

Granted, this is in practice an edge case which might never happen in the data itself. But it could very possibly appear due to some coding bug. To me, having this logical flaw reduces the trust a consumer can have in the comparison function.

I do have a suggestion that might solve this and other related issues. Sorry for not posting this earlier, but I have been swamped with work lately.

Should lengths and names be required properties in every sequence collection?

I wanted to summarise one of the outcomes of today's call and clarify a comment made in the PRC feedback document:

I suggest that the specification would mandate that at least one 'required' attribute is used but would not define which one it is. Over time, Refget Collections specification would create recommendations for the use of 'required' fields for different domains e.g. for genome archives this might be to have TWO different Refget Collections digests: 'names, lengths, sequences' and 'accessions, lengths, sequences'.

In the ADR from 2023-07-12, we decided that the only mandatory properties in a sequence collection would be lengths and names.
The argument made today was that by requiring lengths and names to be present, we're potentially forcing these attributes into use cases where they are not relevant or, in some cases, not available. The example given was that of a CRAM file that contains a digest for each sequence but does not contain the length.

The argument in favour of having required fields is one of interoperability. Guaranteeing the presence of the two fields helps make different services compatible by always having common ground.

Reading back the ADR from 2023-07-12, the rationale does not seem to be about why lengths and names should be made mandatory, but about why sequences should not be made mandatory, since that would have prevented the coordinate-space use case from being implemented. I think a similar argument can be made about other use cases we might not have envisioned.

@raskoleinonen, please correct me if I misrepresented your point

@nsheff @sveinugu @andrewyatz please chime in as any change would have to be made relatively soon.

Use case: a digest for a collection of sequences

I was reading through the draft of the SeqCol specification. We have an envisioned use case that complements federated VRS, that would benefit from the notion of a unique digest that represents a sequence collection by sequence content only–no sequence names.

As the specification allows for the digest serialization of sequence collections by name and length only (with no sequence digests), would it make sense to also enable sequence collections to allow for length and sequence digests only (with no names)? Or even just sequence digests only (with no lengths or names)?

Reserved namespace policy for future extension of SeqCol

One comment that came up during the Connect meeting that I thought was really interesting was the idea of conventions for a reserved array namespace in SeqCol. The idea being that if future properties are identified that would become part of the specification, it would provide a mechanism for adding those properties without potentially colliding with implementation-defined array names. An idea proposed was that a case-sensitive convention could be used for this purpose, which may have implications for #33.

Revise decision record: sorted_sequences

https://github.com/ga4gh/seqcol-spec/blob/4722c713a9a5692b255f8bdf9ab27c18ca297042/docs/decision_record.md?plain=1#L19-L21

The above text conveys some opinions about the relative unimportance and utility of this feature.

From my perspective, using this feature is important, as it allows us to design federated applications with a reduced overhead on implementations: if a collection of sequences from which a dataset was generated can be universally identified by a GA4GH digest, based purely on sequence content (and not names, lengths, or order of sequences), then a one-time digest is sufficient for answering the question "do these resources represent variants in the same coordinate space?". I think it is much more likely that two implementations across a network will have the same underlying sequence content, but describe sequences in different orders or with different names (and therefore may not be able to compute /comparison/:digest1/:digest2 if digest2 is an unrecognized permutation of digest1 by that instance). It therefore seems to me that implementing this feature represents the most obvious case for describing Sequence Collections across a federated service network, instead of a case with limited / fringe utility.

I recognize that as someone looking at this spec from the outside, I may not be grasping the larger importance of the comparison function or how Sequence Collections are intended to be used. And in deference to the work and internal discussions of the SeqCol team I am not suggesting that this feature become required or an exemplar case for Sequence Collections. However, I think it would be more inclusive of potential adopters with a similar view as my own to simply describe this feature and its use in the decision record, and tone down any language that may appear dismissive of this use case.

What characters should be allowed in sequence names?

The seqcol algorithm will include the names of the sequences within the strings that we hash to form the seqcol digests. For example, chr1. You can read the draft specification here. What should be the allowable set of characters included in these sequence names?

The current proposal is to follow the specification defined for SAM headers: https://samtools.github.io/hts-specs/SAMv1.pdf#subsubsection.1.2.1

In short that is:

Reference sequence names may contain any printable ASCII characters in the range [!-~] apart from backslashes, commas, quotation marks, and brackets, i.e., apart from ‘\ , "‘’ () [] {} <>’, and may not start with ‘*’ or ‘=’.
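
For reference, the rule quoted above can be written as a regular expression. The sketch below follows the character set described in the quoted text; `is_valid_name` is a hypothetical helper, and adopting this exact rule for seqcol is precisely what is under discussion here.

```python
import re

# Allowed first characters: printable ASCII minus the excluded set, and minus '*' and '='.
_FIRST = r"0-9A-Za-z!#$%&+./:;?@^_|~\-"
# Allowed later characters: the same set, plus '*' and '='.
_REST = r"0-9A-Za-z!#$%&*+./:;=?@^_|~\-"
NAME_RE = re.compile(rf"[{_FIRST}][{_REST}]*")


def is_valid_name(name):
    """Check a sequence name against the SAM-style rule quoted above."""
    return NAME_RE.fullmatch(name) is not None


for candidate in ["chr1", "HLA-A*01:01", "*chr1", "chr 1", "a,b"]:
    print(repr(candidate), is_valid_name(candidate))
```

A more permissive seqcol rule would amount to widening the two character classes.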

In theory, we can be more relaxed because seqcol itself doesn't have the same restrictions of a sam file. So, we could define a more permissive set of allowable characters for seqcol purposes. The advantage would be that seqcol could then be used for a wider array of possible sequence collections, even ones that wouldn't work in a sam file; the disadvantage is that perhaps we lose an opportunity to encourage people to standardize, which could prevent some headaches downstream.

Add sorted_sequences as recommended non-inherent attribute

Some feedback from the PRC was that we could think about another RECOMMENDED non-inherent attribute to live alongside sorted_name_length_pairs, that would be a digest for the sequences that does not respect order. So, something like: sorted_sequences.

This digest would allow you to easily assess order-invariant equivalence of sequences without having to use the comparison function, which would be useful for some use cases.
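
A minimal sketch of what such an attribute could look like: sort the sequence digests before digesting them, so that any permutation of the same sequences yields the same value. The join-and-MD5 rule and the function name are assumptions for illustration only.

```python
import hashlib


def sorted_sequences_digest(sequence_digests):
    """Order-invariant digest over a collection's sequence digests."""
    # Assumption: sort, comma-join, then MD5; the spec would define the real rule.
    canonical = ",".join(sorted(sequence_digests))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


original = [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db",
]
reordered = [original[2], original[0], original[1]]
print(sorted_sequences_digest(original) == sorted_sequences_digest(reordered))
```

Two collections with equal values here contain the same sequences, regardless of order or names, without a call to the comparison function.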

Terminology streamlining: how do we refer to the various levels of seqcol representation?

Earlier I think @sveinugu raised the point that we've used some inconsistent terminology with how we refer to the different representations of a seqcol, like 'layer 1' vs 'layer 2', "top level", etc. I think it would be good to nail that down. Here's a proposal:

We refer to representations in "levels". The level number represents the number of "lookups" you'd have to do from the "top level" digest. So, we have:

Level 0 (AKA "top level")

Just a plain digest. This corresponds to 0 database lookups. Example:

a6748aa0f6a1e165f871dbed5e54ba62

Level 1

What you'd get when you look up the digest with 1 database lookup and no recursion. Previously called "layer 0" or "reclimit 0" because there's no recursion. Also sometimes called the "array digests" because each entity represents an array.

Example:

{
  "lengths": "4925cdbd780a71e332d13145141863c1",
  "names": "ce04be1226e56f48da55b6c130d45b94",
  "sequences": "3b379221b4d6ea26da26cec571e5911c"
}

Level 2

What you'd get with 2 database lookups (equivalently, 1 recursive call). This is the most common representation, more commonly used than either the "level 1" or the "level 3" representations.

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}

Level 3

What you'd get with 3 database lookups (equivalently, 2 recursive calls). The only field that can be further populated is sequences, so the level 3 representation provides the complete data. This level:

can potentially be very large
is the only level that requires outsourcing a query to a refget server
is disabled on my demo server; it works fine for small seqcols, but is impractical for a complete reference genome and the server just hangs.

Example (sequences truncated for brevity):

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "CATAGAGCAGGTTTGAAACACTCTTTCTGTAGTATCTGCAAGCGGACGTTTCAAGCGCTTTCAGGCGT...",
    "AAGTGGATATTTGGATAGCTTTGAGGATTTCGTTGGAAACGGGATTACATATAAAATCTAGAGAGAAGC...",
    "GCTTGCAGATACTACAGAAAGAGTGTTTCAAACCTGCTCTATGAAAGGGAATGTTCAGTTCTGTGACTT..."
  ]
}
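
The relationship between levels 0, 1, and 2 can be sketched with a toy in-memory store, using the example digests from this issue. The `DB` dictionary and the `retrieve` function are hypothetical; level 3 is left out because it would require a refget server.

```python
# Toy "database": digest -> stored value, filled with the example values above.
DB = {
    "a6748aa0f6a1e165f871dbed5e54ba62": {
        "lengths": "4925cdbd780a71e332d13145141863c1",
        "names": "ce04be1226e56f48da55b6c130d45b94",
        "sequences": "3b379221b4d6ea26da26cec571e5911c",
    },
    "4925cdbd780a71e332d13145141863c1": ["1216", "970", "1788"],
    "ce04be1226e56f48da55b6c130d45b94": ["A", "B", "C"],
    "3b379221b4d6ea26da26cec571e5911c": [
        "76f9f3315fa4b831e93c36cd88196480",
        "d5171e863a3d8f832f0559235987b1e5",
        "b9b1baaa7abf206f6b70cf31654172db",
    ],
}


def retrieve(digest, level):
    """Return the representation of a seqcol at the requested level (0, 1, or 2)."""
    if level == 0:
        return digest  # no lookup at all
    level1 = DB[digest]  # first lookup: the named array digests
    if level == 1:
        return dict(level1)
    if level == 2:
        # second round of lookups: one per array digest
        return {name: list(DB[array_digest]) for name, array_digest in level1.items()}
    raise ValueError("level 3 needs a refget server and is not sketched here")


top = "a6748aa0f6a1e165f871dbed5e54ba62"
print(retrieve(top, 1))
print(retrieve(top, 2)["names"])
```

Each level adds one round of lookups to the one before it, which is what the numbering is meant to capture.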

Proposal

The proposal is we'd now refer to these as:

"level 0 representation" or "primary digest";
"level 1 representation", or "level 1 digests";
"level 2 representation"; and
"level 3 representation" of a sequence collection.

API endpoint URL design

Spinoff from #21.

How should the API user interface be designed?

1. Compare endpoint

We decided the GET and POST should both use /compare somehow. Should it be:

/compare/:digestA/:digestB, or
/:digestA/compare/:digestB

The first seems to envision compare as a standalone function which takes 2 values. The second seems to envision compare as an operation that can be done on a particular seqcol, with a second one as the function parameter.

2. Sequence collection retrieval endpoints

Along these lines, how should the retrieval endpoints work? What is the name of the endpoint?

/:digest/:recursionLevel (e.g. /a6748aa0f6a1e165f871dbed5e54ba62/1 for level 1).
/seqcol/:digest/:recursionLevel
/retrieve/:digest/:recursionLevel,
or should recursionlevel go before digest, like /retrieve/:recursionLevel/:digest,

Are there any other endpoints that should be provided?

Reverse complement of a sequence and possible supporting reverse complemented coordinates

@ahwagner raised in the VRS/VCF meeting the idea of reverse complementing sequences and making them available in refget. The idea being if you are going to refer to an event on the opposing strand, such as a transcript, then asking for regions in the relative coordinate system would make sense.

From a basic refget POV this would require knowing what the sequence's reverse complement checksum is, converting that into the forward orientation alongside the requested coordinates and then reverse complementing the response.
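
The coordinate conversion described above is short enough to sketch. The code assumes 0-based, half-open coordinates and a plain DNA alphabet, and `fetch_reverse_strand` is a hypothetical helper rather than anything refget provides.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string; other characters (e.g. N) are kept as-is."""
    return seq.translate(_COMPLEMENT)[::-1]


def fetch_reverse_strand(forward_seq, start, end):
    """Return [start, end) of the reverse-complemented sequence.

    Only the forward sequence is stored: the requested coordinates are mapped
    to the forward orientation, and the retrieved slice is reverse complemented.
    Coordinates are assumed to be 0-based and half-open.
    """
    length = len(forward_seq)
    fwd_start, fwd_end = length - end, length - start
    return reverse_complement(forward_seq[fwd_start:fwd_end])


forward = "AACCGGTTAC"
print(fetch_reverse_strand(forward, 2, 7))
print(reverse_complement(forward)[2:7])
```

Both print statements give the same subsequence, so a server would never need to store the reverse-complemented sequence itself.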

However this does make the refget /sequence/checksum endpoint somewhat more complicated and brings with it additional semantics that a server may or may not support. This could be communicated via service-info. This problem might not be a seqcol issue but recording the reverse complement checksum could be an additional array.

Fingerprinting sequences using gaps

Following on from the previous discussions today, I spent some time writing up a bit of code to find gaps in a FASTA file. It's in Python so you don't need to worry about Perl. I ran this over chromosome 22 from UCSC and Ensembl. I also verified this against the AGP file from UCSC (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz).

$ curl -s https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz  | gzip -dc | grep --regexp="\t[NU]\t" | grep "chr22\t"
chr22	1	10000	1	N	10000	telomere	no	na
chr22	10001	10510000	2	N	10500000	short_arm	no	na
chr22	10784644	10834643	5	N	50000	contig	no	na
chr22	10874573	10924572	7	N	50000	contig	no	na
chr22	10966725	11016724	9	N	50000	contig	no	na
chr22	11068988	11118987	12	N	50000	contig	no	na
chr22	11160922	11210921	14	N	50000	contig	no	na
chr22	11378057	11428056	16	N	50000	contig	no	na
chr22	11497338	11547337	19	N	50000	contig	no	na
chr22	11631289	11681288	23	N	50000	contig	no	na
chr22	11724630	11774629	25	N	50000	contig	no	na
chr22	11977556	12027555	27	N	50000	contig	no	na
chr22	12225589	12275588	29	N	50000	contig	no	na
chr22	12438691	12488690	31	N	50000	contig	no	na
chr22	12641731	12691730	33	N	50000	contig	no	na
chr22	12726205	12776204	35	N	50000	contig	no	na
chr22	12818138	12868137	37	N	50000	contig	no	na
chr22	12904789	12954788	39	N	50000	contig	no	na
chr22	12977326	12977425	41	N	100	contig	no	na
chr22	12986172	12994027	43	N	7856	scaffold	yes	paired-ends
chr22	13011654	13014130	45	N	2477	scaffold	yes	paired-ends
chr22	13021323	13021422	47	N	100	contig	no	na
chr22	13109445	13109544	49	N	100	contig	no	na
chr22	13163678	13163777	51	N	100	contig	no	na
chr22	13227313	13227412	53	N	100	contig	no	na
chr22	13248083	13248182	55	N	100	contig	no	na
chr22	13254853	13254952	57	N	100	contig	no	na
chr22	13258198	13258297	59	N	100	contig	no	na
chr22	13280859	13280958	61	N	100	contig	no	na
chr22	13285144	13285243	63	N	100	contig	no	na
chr22	14419455	14419554	65	N	100	contig	no	na
chr22	14419895	14419994	67	N	100	contig	no	na
chr22	14420335	14420434	69	N	100	contig	no	na
chr22	14421633	14421732	71	N	100	contig	no	na
chr22	15054319	15154318	73	N	100000	contig	no	na
chr22	16279673	16302843	99	N	23171	scaffold	yes	unspecified
chr22	16304297	16305427	101	N	1131	scaffold	yes	paired-ends
chr22	16307049	16307605	103	N	557	scaffold	yes	paired-ends
chr22	16310303	16310402	105	N	100	scaffold	yes	paired-ends
chr22	16313517	16314010	107	N	494	scaffold	yes	paired-ends
chr22	18239130	18339129	142	N	100000	contig	no	na
chr22	18433514	18483513	144	N	50000	contig	no	na
chr22	18659565	18709564	146	N	50000	contig	no	na
chr22	49973866	49975365	1119	N	1500	contig	no	na
chr22	50808469	50818468	1153	N	10000	telomere	no	na

I then ran the gap finder code on two files. One from UCSC which had to be pulled out using their twoBitToFa tool and the second from Ensembl's FTP site.

Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.
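
The gap finder itself is not shown in this issue, so the following is only a guess at its general shape, not the author's code: it reports each run of N as a 1-based, inclusive (start, end, length) triple, the same convention as the AGP rows above. The function names and the tiny test FASTA are made up for illustration.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence):
    """Return (start, end, length) for each run of N, 1-based and inclusive."""
    return [
        (m.start() + 1, m.end(), m.end() - m.start())
        for m in re.finditer(r"[Nn]+", sequence)
    ]


fasta = [">chrTest demo", "NNNNACGT", "ACNNNACG", "TNN"]
for name, seq in read_fasta(fasta):
    print(name, find_gaps(seq))
```

The list of triples for each sequence is the "gap fingerprint" that could then be compared or digested.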

The biggest issue with this method is that if there are no gaps in the underlying sequence, it cannot be used to define a meaningful fingerprint. As said on the call, the rise of full-length assemblies and improvements in sequencing methods mean this will have limited value, but there is an argument that it is still useful.

I would suggest that if we take on this idea, it should be an optional extension.

Minimal and extended schemas proposal

We decided to start with two schemas: a minimal schema that we would post now as what we should implement, and then an extended schema, which is in evaluation stage to see if it should end up in the minimal schema. Here are some drafts of these for comment and revision:

Minimal seqcol schema

description: "A collection of biological sequences, defined by the GA4GH Sequence Collections standard."
$id: "/schemas/seqcol_base"
version: 0.1.0
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable identifiers of each sequence (e.g. chromosome names or accessions)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    description: "Digests of sequences computed using the GA4GH digest algorithm (sha512t24u)."
    items:
      type: string
  sorted_name_length_pairs:
    type: array
    description: "Sorted digests of names+lengths pairs, computed following the seqcol specification."
    items:
      type: string
required:
  - lengths
  - names
inherent:
  - lengths
  - names
  - sequences

Extended seqcol schema

$ref: "/schemas/seqcol_base"
$id: "/schemas/seqcol_extended"
properties:
  masks:
    type: array
    collated: true
    description: "Digests of subsequence masks indicating subsequences to be excluded from an analysis, such as repeats"
    items:
      type: string
  priorities:
    type: array
    collated: true
    description: "Annotation of whether each sequence is a primary or secondary component in the collection."
    items:
      type: boolean
  topologies:
    type: array
    collated: true
    description: "Annotation of whether each sequence represents a linear or other topology."
    items:
      type: string
      enum: ["circular", "linear"]
      default: "linear"
  molecule_types:
    type: array
    collated: true
    description: "Designation of the type of molecule for each sequence, such as RNA, DNA, or protein."
    items:
      type: string
  alphabets:
    type: array
    collated: true
    description: "The set of characters actually present in each sequence"
    items:
      type: string
  alphabet_domains:
    type: array
    collated: true
    description: "The set of characters that could be included in each sequence"
    items:
      type: string

Alignment: `inherent` property

The VRS 2.0 specification uses a property (ga4ghDigest.keys) to indicate keys that are used in creating VRS computed identifiers. This has the same functionality as the SeqCol inherent attribute.

This has been implemented across 16 VRS classes. The ga4ghDigest object is also used for other digest-related keywords, including ga4ghDigest.prefix for VRS objects that are identifiable.

Here are some examples:

This is an opportunity to align these terms before either standard is finalized. Is it feasible to reuse the VRS ga4ghDigest structure for Sequence Collections in lieu of inherent?

Accepted/Recommended Sequence digest algorithms

What are the accepted digests to represent a sequence? Refget is not prescriptive about the digest that can be used, which means there can be as many digests for one sequence as there are digest algorithms.
To avoid a proliferation of SeqCol digests, the specification will have to limit the digests that can be used to specify sequences.
We could specify only one accepted sequence digest algorithm.
We could have a list of recommended sequence digest algorithms.

Would molecule designation also know about how many strands there are?

@ahwagner asked the question if the molecule type field could differentiate between single and double stranded DNA.

Steps towards submitting the product to GA4GH

These are the steps I believe are needed before we can submit to GA4GH product approval

Fill out the necessary forms from secretariat (Reggan to distribute)
Complete security and REWS review forms
Complete two implementations of the seqcol specification
Formalise what will be submitted, i.e. what is the spec and what are the additional items we are writing alongside the spec
Complete client(s) for the specification
Compliance checks/suite of tools. @yash-puligundla to advise

Proposal: the attribute endpoint

Capturing an idea from today's meeting.

Motivation

In various recent discussions, people have re-emphasized the utility of a 'sequences' digest or a 'sorted_sequences' digest. See, for example:

Among other things, these have re-raised some questions, like: should names and lengths really be required? Should we relax the requirement/recommendation of a 'core' schema and just let people use whatever schema they want?

Basically after today's rousing discussion, I see 2 paths forward:

We can de-emphasize the shared schema. This will allow use cases that want to not require names, or make collections with sequences only, to proceed more readily. We were already going down this route, and indeed, the current spec already only lists the schema as RECOMMENDED, but we discussed going even further down that route. The downside is that now everyone would be more likely to just use a custom schema, and interoperability of top-level digests is lost, which feels like a significant loss.
We can introduce a new endpoint... (see new proposal below)

Proposal

If we add another endpoint, we could possibly better accommodate the sequence-only-digest use cases, while maintaining the desirable interoperability of top-level digests we get from a shared core schema. Here it is:

3.3 Attribute

Endpoint: GET /attribute/:attribute_name/:digest (RECOMMENDED?)
Description: Retrieves values of specific attributes in a sequence collection. Here :attribute_name is the name of the attribute, such as sequences, names, or sorted_sequences. :digest is the level 1 digest computed above.
Return value: The attribute value identified by the :digest variable. The structure of the return value should correspond to the value of the attribute in the canonical structure.

Example /attribute/lengths/:digest return value:

["1216","970","1788"]

Example /attribute/names/:digest return value:

["A","B","C"]

How this helps

With an attribute endpoint, use cases that have no need for names and lengths could just use level 1 digests for sequences (or sorted_sequences) as their primary use case, and these would be interoperable with sequence collection servers. You'd just implement this endpoint instead of the /collection endpoint, and not bother computing the top-level digests, if you had no need for that. If you needed to look up a digest in an external reference provider, you'd just use the /attribute endpoint instead of the /collection endpoint.

We could then move 'sequences' back to required for the core schema, and use this /attribute endpoint to solve the coordinate system problems as well. This would have the advantage of keeping the top-level digests more likely to be interoperable because they'd be more likely to follow the same schema.
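
To make the lookup concrete, here is a minimal in-memory sketch of what a server behind GET /attribute/:attribute_name/:digest might do. Everything here is an assumption for illustration: the AttributeStore class is hypothetical, the level 1 digest is taken to be sha512t24u over a canonical JSON serialization of the array, and json.dumps with sorted keys and compact separators only approximates RFC-8785 for simple strings and integers.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Approximation of RFC-8785, adequate for simple strings and integers.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")


class AttributeStore:
    """Hypothetical store keyed by (attribute name, level 1 digest)."""

    def __init__(self):
        self._values = {}

    def add(self, attribute_name: str, value: list) -> str:
        digest = sha512t24u(canonical_json(value))
        self._values[(attribute_name, digest)] = value
        return digest

    def get(self, attribute_name: str, digest: str) -> list:
        # Corresponds to GET /attribute/:attribute_name/:digest
        return self._values[(attribute_name, digest)]


store = AttributeStore()
names_digest = store.add("names", ["A", "B", "C"])
print(store.get("names", names_digest))
```

Note that nothing in this sketch needs a top-level digest, which is the point of the proposal for sequence-only use cases.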

Alphabet as inherent property of a sequence collection

Related issues: #16, #8 (specifically #8 (comment)).

This came up during the Sequence Collections call at Connect. I think that this is an important feature that would help us resolve a generalizable issue with the interpretation of sequences that is backwards compatible with refget.

Since @andrewyatz was kind enough to raise this on the call I thought I would create an issue here to track discussion on this specific feature request.

@sveinugu @nsheff

How to store and represent and compare non collated single value attributes in a sequence collection

I would like to store metadata attributes that only have a single value in a sequence collection.
How do we see them being represented in the JSON at level 1 and level 2?

Represent them similarly to other attributes

At level 2, are we storing them in a single value array anyway?

{
  ...
  "length": [10, 20],
  "single-value-attribute": ["test"]
}

or directly plain text ?

{
  ...
  "length": [10, 20],
  "single-value-attribute": "test"
}

Then at level 1 they would be digested similarly to the other attributes?

{
  ...
  "length": "8djrpzjdbsoeghbadoadq.",
  "single-value-attribute": "psuhfbsjwttzaywhdjsid"
}

Represent them as single value in every level

Alternatively, they could be exposed in plain text directly at level 1 and level 2 with no changes.
Level 2

{
  ...
  "length": [10, 20],
  "single-value-attribute": "test"
}

Then level 1

{
  ...
  "length": "8djrpzjdbsoeghbadoadq.",
  "single-value-attribute": "test"
}

Comparison

The comparison result seems highly dependent on the representation at level 2. If we choose to represent the value at level 2 as a single-value array, then the comparison can be done in the same way as with the other attributes.
Other representations might require different infrastructure.

Use cases

There are many use cases for single-value attributes, like assembly-accession or naming-authority.

But the one use case I have in mind is to store sorted-sequences as a single level 1 digest.
Since I won't need the detail of the sequences, which are already stored in the sequences attribute, I can relatively cheaply get an order-relaxed comparison on any attribute by comparing the level 1 digest without storing the underlying array.

Discussion on undigested attributes and sorted-name-length-pairs

Related discussions:

History

For years we've debated the question of whether sequence collections would be ordered or unordered. In #17 we determined that the digests will reflect order. However, it is still valuable to come up with a way to identify if two sequence collections are the same in content but in different order. While the comparison function can do this, it is not as simple as comparing two identical identifiers. After lots of debate both in-person and on GitHub (#5), we never came up with a satisfying way to do this. Here is a proposal that can solve this, and other problems, which has arisen in discussion with @sveinugu. This idea has a few components: 1. Undigested arrays; 2. Relaxing the 1-to-1 constraint imposed on default arrays; 3. A specific recommended undigested array named e.g. "names-lengths". In detail:

1. Undigested attributes

Undigested attributes are attributes of a sequence collection that are not considered in computing the top-level digest. We suggest formalizing this as a part of the specification. In the schema that describes the data model for the sequence collections, the attributes will be annotated as either "digested" or "undigested". Only attributes marked as 'digested' will be used to compute the identifier. Other attributes can be stored/returned or otherwise used just like the digested attributes, the only difference is that they do not contribute to the construction of the identifier. This is easy to implement; for returning values, you return the complete level 1 structure, but before computing a digest, you first filter down to the digested schema attributes.
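
Here is a minimal sketch of the filtering step described above, under some assumptions: the function name top_level_digest and the example digests are made up, the digest function is taken to be sha512t24u, and json.dumps with sorted keys stands in for proper canonicalization. The point is only that undigested attributes are dropped before the identifier is computed, so adding one does not change the identifier.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def top_level_digest(level1: dict, digested: list) -> str:
    """Digest only the attributes the schema marks as 'digested'."""
    filtered = {key: value for key, value in level1.items() if key in digested}
    # Stand-in for canonicalization: sorted keys, no whitespace.
    serialized = json.dumps(filtered, sort_keys=True, separators=(",", ":"))
    return sha512t24u(serialized.encode("utf-8"))


# Hypothetical level 1 structure; the digest strings are placeholders.
level1 = {
    "lengths": "placeholder-lengths-digest",
    "names": "placeholder-names-digest",
    "sequences": "placeholder-sequences-digest",
}
digested = ["lengths", "names", "sequences"]

before = top_level_digest(level1, digested)
# Adding an undigested attribute leaves the identifier unchanged.
level1["names-lengths"] = "placeholder-names-lengths-digest"
after = top_level_digest(level1, digested)
print(before == after)
```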

2. 1-to-1 constraint

All of our original arrays have the constraint that they are 1-to-1 with the other arrays, in order. In other words, the 3rd element of the names array MUST reference the 3rd element of the sequences array, etc. They must have the same number of elements and match order.

I propose this constraint be relaxed for custom arrays/attributes. In fact, this, too, could be specified in the JSON-schema, as order-length-index: true (or something). In other words, arrays that do align 1-to-1 with the primary arrays would be flagged, but not all attributes would have to. This would be useful in case users want to store some information about a sequence collection that is not an ordered array with 1 element matching each sequence, or for situations that need to re-order.

3. The `names-lengths` array/attribute

Next, we'll use both these features for a new proposal. An example of a useful undigested attribute that lacks the 1-to-1 constraint is the names-lengths array. It is made like this:

Lump together names and lengths from the primary arrays into an object, like {"length":123,"name":"chr1"}.
Canonicalize according to the seqcol spec.
Digest each name-length pair individually.
Sort the digests lexicographically.
Add as an undigested, non-1-to-1 array to the sequence collection.

If this array is digested it provides an identifier for an order-invariant coordinate system for this sequence collection. It doesn't actually affect the identity of the collection, since it's computed from the collection, so there's no point considering it in the level 0 digest computation -- that's why it should be considered undigested. It also does not depend on the actual sequence content, so it could be consistent across two collections with different sequence content, and therefore different identifiers. It lives under the names-lengths (or something) attribute on the level 1 sequence collection object, but it's not digested.
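
The steps above can be sketched in a few lines of Python. This is a sketch under assumptions, not the specified algorithm: sha512t24u is assumed as the digest, json.dumps with sorted keys and compact separators approximates the seqcol canonicalization for these simple objects, and the function names are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Approximate canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def names_lengths(names: list, lengths: list) -> list:
    """Digest each {"length", "name"} pair individually, then sort the digests."""
    pair_digests = [
        sha512t24u(canonical_json({"length": length, "name": name}))
        for name, length in zip(names, lengths)
    ]
    return sorted(pair_digests)


def coordinate_system_digest(names: list, lengths: list) -> str:
    """Digest the sorted array itself to get an order-invariant identifier."""
    return sha512t24u(canonical_json(names_lengths(names, lengths)))


a = coordinate_system_digest(["chr1", "chr2"], [100, 200])
b = coordinate_system_digest(["chr2", "chr1"], [200, 100])  # same pairs, reordered
c = coordinate_system_digest(["chr1", "chr2"], [200, 100])  # pairs differ
print(a == b, a == c)
```

Because each pair is digested before sorting, reordering the collection does not change the result, while swapping which length belongs to which name does.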

The term "undigested"

To clarify the digesting: in this case, I am actually proposing to digest this array itself in the same way that all the arrays get digested; but this array would not get included in level 1 object that digests to form the level 0 digest. So maybe "undigested" is not the right term. Ideas?

Add to spec?

There are lots of other non-digested arrays that could be useful. This particular one is pretty universal and useful, so it seems like it may be worth adding formally to the specification as a RECOMMENDED, but not required, undigested array.

What are example use cases?

The docs (from what I could find) do not provide common scenarios where collections are useful. This is probably self-evident to the team. However, for people that have less clear notions about how this might be used, it would be good to see some examples of the specification as intended by the developing team.

Am I missing this, or is this just yet to be documented?

How will the seqcol compatibility flags be encoded?

One of the use cases for sequence collections is to determine compatibility between two given sequence collections. Input is 2 sequence collections, and output is an assessment of compatibility between those sequence collections.

As a refresher from the use cases document:

As a user I want to compare the two sequence collections used by two separate analyses so I can understand how comparable and compatible their resulting data are. I may want to dial up and down the level of compatibility (sequence content, names, lengths).

Some examples of compatibility levels are:

Strict identity. For example, a bowtie2 index produced from one sequence collection that differs in any aspect (sequence name, order difference, etc), will not necessarily produce the same output. Therefore, we must be able to identify that two sequence collections are identical in terms of sequence content, sequence name, and sequence order.
Order-relaxed identity. A downstream process that treats each sequence independently and re-orders its results will return identical results as long as the sequence content and names are identical, even if the order doesn’t match. Therefore, we’d like to be able to say “these two sequence collections have identical content and sequence names, but differ in order”.
Name-relaxed identity. Some analyses (for example, a salmon alignment I believe) will be identical regardless of the chromosome names, as they consider the digest of the sequence only. Thus, we’d like to be able to say “These sequence collections have identical content, even if their names and/or orders differ.”
Length-only compatible. The weakest compatibility is two sequence collections that have the same set of lengths, though the sequences themselves may differ. In this case we may or may not require name identity. For example, a set of ATAC-seq peaks that are annotated on a particular genome could be used in a separate process that had been aligned to a different genome, with different sequences -- as long as the lengths and names were shared between the two analyses.

In a Python notebook I've demonstrated an implementation of this, which may give you an idea of how this works:

https://github.com/refgenie/seqcol/blob/master/advanced.ipynb

With a compare function implementation here if you're interested: https://github.com/refgenie/seqcol/blob/ff5769bf92a2da01b24d75fbff428a30709d1123/seqcol/seqcol.py#L71

The important component for discussion is: how will we encode compatibility? My proposal was to use a flag system (think SAM flags), so a bit vector indicates the result of a bunch of comparisons. Here's an example:

{1: 'CONTENT_ALL_A_IN_B',
2: 'CONTENT_ALL_B_IN_A',
4: 'LENGTHS_ALL_A_IN_B',
8: 'LENGTHS_ALL_B_IN_A',
16: 'NAMES_ALL_A_IN_B',
32: 'NAMES_ALL_B_IN_A',
64: 'TOPO_ALL_A_IN_B',
128: 'TOPO_ALL_B_IN_A',
256: 'CONTENT_ANY_SHARED',
512: 'LENGTHS_ANY_SHARED',
1024: 'NAMES_ANY_SHARED',
2048: 'CONTENT_A_ORDER',
4096: 'CONTENT_B_ORDER'}

I think with these flags, you can make any of the compatibility assessments listed above. But am I missing anything? It's open for discussion, what flags should we provide as part of the specification? How should we order them?
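
To show how such a bit vector could be produced and read, here is a small sketch using Python's IntFlag. It is not the compare function linked above: it only covers the set-based flags (ALL_A_IN_B, ALL_B_IN_A, ANY_SHARED) for content, lengths and names, and leaves out the topology and order flags because their exact semantics are not spelled out here. The class name Compat and the toy collections are hypothetical.

```python
from enum import IntFlag


class Compat(IntFlag):
    CONTENT_ALL_A_IN_B = 1
    CONTENT_ALL_B_IN_A = 2
    LENGTHS_ALL_A_IN_B = 4
    LENGTHS_ALL_B_IN_A = 8
    NAMES_ALL_A_IN_B = 16
    NAMES_ALL_B_IN_A = 32
    CONTENT_ANY_SHARED = 256
    LENGTHS_ANY_SHARED = 512
    NAMES_ANY_SHARED = 1024


def compare(a: dict, b: dict) -> Compat:
    """Set-based comparison of two collections; order is ignored."""
    flags = Compat(0)
    for key, label in (("sequences", "CONTENT"), ("lengths", "LENGTHS"), ("names", "NAMES")):
        set_a, set_b = set(a[key]), set(b[key])
        if set_a <= set_b:
            flags |= Compat[label + "_ALL_A_IN_B"]
        if set_b <= set_a:
            flags |= Compat[label + "_ALL_B_IN_A"]
        if set_a & set_b:
            flags |= Compat[label + "_ANY_SHARED"]
    return flags


# Toy collections: same content and lengths, different naming convention.
coll_a = {"sequences": ["s1", "s2"], "lengths": [10, 20], "names": ["chr1", "chr2"]}
coll_b = {"sequences": ["s1", "s2"], "lengths": [10, 20], "names": ["1", "2"]}

result = compare(coll_a, coll_b)
name_relaxed = Compat.CONTENT_ALL_A_IN_B | Compat.CONTENT_ALL_B_IN_A
print(int(result), (result & name_relaxed) == name_relaxed)
```

A consumer tests a compatibility level by masking, the same way SAM flags are read; the name-relaxed check above passes even though no NAMES_* bit is set.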

Consumption of seqcol into existing file formats

Speaking in the VRC/VCF meeting, the consensus was these flat file consumers would work directly with the collection header format rather than working with seqcol serialised into their native header format. The thinking was there is no point spending time encoding to decode from one format to another. Much faster to just consume the native seqcol header and use that. Will mean a breaking change in the formats.

Terminology round 2

In #25 we raised some ideas on terminology, but we want to revisit that decision. There are 2 motivating factors for revising:

With pangenomes it becomes possible to have a "level -1" representation, which feels awkward
We've come up with at least 3 ways to use level numbers to represent the different forms... none of them is immediately superior to the others, which indicates that level numbers do not carry intuitive meaning that clearly or even unambiguously refers to a particular form

We are thinking of switching to a name-based system, and one proposal is this:

complete form -- refers to the form with no digests; all information is present in the object (aka exploded).
canonical form -- refers to the primary form we use for sequence collections, in which sequences are digests and other elements are present. This is canonical because it's the default return value, it's what you'd provide to the POST endpoint, etc. It's what you think of as a SeqCol object. Previously called "level 2" for 2 database lookups.
compact form -- (or, maybe short form?) This is the key-value pair of seqcol attribute with its digest. It's very concise, good for transfer, etc. Previously called "level 1" for 1 database lookup.
seqcol digest form -- the top-level seqcol digest

For adding a new layer to accommodate pangenomes, complete and canonical forms don't change; we'd then have the seqcol compact form, then the seqcol digest form, and then finally the top level becomes the pangenome digest form.

Should we allow PUT sequence collection capability?

Should the /collection endpoint allow a PUT operation, as a way to add new collections to the database? I'm likely to implement this for my server, but is this outside the scope of the spec? This could be restricted to authorized users or something (require a bearer token).

The way I envision it, you would use HTTP PUT, and the request body would be a 'level 2 representation':

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "76f9f3315fa4b831e93c36cd88196480",
    "d5171e863a3d8f832f0559235987b1e5",
    "b9b1baaa7abf206f6b70cf31654172db"
  ]
}

It would add this collection to the database and return the resulting primary digest.

How should we specify the metadata endpoint?

The primary functions of seqcol are to 1) define unique identifiers for sequence collections; 2) provide a protocol to serve sequence collection data given the identifiers; and 3) provide a function for comparing compatibility among sequence collections.

An important ancillary function is to provide metadata associated with a particular sequence collection, like provider, version, or organism. How should we provide this data? The current proposal is:

There will be a /metadata/:seqcol_digest endpoint which returns an annotated JSON with all metadata for a given sequence collection.
There will be a /metadata-schema endpoint that provides a JSON-schema defining the allowed and required files for
Each server of sequence collections would define their schema.
The protocol provides a base schema with some fundamental metadata, which can be extended by particular servers

Does that seem reasonable? If so, what is the base information that should be included in the base schema? In other words, let's define the core JSON-schema.

Here's a proposal for a JSON-schema that could define a base set of metadata fields:

description: "Schema for sequence collection metadata"
type: array
items: 
  type: object
  properties:
    source:
      type: string
      description: The entity that produced this collection.
    organism:
      type: string
      description: Identifier from the NCBI Taxonomy ontology.
    version:
      type: string
      description: Optional, an identifier for the release version of this collection.
    aliases:
      type: array
      description: A list of human-readable identifiers used to refer to this collection.
      items:
        type: string
  required:
    - source
    - organism
    - aliases

This means the /metadata/:seqcol_digest would return an array of what we might call "metadata packages", where each package must contain "source", "organism", and "aliases", and may contain "version". The rationale behind making this provide an array of "packages" instead of just one package is that multiple providers may provide the same collection, and annotate it in different ways, and this approach keeps their metadata separate.

Perhaps it makes sense to use a simple ontology (or at least controlled vocabulary) for providers, and use those terms in the source field. If we did that, then the metadata endpoint could be qualified by a provider identifier, so you could retrieve only the metadata package specified by a particular provider. I'm not sure going to this level of complexity is really warranted though.

RFC-8785 and refget compatibility

I've been implementing the spec as it currently stands and came across an issue with our decision to use RFC-8785 JSON canonicalization for the string-to-digest (#34).

The issue is this: JSON canonicalization makes the refget protocol not compliant with the seqcol digest algorithm. It's not strictly required that it be, but it was nice that it was. In other words, when we switch to RFC-8785, then if the sequences themselves were considered under the purview of seqcol, we should quote the strings before digesting them. So, the sequence ATGC should first be serialized into "ATGC" and then digested; but the refget protocol digests the raw string.

What this means is that the sequence collection, if relying on refget, wouldn't be strictly following RFC-8785 canonicalization for the whole thing; it would only be doing it for the parts external to refget.
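
The mismatch is easy to see with a small sketch. It assumes sha512t24u as the digest function, and relies on the fact that for a plain ASCII string like ATGC, json.dumps produces the same quoted form that RFC-8785 would. The variable names are only illustrative.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


sequence = "ATGC"

# refget-style: digest the raw sequence string.
raw_digest = sha512t24u(sequence.encode("ascii"))

# JSON-canonicalized: the primitive is serialized with quotes first.
quoted = json.dumps(sequence)
quoted_digest = sha512t24u(quoted.encode("utf-8"))

print(quoted)                       # the 6-character string "ATGC", quotes included
print(raw_digest == quoted_digest)  # the two digests differ
```

Options 1 and 2 below both amount to keeping the first form for sequences and applying the second only to the arrays and objects built on top of them.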

I see a few ways forward, just to put everything out there:

Just acknowledge that refget is not compliant, and say RFC-8785 only applies to the layers above the sequence digests.
Tweak JSON canonicalization to only apply to objects/arrays, not primitives.
Don't do JSON canonicalization.
Switch refget to add quotes around sequences

I think we'll probably end up with either solution 1 or 2...

List endpoint and pagination

It's come up repeatedly that we will need some kind of list endpoint that will provide information about the available objects in the server. This will be required for a meta- aggregator-type service that would span services, to be aware of what each service provides.

Because there could be a large number of collections, the result must be paged. We intend to follow the GA4GH paging guidelines (which are still being developed).

Proposed API

GET /list with query parameters: ?page_token=abc123&page_size=1000
page_token would default to None, which indicates starting from the most recent collection
page_size could have a server-defined default, with a recommendation of 1000?

Proposed return value

Here's a proposal for return value of list endpoint, using token-based paging. The terms are following a google standard for paging:

{
	"total_size": len(self.database),
	"page_size": page_size,
	"page_token": page_token,
	"next_page_token": "", 
	"items": ["xyz123", "abc456", ...]
}

items or collection_identifiers or collection_digests ?
should total_size be optional?
should prev_page_token be optional?
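
For discussion, here is a minimal sketch of how a server could build that return value. It makes several assumptions that are not part of the proposal: the function name list_collections is hypothetical, the page token is simply the string form of an offset into an in-memory list (a real server would likely use an opaque token), and an empty next_page_token signals the last page.

```python
from typing import List, Optional


def list_collections(digests: List[str], page_token: Optional[str] = None, page_size: int = 1000) -> dict:
    """Return one page of collection digests using token-based paging."""
    # Assumption: the token encodes an offset; None means start at the beginning.
    start = int(page_token) if page_token else 0
    end = start + page_size
    return {
        "total_size": len(digests),
        "page_size": page_size,
        "page_token": page_token,
        # Empty string means there are no further pages.
        "next_page_token": str(end) if end < len(digests) else "",
        "items": digests[start:end],
    }


database = ["xyz123", "abc456", "def789", "ghi012", "jkl345"]

# Walk through all pages, two items at a time.
token = None
pages = []
while True:
    page = list_collections(database, page_token=token, page_size=2)
    pages.append(page["items"])
    if not page["next_page_token"]:
        break
    token = page["next_page_token"]
print(pages)
```

A client never has to interpret the token; it only passes next_page_token back until it comes back empty.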

API thoughts

Our current API uses:

/collection
/comparison

Should this be: /list_collections, or /list, or simply /collections or /collections/list ?

What should be allowable characters in array names?

The sequence collection is a group of named arrays. These array names include built-in, defined arrays, like names, lengths, and sequences, but users may also use custom array names. Our spec-defined array names are all lowercase ascii characters, but this doesn't mean we must restrict custom array names in the same way.

Should custom array names be restricted in some way? We want to balance 3 desirable properties: 1) interoperable; 2) flexible; 3) easy-to-implement, but can't come up with a solution that excels at all 3 of these. We have come up with 3 options, each one prioritizing two of the desirable properties:

Option 1

Make it required that custom array names be ascii characters (maybe even lowercase ascii characters).
✔️ Interoperable (because all servers will use the same standard)
✔️ Easy to implement (no questions about UTF sorting or encoding, can use array names easily in API endpoint or database names; specification is simpler)
❌ LESS flexible (Users can't use non-English array names)

Option 2

Make it recommended that custom array names be ascii characters, but allow users to use UTF-8 characters as an extension to the spec. Implementing UTF-8 will not be required for an implementation.
❌ LESS interoperable (servers that don't implement UTF-8 will not be completely compatible with any that do)
✔️ Easy to implement (you don't have to implement UTF-8 unless you need it)
✔️ Flexible (users who want non-English array names can use them)

Option 3

Make UTF-8 the specification, so it's required to implement support for UTF-8 array names.
✔️ Interoperable (all servers will be required to allow UTF-8 array names)
❌ LESS easy to implement (implementations must consider sorting algorithm, array name normalization, and can't use array names directly as URL endpoints, database identifiers; specification is more complex)
✔️ Flexible (anyone can use non-English array names)

Feedback or corrections appreciated!

New schema term: accessions

Raised by Rasko

I suggest that we add an 'accessions' 'standard' attribute (with a prefix denoting the authority), as this will be handy for established archives.

Define what the service info will contain

The service-info for sequence collections will need to inherit from this specification but additional fields can be added.
This issue is to discuss what are the fields that should be declared in the seqcol's service info.

For example, we could add the seqCol schema in the service-info.

What is the hashing algorithm, and will there be one or multiple?

The seqcol spec relies on a hashing algorithm to compute digests. These are digests of refget digests, so it makes sense to align the hashing algorithm with the one used by refget. Right now, refget allows 2 options, md5 or TRUNC512. As I understand it, TRUNC512 is tweaked slightly into a "GA4GH identifier".

@andrewyatz put it like this:

TRUNC512:
Normalise seq -> sha-512 -> take first 24 bytes -> encode into hex

GA4GH identifier:
Normalise seq -> sha-512 -> take first 24 bytes -> base64 url encode -> prefix "ga4gh:SQ." to the encoding

The two identifiers are the same, the only difference being GA4GH was taken up by the VR group in GKS. So if we want to produce identifiers which VR can use for their statements, we need to support the GA4GH identifier and since they're both the same "thing" under the hood we can deprecate trunc512 in favour of ga4gh (plus you can convert on the fly between the two).
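
The "convert on the fly" point can be sketched as follows. This assumes the truncation is to the first 24 bytes of the SHA-512 digest (as in sha512t24u), and it simplifies normalisation to upper-casing the sequence. The function names are made up for illustration.

```python
import base64
import hashlib

PREFIX = "ga4gh:SQ."


def _truncated_sha512(sequence: str) -> bytes:
    # Simplified normalisation (assumption): upper-case only.
    return hashlib.sha512(sequence.upper().encode("ascii")).digest()[:24]


def trunc512(sequence: str) -> str:
    """Hex encoding of the truncated digest (48 characters)."""
    return _truncated_sha512(sequence).hex()


def ga4gh_identifier(sequence: str) -> str:
    """base64url encoding of the same bytes, with the ga4gh:SQ. prefix."""
    return PREFIX + base64.urlsafe_b64encode(_truncated_sha512(sequence)).decode("ascii")


def trunc512_to_ga4gh(hex_digest: str) -> str:
    return PREFIX + base64.urlsafe_b64encode(bytes.fromhex(hex_digest)).decode("ascii")


def ga4gh_to_trunc512(identifier: str) -> str:
    # Everything after the first "." is the base64url payload.
    return base64.urlsafe_b64decode(identifier.split(".", 1)[1]).hex()


hex_form = trunc512("ACGT")
id_form = ga4gh_identifier("ACGT")
print(trunc512_to_ga4gh(hex_form) == id_form, ga4gh_to_trunc512(id_form) == hex_form)
```

Both forms carry the same 24 bytes, so the conversion needs no access to the sequence itself.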

So, it seems clear that we will base seqcol on this GA4GH identifier, but the question is: what should we do about md5 digests? Are md5s so deeply embedded that we should continue to allow them as an option? Do we:

Make it so that you can use either GA4GH digests or md5 digests to look up sequence collections?

or,

Allow only GA4GH digests?

If we do choose to allow either digest type, then do we make separate endpoints for each digest type, or do we have just a single endpoint that can accept either type of digest? I'm not sure I see the value of separate endpoints. From the perspective of the lookup, which algorithm was used to create the digest is irrelevant -- it simply enters the digest as a key that maps to some value in a database. It is also possible to infer the digest type from the length.

What information is included within the string-to-digest?

We need to confirm the final contents of the string to digest. Right now, I think we had settled on these elements:

name (e.g. chr1)
length
topology
sequence digest (refget digest)

constraints:

Required: name, length, topology (sequence digest may be left blank)
topology must be one of: 'linear' or 'circular'
length must be an integer
name must be a string, subject to the constraints determined in issue #2

Related to this, I initially described this with a JSON-schema that I use in my implementation, but this is a bit more complicated because it encodes the recursion (the fact that the sequence digest element is itself a top-level refgettable element). But, for reference, if anyone's interested, it's here:

https://github.com/refgenie/seqcol/blob/master/seqcol/schemas/AnnotatedSequenceList.yaml

Open for discussion.

Test suite?

I’m trying to build a (minimal) seqcol implementation in rust here. Right now, the interface and API are as minimal as can be for what we need (building a seqcol digest from a SAM header for our long read quantification tool, oarfish). However, it may be generally useful to the community to have such a library in rust as that language continues to quickly gain adoption in bioinformatics.

To this end, is there some sort of “test suite” against which one can run an implementation to check compatibility with the reference implementation?

Thanks!
Rob

Cc @mikelove

Identifier construction: To prefix or not to prefix

On 2022-09-21 we debated how to actually form the identifiers. Like, is there a <prefix>, and/or a <type_prefix>, and are these modifiers used just for returning identifiers, or are they actually digested, since our protocol involves digesting digests.

Here are some thoughts:

I think it will be useful to disambiguate the terms identifier and digest; then: identifier = <prefix>:<type_prefix>.<digest>
we should probably include prefixes for the "sequence_digests" array
the "sequence_digests" array should therefore refer to identifiers, rather than digests, and then probably renamed to "sequence_identifiers"
we should probably also include prefixes/type_prefixes for the level 1 digest algorithm, but then we have to define these type prefixes for each array.
it seems like the definition of the type prefixes should happen by an authority at a level higher than our working group.
are the prefixes actually added before the digest, or just returned to the user? There is not really any identifiability value added in actually digesting them. Could they just be affixed at the return/display stage? This makes the specification more universal.
Would prefixes be required or optional for the input from a user, when requesting a lookup given an identifier/digest?
