Related discussions:
History
For years we've debated the question of whether sequence collections would be ordered or unordered. In #17 we determined that the digests will reflect order. However, it is still valuable to come up with a way to identify if two sequences collections are the same in content but in different order. While the comparison function can do this, it is not as simple as comparing two identical identifiers. After lots of debate both in-person and on GitHub (#5), we never came up with a satisfying way to do this. Here is a proposal that can solve this, and other problems, which has arisen in discussion with @sveinugu. This idea has a few components: 1. Undigested arrays; 2. Relaxing the 1-to-1 constraint imposed on default arrays; 3. A specific recommended undigested array named e.g. "names-lengths". In detail:
1. Undigested attributes
Undigested attributes are attributes of a sequence collection that are not considered in computing the top-level digest. We suggest formalizing this as a part of the specification. In the schema that describes the data model for the sequence collections, the attributes will be annotated as either "digested" or "undigested". Only attributes marked as 'digested' will be used to compute the identifier. Other attributes can be stored/returned or otherwise used just like the digested attributes, the only difference is that they do not contribute to the construction of the identifier. This is easy to implement; for returning values, you return the complete level 1 structure, but before computing a digest, you first filter down to the digested schema attributes.
2. 1-to-1 constraint
All of our original arrays have the constraint that they are 1-to-1 with the other arrays, in order. In other words, the 3rd element of the names array MUST reference the 3rd element of the sequences array, etc.. They must have the same number of elements and match order.
I propose this constraint be relaxed for custom arrays/attributes. In fact, this, too, could be specified in the JSON-schema, as order-length-index: true
(or something). In other words, arrays that do align 1-to-1 with the primary arrays would be flagged, but not all attributes would have to. This would be useful in case users want to store some information about a sequence collection that is not an ordered array with 1 element matching each sequence, or for situations that need to re-order.
3. The names-lengths
array/attribute
Next, we'll use both these features for a new proposal. An example of a useful undigested attributed that lacks the 1-to-1 constraint is the names-lengths
array. This is made like this:
- Lump together names and lengths from the primary arrays into an object, like
{"length":123,"name":"chr1"}
.
- Canonicalize according to the seqcol spec.
- Digest each name-length pair individually.
- Sort the digests lexographically.
- Add as an undigested, non-1-to-1 array to the sequence collection.
If this array is digested it provides an identifier for an order-invariant coordinate system for this sequence collection. It doesn't actually affect the identity of the collection, since it's computed from the collection, so there's no point considering it in the level 0 digest computation -- that's why it should be considered undigested. It also does not depend on the actual sequence content, so it could be consistent across two collections with different sequence content, and therefore different identifiers. It lives under the names-lengths
(or something) attribute on the level 1 sequence collection object, but it's not digested.
The term "undigested"
To clarify the digesting: in this case, I am actually proposing to digest this array itself in the same way that all the arrays get digested; but this array would not get included in level 1 object that digests to form the level 0 digest. So maybe "undigested" is not the right term. Ideas?
Add to spec?
There are lots of other non-digested arrays that could useful. This particular one is pretty universal and useful, so it seems like it may be worth adding formally to specification as a RECOMMENDED, but not required, undigested array.