I've been working on a C++ parser for GF

Strict ordering in GFA (?) about gfa-spec HOT 25 CLOSED

gfa-spec commented on September 24, 2024

Strict ordering in GFA (?)

from gfa-spec.

Comments (25)

pmelsted commented on September 24, 2024

There is no fixed ordering so a parser will have to accept any possible order.

The only exception would perhaps be the header line; it would be sort of rude not to have it as the first line, but that's not in the spec yet.
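
To illustrate what accepting any order can look like in a parser, here is a minimal Python sketch. It is not taken from any existing GFA library; the function name `parse_gfa_any_order` is hypothetical, and it assumes tab-separated GFA1-style `S` and `L` lines. All records are read first and links are resolved afterwards, so the position of a line in the file does not matter.

```python
from collections import defaultdict


def parse_gfa_any_order(lines):
    """Group records by type first, then resolve links against segments."""
    records = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        records[fields[0]].append(fields)

    # Segments are indexed only after the whole input has been read,
    # so a link may appear before the segments it refers to.
    segments = {f[1]: f[2] for f in records["S"]}
    links = []
    for f in records["L"]:
        src, dst = f[1], f[3]
        if src not in segments or dst not in segments:
            raise ValueError(f"link refers to undefined segment: {src} -> {dst}")
        links.append((src, f[2], dst, f[4]))
    return segments, links


# The link and the header deliberately appear out of the "natural" order.
gfa = [
    "L\t1\t+\t2\t-\t4M",
    "S\t2\tTTGA",
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
]
segments, links = parse_gfa_any_order(gfa)
```

The cost of this approach is that the whole file is held in memory before any link is checked.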

edawson commented on September 24, 2024

I guess I was more concerned about output, so that's fine then. It would indeed be a bit coarse to put a header line anywhere but the top of the file.

Thanks!

sjackman commented on September 24, 2024

ABySS requires that the segments incident to a link be given before the link that refers to those segments. I'd like to see that be a requirement of the spec. My preference is that the records are grouped together by type in the order H, S, L then P. I haven't thought about the other record types. It would have been nice if the record types were in alphabetical order. Oh well.

pmelsted commented on September 24, 2024

I think it makes sense not to allow links to refer to segments that haven't been defined yet, and similarly for paths. A stronger version of this is to specify everything in a strict order: H, S, L, P.

It makes the parsing less painful and shouldn't be a burden on producing the output.

ekg commented on September 24, 2024

But enforcing it in the schema seems overly heavy-handed. In any case, we can fix the ordering via a sort. I would suggest we leave the order unspecified.

sjackman commented on September 24, 2024

I propose that

The definition of segments should precede the definition of the links that are incident to those segments.
The definition of segments and links should precede the definition of paths that reference those segments and links.

Anyone disagree?
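
As a rough illustration of what this proposal would let a tool check, here is a minimal Python sketch that reports links and paths mentioning a segment before its `S` line. The function name `check_definition_order` is hypothetical, the field layout assumes GFA1-style tab-separated lines, and the second half of the proposal (links preceding the paths that use them) is not checked here.

```python
def check_definition_order(lines):
    """Return (line_number, message) for each use of a not-yet-defined segment."""
    seen_segments = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        kind = fields[0]
        if kind == "S":
            seen_segments.add(fields[1])
        elif kind == "L":
            # Fields 1 and 3 hold the two segment names of a link.
            for name in (fields[1], fields[3]):
                if name not in seen_segments:
                    problems.append((number, f"link uses segment {name} before its S line"))
        elif kind == "P":
            # Path steps look like "11+,12-"; the last character is the orientation.
            for step in fields[2].split(","):
                name = step[:-1]
                if name not in seen_segments:
                    problems.append((number, f"path uses segment {name} before its S line"))
    return problems


example = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "L\t1\t+\t2\t+\t0M",
    "S\t2\tGGCA",
    "P\tp1\t1+,2+\t0M",
]
problems = check_definition_order(example)
```

Here the link on line 3 is flagged because segment 2 is only defined on line 4; the path on line 5 passes.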

ekg commented on September 24, 2024

I dissent. I think we should not impose an ordering in the spec. It will only cause pain and has no effect on the semantic content of what is transmitted.

sjackman commented on September 24, 2024

Okay. How about adding it to the spec as a should rather than a must, as a recommendation rather than a requirement? (as per https://www.ietf.org/rfc/rfc2119.txt)

sjackman commented on September 24, 2024

Even when it doesn't affect the semantic content, I think making the format as easy to parse as possible for implementers is an important consideration.

edawson commented on September 24, 2024

I think it's certainly more convenient to parse things if we ensure that one segment of a link is given before the link (e.g. a link's source is always given before that link). But I'm not sure I like requiring both ends to be defined beforehand. This does get complicated with inversions. This advantage goes away if we parse things to a map beforehand anyway.

I'm strongly opposed to the HSLP ordering. While it contains all the elements and is easy to parse by machines, to me it's harder for a human being to parse. It destroys the "graphiness" of GFA by decomposing the graph into loose sets of elements. It's nice to find a segment in the file and immediately see if it's highly connected based on how many link lines immediately follow its definition and whether it is on any paths.

edawson commented on September 24, 2024

Also sorting on the source nodes of elements (S, C, L, P lines) is nice - it's easy enough to enforce alphanumeric sorting and it might simplify random access.

sjackman commented on September 24, 2024

SAM allows specifying in the header whether the file is ordered or not. That makes sense here too.

edawson commented on September 24, 2024

Good point - if there's an equivalent to samtools sort for GFA (and a well-defined sorting order), it seems reasonable to allow both.

sjackman commented on September 24, 2024

There is no gfatools sort yet, but I think we'll need one for certain for random access.

edawson commented on September 24, 2024

I'm on it - I'll push something here today. Unfortunately my JavaScript is too rusty to push it to gfatools proper.

sjackman commented on September 24, 2024

I regularly sort SAM files with the UNIX sort utility. It would be nice if GFA could be sorted with the UNIX sort utility. It's unfortunate that there's not a clear way to specify the sort order of the record types to UNIX sort.

sjackman commented on September 24, 2024

You could do a hacky thing where you change H to 0, S to 1, L to 2, etc, then sort, then convert back, but that hackery kind of defeats the benefit of using an off the shelf tool.
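
For comparison, the same rank-then-sort idea is short to express outside of UNIX sort. This is a minimal Python sketch, not an existing tool; the name `sort_by_record_type` and the rank table are assumptions based on the H, S, L, P order mentioned earlier in this thread, and any other record type is placed last.

```python
# Assumed ranks, following the H, S, L, P order discussed above.
RECORD_RANK = {"H": 0, "S": 1, "L": 2, "P": 3}


def sort_by_record_type(lines):
    """Sort by record-type rank first, then by plain string order of the line."""
    unknown_rank = len(RECORD_RANK)  # unlisted record types sort after P
    return sorted(lines, key=lambda line: (RECORD_RANK.get(line[:1], unknown_rank), line))


unsorted_gfa = [
    "P\tp\t1+,2+\t*",
    "L\t1\t+\t2\t+\t0M",
    "S\t2\tGG",
    "H\tVN:Z:1.0",
    "S\t1\tAC",
]
sorted_gfa = sort_by_record_type(unsorted_gfa)
```

The sort key plays the role of the temporary 0, 1, 2 prefix, so no convert-back step is needed.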

ekg commented on September 24, 2024

I have typically sorted on the second column with sort -k2. This yields an
output in which local regions of the graph tend to occur together.
Although, I tend to use numerical IDs, which probably changes things.

sjackman commented on September 24, 2024

I also tend to use numerical IDs, but for simplicity I think the specified ordering should be ASCII ordered. You can always pad with 0s at the left to get consistent numerical ordering.
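
A small Python sketch of the padding point, with a hypothetical helper `pad_numeric_ids`: plain string ordering puts "10" before "9", while left-padding with zeros to a common width makes string order agree with numeric order.

```python
def pad_numeric_ids(ids, width=None):
    """Left-pad numeric ID strings with zeros to a common width."""
    if width is None:
        width = max(len(i) for i in ids)
    return [i.zfill(width) for i in ids]


ids = ["10", "9", "100", "2"]

ascii_order = sorted(ids)                    # ['10', '100', '2', '9']
padded_order = sorted(pad_numeric_ids(ids))  # ['002', '009', '010', '100']
```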

edawson commented on September 24, 2024

There's now a gfa_sort utility in gfakluge.

Alphanumeric sort on seg IDs, non-blocked:
./gfa_sort -i reads.gfa or ./gfa_sort --gfa-file reads.gfa

Blocked (HSLC) format and alphanumeric sort in each block:
./gfa_sort -i reads.gfa -b or ./gfa_sort --gfa-file reads.gfa --block-order

Block-ordered input/output is in the core lib now too.

lh3 commented on September 24, 2024

I also prefer NOT to impose ordering, except the header line. Having S lines above L lines does make programming easier, but not having this requirement is not too bad.

ggonnella commented on September 24, 2024

A general strict ordering may not be necessary, but I think we should still mandate that the header be at the top. For this reason I opened a different issue on that.

edawson commented on September 24, 2024

It should be possible to sort GFA without knowing the spec version; then the header lines should float to the top, and it should be easy to determine the VN from there. That being said, I agree with the idea posed in #39 that it is nice to have the H VN as the first line in the file.
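
A minimal Python sketch of locating the version when the header is not guaranteed to come first. The function name `find_gfa_version` is hypothetical; it assumes the version is carried as a `VN:Z:` tag on an `H` line and scans until it finds one.

```python
def find_gfa_version(lines):
    """Return the value of the first VN:Z: tag on any H line, or None."""
    prefix = "VN:Z:"
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue
        for tag in fields[1:]:
            if tag.startswith(prefix):
                return tag[len(prefix):]
    return None


version = find_gfa_version(["S\t1\tACGT", "L\t1\t+\t1\t-\t0M", "H\tVN:Z:1.0"])
```

In the worst case this reads the whole input before the version is known, which is why having the header on the first line is convenient.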

ggonnella commented on September 24, 2024

Yes, of course it is still possible to look for the VN in the file before really parsing it, but I think this should not be required. For the current files, I can parse the file in one pass (by introducing some "virtual" lines, such as S lines that I expect but that have not come up yet), but if I have to look for the VN first, this will no longer be possible.
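
The "virtual" line idea can be sketched roughly as follows. This is an assumed reading of the approach, not the actual implementation referred to above: the name `parse_one_pass` is hypothetical, and a segment mentioned by a link before its `S` line is stored as a placeholder (`None`) that a later `S` line fills in.

```python
def parse_one_pass(lines):
    """Single pass; segments referenced before definition become placeholders."""
    segments = {}  # name -> sequence, or None while the segment is still virtual
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]  # replaces a placeholder if present
        elif fields[0] == "L":
            for name in (fields[1], fields[3]):
                segments.setdefault(name, None)  # create a virtual segment if needed
            links.append((fields[1], fields[2], fields[3], fields[4]))
    # Anything still None at the end was referenced but never defined.
    unresolved = sorted(name for name, seq in segments.items() if seq is None)
    return segments, links, unresolved


segments, links, unresolved = parse_one_pass([
    "L\t1\t+\t2\t+\t0M",
    "S\t1\tACGT",
])
```

Segment 1 starts out virtual and is resolved by its later `S` line; segment 2 is never defined and ends up in `unresolved`.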

stale commented on September 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
