Header: same tag on multiple lines? about gfa-spec HOT 15 CLOSED

gfa-spec commented on September 24, 2024

Header: same tag on multiple lines?

from gfa-spec.

Comments (15)

sjackman commented on September 24, 2024

My suggested wording:

The first line of the file must be a header H record and contain a version VN:Z field.
A file may contain multiple header H records.
Any particular tagged field must not be present in more than one header H record.

I'm fine with your proposal, @ggonnella. Everyone else? Please indicate with 👍 and 👎 whether you prefer
A. #38 (comment) The tagged fields present in any two header H records must be identical.
B. #38 (comment) Any particular tagged field must not be present in more than one header H record.

edawson commented on September 24, 2024

I can now imagine this coming up quite often, especially for lines describing file provenance like the PG line from SAM/BAM.

We didn't implement the ability to parse multiple same-tagged header lines in vg's parser because we don't make GFA files that contain them, but we'll fix that now. Thanks for the heads up!

richarddurbin commented on September 24, 2024

Yes. I think we should allow multiple uses of the same tag at the syntactic level. For some this is natural, e.g. provenance/history.
For others which might set a global parameter, it will make less sense to have multiple versions, and the semantics are naturally
to use the last version. My implementation stores the whole array of tag-value pairs, and allows retrieval of all such pairs with a
particular tag in order of having been read/created.
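To illustrate the storage scheme described above, here is a minimal Python sketch. It is not the implementation mentioned in the comment: the `HeaderTags` class and its method names are hypothetical, and it assumes header fields are tab-separated and shaped like `TAG:TYPE:VALUE`.

```python
from collections import namedtuple

# One tagged field from an H record, e.g. "TS:i:100".
TagValue = namedtuple("TagValue", ["tag", "type", "value"])


class HeaderTags:
    """Hypothetical store that keeps every header tag in read order."""

    def __init__(self):
        self._pairs = []

    def add_header_line(self, line):
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            raise ValueError("not a header record: %r" % line)
        for field in fields[1:]:
            # Split at most twice so values may themselves contain ':'.
            tag, typ, value = field.split(":", 2)
            self._pairs.append(TagValue(tag, typ, value))

    def values_for(self, tag):
        """All values recorded for a tag, in the order they were read."""
        return [p.value for p in self._pairs if p.tag == tag]

    def last(self, tag):
        """The 'last one wins' reading for tags that set a global parameter."""
        values = self.values_for(tag)
        return values[-1] if values else None
```

With this layout a provenance-style tag can be read back as a full list through `values_for`, while a parameter-style tag can be read through `last`.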

lh3 commented on September 24, 2024

I think we should 1) only allow one header line in an entire GFA file; 2) disallow duplicated tag names per line.

richarddurbin commented on September 24, 2024

Why?

This is not what is done for SAM/BAM and VCF as far as I know. They both typically use multiple header lines I believe. Or would you say those are not header lines?

lh3 commented on September 24, 2024

SAM and VCF have only two types of lines: header lines and data lines. We have to squeeze a variety of information into the header. GFA is different. It has multiple types of specialized lines. There is little header information to keep other than the version number. And even this number is not always necessary.

sjackman commented on September 24, 2024

There are two Ruby GFA libraries. Perhaps you both would like to collaborate?

sjackman commented on September 24, 2024

> I think we should 1) only allow one header line in an entire GFA file

If we allow multiple header lines, it's a nice feature that cat foo.gfa bar.gfa results in valid GFA (when there are no conflicting IDs) that is the union of the two graphs. As long as the header lines agree with each other (the same tags have the same values), I don't see any problem in accepting that as valid GFA.

The ABySS pipeline for example would have a GFA file of segments and links output by AdjList, another file of gap distance estimates output by DistanceEst, and a third file of paths output by SimpleGraph and MergePaths. Concatenating those three files would produce a single GFA file that contains all of the information known to ABySS at that point in the assembly.

The following example uses a hypothetical km:i tag that is the k-mer size of the de Bruijn graph.

Valid

H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:64
…

Invalid

H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:96
…

sjackman commented on September 24, 2024

How about this proposal?

1. The first header H record must contain the VN:Z field.
2. A file may contain multiple header H records.
3. The tagged fields present in any two header H records must be identical.

ggonnella commented on September 24, 2024

@sjackman I agree on 1, I am neutral on 2, and don't like 3 very much. I think that having a file format which allows several identical header lines is not very elegant. Concatenating files with identical H lines on the command line is still simple enough (e.g. grep -v '^H' file2 | cat file1 - ).
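
The same concatenation can be sketched in Python. This is only an illustration under stated assumptions: the function name is hypothetical, each file is given as a list of lines, and a header record is recognised by its first tab-separated field being exactly `H`, so that a segment named, say, `H2` is not dropped by accident.

```python
def concat_gfa(first_file_lines, *other_files_lines):
    """Concatenate GFA files, keeping H records only from the first file."""

    def is_header(line):
        # The record type is the first tab-separated field of the line.
        return line.rstrip("\n").split("\t", 1)[0] == "H"

    merged = list(first_file_lines)
    for lines in other_files_lines:
        merged.extend(line for line in lines if not is_header(line))
    return merged
```

This assumes the header lines of the later files are identical to those of the first file. If they are not, the information in them is silently lost.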

My proposal:

1. The first line of the file must be an H record and contain a VN:Z field
2. A file may contain multiple H records; however, the tags in different H records must all be different from each other

Point 2 allows for flexibility in formatting the data, but still removes the ambiguity caused by different values of the tags in different H lines. If one wants to store multiple values associated with the same key, we have eg B or J tags for this.
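
A validator for this proposal might look like the following minimal Python sketch. Both function names are hypothetical, fields are assumed to be tab-separated, and a tag repeated within a single H line is deliberately not reported, since the proposal only speaks about different H records.

```python
def first_line_declares_version(lines):
    """True if the first line is an H record carrying a VN:Z field."""
    if not lines:
        return False
    fields = lines[0].rstrip("\n").split("\t")
    return fields[0] == "H" and any(f.startswith("VN:Z:") for f in fields[1:])


def tags_in_multiple_headers(lines):
    """Return the tags that appear in more than one H record, in read order."""
    first_seen = {}   # tag -> ordinal of the first H record that used it
    repeated = []
    header_index = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue
        header_index += 1
        for field in fields[1:]:
            tag = field.split(":", 1)[0]
            if tag in first_seen and first_seen[tag] != header_index:
                if tag not in repeated:
                    repeated.append(tag)
            else:
                first_seen.setdefault(tag, header_index)
    return repeated
```

A file would satisfy the proposal when the first function returns `True` and the second returns an empty list.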

ggonnella commented on September 24, 2024

PS 👎 in the previous comment means I like the B., I hope I got it right (?)

sjackman commented on September 24, 2024

Perhaps thumbs are too confusing when there are multiple options. Just comment here and I'll tally them up.

@edawson @lh3 @richarddurbin Do you prefer either of the two proposals, A or B?

richarddurbin commented on September 24, 2024

I think that the idea that we will not want to have any meta-information in the header is wrong. For example, Gene wants to set the tracepoint spacing, with TS:i:n. I want to be able to write the history of commands used to generate the file, for example with a series of HS:Z: entries. We might want to say where external identifiers can be found. In my experience, all serious file formats including GFF, BAM, VCF, MPEG, etc. end up with metadata in their headers. Certainly for GFA2 we will support multiple header lines. I am happy with a requirement that the header lines come first and the first of these specifies the version, though I note that unfortunately this means that @sjackman's nice idea that you can cat two gfa files together to get a new one won't work, at least not without being followed by a non-standard sort.

sjackman commented on September 24, 2024

Hi, Richard.

> I think that the idea that we will not want to have any meta-information in the header is wrong.

I'm not saying that at all. I'm all for metadata in header lines. Both of the proposals above allow for multiple header lines with metadata in them. They exclude having multiple tags with different values. Multiple H TS:i lines with different values, for example, would be problematic.

stale commented on September 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
