Comments (15)

sjackman commented on September 24, 2024

My suggested wording:

  1. The first line of the file must be a header H record and contain a version VN:Z field.
  2. A file may contain multiple header H records.
  3. Any particular tagged field must not be present in more than one header H record.

I'm fine with your proposal, @ggonnella. Everyone else? Please indicate with 👍 and 👎 whether you prefer
A. #38 (comment) The tagged fields present in any two header H records must be identical.
B. #38 (comment) Any particular tagged field must not be present in more than one header H record.

from gfa-spec.

edawson commented on September 24, 2024

I can now imagine this coming up quite often, especially for lines describing file provenance like the PG line from SAM/BAM.

We didn't implement the ability to parse multiple same-tagged header lines in vg's parser because we don't make GFA files that contain them, but we'll fix that now. Thanks for the heads up!

richarddurbin commented on September 24, 2024

Yes. I think we should allow multiple uses of the same tag at the syntactic level. For some this is natural, e.g. provenance/history.
For others which might set a global parameter, it will make less sense to have multiple versions, and the semantics are naturally
to use the last version. My implementation stores the whole array of tag-value pairs, and allows retrieval of all such pairs with a
particular tag in order of having been read/created.
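A minimal sketch of that storage model, not the implementation described above: every tag-value pair is kept in read order, all values for a tag can be retrieved, and "last value wins" is available for global parameters. The class name `HeaderTags`, its methods, and the example tags and values are all hypothetical.

```python
class HeaderTags:
    """Keeps every header tag-value pair in the order it was read (hypothetical helper)."""

    def __init__(self):
        self._pairs = []  # list of (tag, value) tuples, in read order

    def add(self, tag, value):
        """Record one tag-value pair; repeated tags are allowed."""
        self._pairs.append((tag, value))

    def get_all(self, tag):
        """Return every value stored for this tag, oldest first."""
        return [value for stored_tag, value in self._pairs if stored_tag == tag]

    def last(self, tag, default=None):
        """Return the most recently read value, for tags that act as a global parameter."""
        values = self.get_all(tag)
        return values[-1] if values else default


headers = HeaderTags()
headers.add("HS", "assemble reads.fq")   # illustrative provenance entries
headers.add("TS", "100")
headers.add("HS", "simplify graph.gfa")
headers.add("TS", "200")

print(headers.get_all("HS"))  # ['assemble reads.fq', 'simplify graph.gfa']
print(headers.last("TS"))     # 200
```

Keeping a flat list rather than a dictionary preserves the order across different tags as well, at the cost of a linear scan per lookup.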

lh3 commented on September 24, 2024

I think we should 1) only allow one header line in an entire GFA file; 2) disallow duplicated tag names per line.

richarddurbin commented on September 24, 2024

Why?

This is not what is done for SAM/BAM and VCF as far as I know. They both typically use multiple header lines I believe. Or would you say those are not header lines?

lh3 commented on September 24, 2024

SAM and VCF have only two types of lines: header lines and data lines. They have to squeeze a variety of information into the header. GFA is different. It has multiple types of specialized lines. There is little header information to keep other than the version number. And even this number is not always necessary.

sjackman commented on September 24, 2024

There are two Ruby GFA libraries. Perhaps you both would like to collaborate?

sjackman commented on September 24, 2024

I think we should 1) only allow one header line in an entire GFA file

If we allow multiple header lines, it's a nice feature that cat foo.gfa bar.gfa results in valid GFA (when there are no conflicting IDs) that is the union of the two graphs. As long as the header lines agree with each other (the same tags have the same values), I don't see any problem in accepting that as valid GFA.

The ABySS pipeline for example would have a GFA file of segments and links output by AdjList, another file of gap distance estimates output by DistanceEst, and a third file of paths output by SimpleGraph and MergePaths. Concatenating those three files would produce a single GFA file that contains all of the information known to ABySS at that point in the assembly.

The following example uses a hypothetical km:i tag that is the k-mer size of the de Bruijn graph.

Valid

H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:64
…

Invalid

H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:96
…
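The agreement rule above (the same tags must have the same values across header lines) can be sketched as follows. This is an illustration only: the function names are hypothetical, fields are assumed to be tab-separated, and the `S` line stands in for the elided records.

```python
def parse_header_tags(line):
    """Return (tag, type, value) triples from one tab-separated H line, or [] for other lines."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "H":
        return []
    # Split on the first two colons only, so values may themselves contain ':'.
    return [tuple(field.split(":", 2)) for field in fields[1:]]


def conflicting_tags(lines):
    """List tags that appear in several H lines with different types or values."""
    seen = {}        # tag -> (type, value) from the first H line that used it
    conflicts = []
    for line in lines:
        for tag, typ, value in parse_header_tags(line):
            if tag in seen and seen[tag] != (typ, value):
                conflicts.append(tag)
            seen.setdefault(tag, (typ, value))
    return conflicts


valid = ["H\tVN:Z:1.0\tkm:i:64", "S\t1\tACGT", "H\tVN:Z:1.0\tkm:i:64"]
invalid = ["H\tVN:Z:1.0\tkm:i:64", "S\t1\tACGT", "H\tVN:Z:1.0\tkm:i:96"]

print(conflicting_tags(valid))    # []
print(conflicting_tags(invalid))  # ['km']
```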

sjackman commented on September 24, 2024

How about this proposal?

  1. The first header H record must contain the VN:Z field.
  2. A file may contain multiple header H records.
  3. The tagged fields present in any two header H records must be identical.

ggonnella commented on September 24, 2024

@sjackman I agree on 1, I am neutral on 2, and don't like 3 very much. I think that a file format which allows several identical header lines is not very elegant. Concatenating files with identical H lines on the command line is still simple enough (e.g. grep -v '^H' file2 | cat file1 - ).
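The same idea, keeping the header lines of the first file and dropping them from the rest, might look like this in Python. The function name `concat_gfa` is hypothetical, and records are assumed to be tab-separated with the record type in the first column.

```python
def concat_gfa(first, *others):
    """Concatenate GFA line lists, keeping H records only from the first one."""
    merged = list(first)
    for lines in others:
        for line in lines:
            # The record type is the first tab-separated field.
            record_type = line.split("\t", 1)[0].rstrip("\n")
            if record_type != "H":
                merged.append(line)
    return merged


file1 = ["H\tVN:Z:1.0\n", "S\t1\tACGT\n"]
file2 = ["H\tVN:Z:1.0\n", "S\t2\tTTGA\n"]

for line in concat_gfa(file1, file2):
    print(line, end="")
```

Matching on the record type rather than on any line containing the letter H avoids dropping records whose names or sequences happen to include an H.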

My proposal:

  1. The first line of the file must be an H record and contain a VN:Z field
  2. A file may contain multiple H records; however, the tags in different H records must all be distinct

Point 2 allows for flexibility in formatting the data, but still removes the ambiguity caused by different values of the same tag in different H lines. If one wants to store multiple values associated with the same key, we have e.g. B or J tags for this.
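A rough sketch of a checker for the two points of this proposal. The function name and error messages are hypothetical, and fields are assumed to be tab-separated.

```python
def check_header_rules(lines):
    """Return a list of violations: first line must be H with VN:Z; no tag may recur across H records."""
    errors = []

    # Point 1: the first line is an H record carrying a VN:Z field.
    first = lines[0].rstrip("\n").split("\t") if lines else []
    if not first or first[0] != "H" or not any(f.startswith("VN:Z:") for f in first[1:]):
        errors.append("first line must be an H record with a VN:Z field")

    # Point 2: a tag used in one H record must not appear in a later H record.
    seen = set()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue
        line_tags = [field.split(":", 1)[0] for field in fields[1:]]
        for tag in line_tags:
            if tag in seen:
                errors.append(f"line {number}: tag {tag} already used in an earlier H record")
        seen.update(line_tags)
    return errors


ok = ["H\tVN:Z:1.0", "H\tkm:i:64", "S\t1\tACGT"]
bad = ["H\tVN:Z:1.0\tkm:i:64", "H\tkm:i:96"]

print(check_header_rules(ok))   # []
print(check_header_rules(bad))  # ['line 2: tag km already used in an earlier H record']
```

Duplicate tags within a single H line are deliberately not reported here, since the proposal only speaks about different H records.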

ggonnella commented on September 24, 2024

PS: my 👎 on the previous comment means I prefer B. I hope I got that right (?)

sjackman commented on September 24, 2024

Perhaps thumbs are too confusing when there are multiple options. Just comment here and I'll tally them up.

@edawson @lh3 @richarddurbin Do you prefer either of the two proposals, A or B?

richarddurbin commented on September 24, 2024

I think that the idea that we will not want to have any meta-information in the header is wrong. For example, Gene wants to set the tracepoint spacing, with TS:i:n. I want to be able to write the history of commands used to generate the file, for example with a series of HS:Z: entries. We might want to say where external identifiers can be found. In my experience, all serious file formats including GFF, BAM, VCF, MPEG, etc. end up with metadata in their headers. Certainly for GFA2 we will support multiple header lines. I am happy with a requirement that the header lines come first and the first of these specifies the version, though I note that unfortunately this means that @sjackman's nice idea that you can cat two gfa files together to get a new one won't work, at least not without being followed by a non-standard sort.

sjackman commented on September 24, 2024

Hi, Richard.

I think that the idea that we will not want to have any meta-information in the header is wrong.

I'm not saying that at all. I'm all for metadata in header lines. Both of the proposals above allow for multiple header lines with metadata in them. They exclude having multiple tags with different values. Multiple H TS:i lines with different values, for example, would be problematic.

stale commented on September 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
