Comments (15)
My suggested wording:
- The first line of the file must be a header H record and contain a version VN:Z field.
- A file may contain multiple header H records.
- Any particular tagged field must not be present in more than one header H record.
I'm fine with your proposal, @ggonnella. Everyone else? Please indicate with 👍 and 👎 whether you prefer
A. #38 (comment) The tagged fields present in any two header H records must be identical.
B. #38 (comment) Any particular tagged field must not be present in more than one header H record.
from gfa-spec.
I can now imagine this coming up quite often, especially for lines describing file provenance like the PG line from SAM/BAM.
We didn't implement the ability to parse multiple same-tagged header lines in vg's parser because we don't make GFA files that contain them, but we'll fix that now. Thanks for the heads up!
from gfa-spec.
Yes. I think we should allow multiple uses of the same tag at the syntactic level. For some this is natural, e.g. provenance/history.
For others which might set a global parameter, it will make less sense to have multiple versions, and the semantics are naturally
to use the last version. My implementation stores the whole array of tag-value pairs, and allows retrieval of all such pairs with a
particular tag in order of having been read/created.
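The storage scheme described here can be sketched roughly as follows. This is only an illustration, not the commenter's actual implementation: the class name HeaderFields and its methods are hypothetical, and it assumes tab-separated H lines with TAG:TYPE:VALUE fields.

```python
from typing import List, Optional, Tuple


class HeaderFields:
    """Hypothetical store that keeps every header tag-value pair in read order."""

    def __init__(self) -> None:
        # Each entry is (tag, type, value), appended in the order it was read.
        self._pairs: List[Tuple[str, str, str]] = []

    def add_line(self, line: str) -> None:
        """Parse one tab-separated H line and append its tagged fields."""
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            raise ValueError("not a header line")
        for field in fields[1:]:
            tag, type_, value = field.split(":", 2)
            self._pairs.append((tag, type_, value))

    def all(self, tag: str) -> List[str]:
        """Return every value recorded for a tag, oldest first."""
        return [value for t, _type, value in self._pairs if t == tag]

    def last(self, tag: str) -> Optional[str]:
        """Return the most recent value for a tag, or None if it never appeared."""
        values = self.all(tag)
        return values[-1] if values else None
```

With this layout, a provenance-style tag would be read with all(), while a global parameter would be read with last(), matching the "use the last version" semantics suggested above.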
from gfa-spec.
I think we should 1) only allow one header line in an entire GFA file; 2) disallow duplicated tag names per line.
from gfa-spec.
Why?
This is not what is done for SAM/BAM and VCF as far as I know. They both typically use multiple header lines I believe. Or would you say those are not header lines?
from gfa-spec.
SAM/VCF only have two types of lines: header lines or data lines. We have to squeeze a variety of information into the header. GFA is different. It has multiple types of specialized lines. There is little header information to keep other than the version number. And even this number is not always necessary.
from gfa-spec.
There are two Ruby GFA libraries. Perhaps you both would like to collaborate?
- @ggonnella https://github.com/ggonnella/rgfa https://peerj.com/preprints/2381/
- @lmrodriguezr https://github.com/lmrodriguezr/gfa
from gfa-spec.
I think we should 1) only allow one header line in an entire GFA file
If we allow multiple header lines, it's a nice feature that cat foo.gfa bar.gfa results in valid GFA (when there are no conflicting IDs) that is the union of the two graphs. As long as the header lines agree with each other (the same tags have the same values), I don't see any problem in accepting that as valid GFA.
The ABySS pipeline for example would have a GFA file of segments and links output by AdjList, another file of gap distance estimates output by DistanceEst, and a third file of paths output by SimpleGraph and MergePaths. Concatenating those three files would produce a single GFA file that contains all of the information known to ABySS at that point in the assembly.
The following example uses a hypothetical km:i tag that is the k-mer size of the de Bruijn graph.
Valid
H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:64
…
Invalid
H VN:Z:1.0 km:i:64
…
H VN:Z:1.0 km:i:96
…
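The rule illustrated by the example above (the same tag may repeat across H lines only if its value is the same) could be checked along these lines. This is a minimal sketch under stated assumptions: the function name conflicting_header_tags is hypothetical, and fields are assumed to be tab-separated as in GFA files, even though the example is shown with spaces.

```python
from typing import Dict, Iterable, List, Tuple


def conflicting_header_tags(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (tag, first_seen, conflicting) for header tags whose values disagree.

    The compared value is the "TYPE:VALUE" part of each field, so a change of
    type counts as a disagreement too.
    """
    seen: Dict[str, str] = {}
    conflicts: List[Tuple[str, str, str]] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue  # only header records are inspected
        for field in fields[1:]:
            tag, rest = field.split(":", 1)
            if tag in seen and seen[tag] != rest:
                conflicts.append((tag, seen[tag], rest))
            seen.setdefault(tag, rest)
    return conflicts
```

An empty result means all header lines agree, so a concatenation like the "Valid" case would pass and the "Invalid" case would report the km tag.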
from gfa-spec.
How about this proposal?
- The first header H record must contain the VN:Z field.
- A file may contain multiple header H records.
- The tagged fields present in any two header H records must be identical.
from gfa-spec.
@sjackman I agree on 1, I am neutral on 2, and don't like 3 very much. I think that having a file format which allows several identical header lines is not very elegant. Concatenating files with identical H lines on the command line is still simple enough (e.g. grep -v '^H' file2 | cat file1 -).
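For readers who prefer not to rely on a shell pipeline, the same idea (keep the first file whole, drop the H records of the second) might look like this. The function name concat_skipping_headers is hypothetical, and the sketch assumes tab-separated records whose first field is the record type.

```python
from typing import Iterable, Iterator


def concat_skipping_headers(first: Iterable[str], second: Iterable[str]) -> Iterator[str]:
    """Yield all lines of `first`, then the non-header lines of `second`."""
    yield from first
    for line in second:
        record_type = line.rstrip("\n").split("\t", 1)[0]
        if record_type != "H":
            yield line
```

Testing the record type field, rather than searching for the letter H anywhere in the line, avoids dropping segments whose name or sequence happens to contain an H.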
My proposal:
- The first line of the file must be an H record and contain a VN:Z field.
- A file may contain multiple H records; however, tags in different H records must all be different from each other.
Point 2 allows for flexibility in formatting the data, but still removes the ambiguity caused by different values of the tags in different H lines. If one wants to store multiple values associated with the same key, we have, e.g., B or J tags for this.
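A validator for the two points of this proposal could be sketched as below. It is only an illustration: the function name validate_headers_unique_tags and the error messages are hypothetical, and records are assumed to be tab-separated.

```python
from typing import Dict, List


def validate_headers_unique_tags(lines: List[str]) -> List[str]:
    """Check that line 1 is an H record with VN:Z and that no tag spans two H records."""
    errors: List[str] = []
    first = lines[0].rstrip("\n").split("\t") if lines else []
    if not first or first[0] != "H":
        errors.append("first line is not an H record")
    elif not any(field.startswith("VN:Z:") for field in first[1:]):
        errors.append("first H record has no VN:Z field")

    owner: Dict[str, int] = {}  # tag -> number of the first H line that used it
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue
        for field in fields[1:]:
            tag = field.split(":", 1)[0]
            if tag in owner and owner[tag] != number:
                errors.append(
                    f"tag {tag} on line {number} already used on line {owner[tag]}"
                )
            owner.setdefault(tag, number)
    return errors
```

Note that this sketch rejects a repeated tag even when the value is identical, which is the difference between this proposal and the "must be identical" wording.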
from gfa-spec.
PS: 👎 in the previous comment means I prefer B. I hope I got it right (?)
from gfa-spec.
Perhaps thumbs are too confusing when there are multiple options. Just comment here and I'll tally them up.
@edawson @lh3 @richarddurbin Do you prefer either of the two proposals, A or B?
from gfa-spec.
I think that the idea that we will not want to have any meta-information in the header is wrong. For example, Gene wants to set the tracepoint spacing, with TS:i:n. I want to be able to write the history of commands used to generate the file, for example with a series of HS:Z: entries. We might want to say where external identifiers can be found. In my experience, all serious file formats including GFF, BAM, VCF, MPEG, etc. end up with metadata in their headers. Certainly for GFA2 we will support multiple header lines. I am happy with a requirement that the header lines come first and the first of these specifies the version, though I note that unfortunately this means that @sjackman's nice idea that you can cat two gfa files together to get a new one won't work, at least not without being followed by a non-standard sort.
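The extra step mentioned at the end of this comment, moving header lines back to the top after concatenation, could be as small as a stable partition. This is a sketch only: the function names is_header and headers_first are hypothetical, and records are assumed to be tab-separated.

```python
from typing import Iterable, List


def is_header(line: str) -> bool:
    """True if the record type (first tab-separated field) is H."""
    return line.rstrip("\n").split("\t", 1)[0] == "H"


def headers_first(lines: Iterable[str]) -> List[str]:
    """Move all H records to the front, keeping relative order within each group."""
    lines = list(lines)
    headers = [line for line in lines if is_header(line)]
    others = [line for line in lines if not is_header(line)]
    return headers + others
```

This does not remove duplicate or conflicting header lines; it only restores the "headers come first" ordering.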
from gfa-spec.
Hi, Richard.
I think that the idea that we will not want to have any meta-information in the header is wrong.
I'm not saying that at all. I'm all for metadata in header lines. Both of the proposals above allow for multiple header lines with metadata in them. They exclude having multiple tags with different values. Multiple H TS:i lines with different values, for example, would be problematic.
from gfa-spec.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
from gfa-spec.