Comments (30)
@rrwick wrote…
One other, unrelated request if we're considering a new GFA version: an optional depth tag for segments. The current RC and KC tags count reads and k-mers, but that's not quite the same thing as read depth or k-mer depth.
I propose adding NC for nucleotide count, meaning the number of nucleotides that align to this segment. Dividing NC by the number of nucleotides in the segment gives the depth of coverage.
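As a rough illustration of that division, here is a minimal Python sketch. It assumes the proposed tag would be written as `NC:i:<count>` (the tag is only a proposal in this thread, not part of the spec), that fields are tab-separated, and that the segment length comes from the sequence or, when the sequence is `*`, from `LN:i`. The function name `segment_depth` is hypothetical.

```python
def segment_depth(s_line: str) -> float:
    """Return depth of coverage for one GFA S line using the proposed NC tag."""
    fields = s_line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "S":
        raise ValueError("not a segment line")
    sequence = fields[2]
    # Collect optional TAG:TYPE:VALUE fields into a dict keyed by tag name.
    tags = {}
    for field in fields[3:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    # Assumption: fall back to LN:i when the sequence is not stored.
    length = len(sequence) if sequence != "*" else int(tags["LN"])
    # NC is the hypothetical nucleotide-count tag proposed above.
    return int(tags["NC"]) / length


print(segment_depth("S\t1\tACGTACGTAC\tNC:i:250"))  # 25.0
```

With 250 aligned nucleotides on a 10 bp segment, the depth is 25.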
from gfa-spec.
@sjackman's idea for Bandage sounds very cool, but I cringe at the thought of implementing it! :)
Regarding number 4, Shaun's comments and the general topic of encoding variants in GFA, I'd be curious to hear people's thoughts on the other approach: encoding all variation explicitly with segments and links. For example, this hypothetical graph has a bubble in the first segment and another segment which joins in the middle (as discussed above in item 4).
S 1 C[TG]GGATGT
S 2 AGCA
# Some line here indicating that segment 2 joins into the middle of segment 1.
Whereas this describes the same arrangement using only segments and links:
S 1 C
S 2 T
S 3 G
S 4 GG
S 5 ATGT
S 6 AGCA
L 1 + 2 +
L 1 + 3 +
L 2 + 4 +
L 3 + 4 +
L 4 + 5 +
L 6 + 5 +
The second one is definitely more verbose, but (IMHO) conceptually simpler. It does, however, break up long contigs wherever variation occurs. Richard said (in email), 'This is one of the barriers to people adopting GFA, because many assemblers want to present long primary contigs', and I sympathise with that view.
So if the GFA format was to go with the segment-and-link-only approach, then I think people need to be encouraged to use path lines to describe long contigs. Perhaps we could also allow sequences in path lines. This would be somewhat redundant, but it would allow long contigs to be presented in their entirety:
P 1 1+,2+,4+,5+ CTGGATGT
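To show how the sequence on such a path line relates to its segments, here is a minimal Python sketch that spells a path by concatenating its oriented segments. It assumes the links have no overlap (as in the example above, which gives none) and the helper name `spell_path` is hypothetical.

```python
def spell_path(segments: dict, path: str) -> str:
    """Concatenate the segments named in a comma-separated oriented path."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    parts = []
    for step in path.split(","):
        name, orient = step[:-1], step[-1]
        seq = segments[name]
        if orient == "-":
            # Reverse-complement segments traversed backwards.
            seq = seq.translate(complement)[::-1]
        elif orient != "+":
            raise ValueError(f"bad orientation in step {step!r}")
        parts.append(seq)
    # Assumption: adjacent segments abut with zero overlap.
    return "".join(parts)


segments = {"1": "C", "2": "T", "3": "G", "4": "GG", "5": "ATGT", "6": "AGCA"}
print(spell_path(segments, "1+,2+,4+,5+"))  # CTGGATGT
print(spell_path(segments, "1+,3+,4+,5+"))  # CGGGATGT
```

The two outputs are the two alleles of the bubble written as C[TG]GGATGT in the first example.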
My task with Bandage is then to visualise graph paths in a useful manner. That has actually long been on my to-do list and it will (I promise!) happen someday :)
Thoughts on this? My personal preference is for the segment-and-link-only approach, though I'm willing to be swayed. What use-cases do people have that would be awkward with only segments, links and paths?
from gfa-spec.
One other, unrelated request if we're considering a new GFA version: an optional depth tag for segments. The current RC and KC tags count reads and k-mers, but that's not quite the same thing as read depth or k-mer depth. In an Illumina set, the reads may all mostly have the same length, in which case read count and read depth would correlate tightly. But for long reads where there is a lot more variation in length, that's not true. As Richard said at the start, GFA should be updated to work well with long reads, and I think a depth tag (something like DP:f) would be a nice option to have.
from gfa-spec.
- move CIGAR strings from required fields to CG:Z: optional fields, replacing them by required lengths in the two segments in the Link lines and equivalents elsewhere. CIGAR is not optimal or necessary for long single molecule technology reads.
CIGAR strings are optional, and I agree they are not the best representation for read-read overlaps. To omit one, replace the CIGAR string with an asterisk (*). I suggest that we change "CIGAR string describing overlap" to "optional CIGAR string describing overlap".
See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#required-fields-1
Edit: I've clarified that segment sequence and link CIGAR are optional in e70a22d
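For parser authors, a minimal Python sketch of reading a link line whose overlap may be `*`. It assumes tab-separated GFA 1 fields in the order record type, from, from orientation, to, to orientation, overlap. The `Link` type and `parse_link` function are hypothetical names.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    from_segment: str
    from_orient: str
    to_segment: str
    to_orient: str
    overlap: Optional[str]  # CIGAR string, or None when the field is "*"


def parse_link(line: str) -> Link:
    """Parse one L line, treating an asterisk overlap as 'not specified'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6 or fields[0] != "L":
        raise ValueError("not a link line")
    overlap = None if fields[5] == "*" else fields[5]
    return Link(fields[1], fields[2], fields[3], fields[4], overlap)


print(parse_link("L\t1\t+\t2\t+\t*").overlap)    # None
print(parse_link("L\t1\t+\t2\t-\t5M").overlap)   # 5M
```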
from gfa-spec.
- require a length for Segment fields, before the (optional) string for the sequence
Parsing dynamic fields is not onerous, as argued by Heng Li (@lh3) in private communication. The sequence column is optional. Replace it with an asterisk (*) as desired. I suggest that we change "The nucleotide sequence" to "optional nucleotide sequence".
When the optional sequence field is not present, the optional tags should include the sequence length tag, LN:i. ABySS (@sjackman), for example, stores the nucleotide sequences in an indexed FASTA file (my preference) and stores only their lengths in the GFA file. Bandage (@rrwick) happily reads this format.
See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#required-fields
and https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#optional-fields-2
Edit: I've clarified that segment sequence and link CIGAR are optional in e70a22d
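A minimal Python sketch of how a reader might obtain a segment's length under this convention. It assumes tab-separated fields. Rejecting a segment whose `LN:i` disagrees with its stored sequence is my own assumption, not something decided in this thread, and `segment_length` is a hypothetical name.

```python
def segment_length(s_line: str) -> int:
    """Return a segment's length from its sequence or, failing that, LN:i."""
    fields = s_line.rstrip("\n").split("\t")
    if len(fields) < 3 or fields[0] != "S":
        raise ValueError("not a segment line")
    sequence = fields[2]
    ln = None
    for field in fields[3:]:
        if field.startswith("LN:i:"):
            ln = int(field[len("LN:i:"):])
    if sequence == "*":
        if ln is None:
            raise ValueError("segment has neither a sequence nor an LN:i tag")
        return ln
    # Assumption: treat a mismatch between LN and the sequence as an error.
    if ln is not None and ln != len(sequence):
        raise ValueError("LN:i disagrees with the sequence length")
    return len(sequence)


print(segment_length("S\tcontig1\t*\tLN:i:1500"))  # 1500
print(segment_length("S\tcontig2\tACGT"))          # 4
```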
from gfa-spec.
- introduce a new optional Piece record, that says how Segments are made from Pieces with external ids. e.g. how reads align to a unitig, or how unitigs align to later longer contigs/scaffolds. This sorts traceability through an assembly process.
I would prefer to use the widely accepted SAM/BAM format to express the alignment of reads to segments. My preference is to store the SAM/BAM file separately, but if the motivation is to include all the relevant information in a single file, we can discuss how to encapsulate SAM records in a GFA file with a SAM record type within the GFA format.
from gfa-spec.
- support branches where a Segment end inserts into the middle of another Segment. This supports alternates of bubbles, so allowing long primary called contigs with variants. As in ALTs on the human reference, or the results of bubble detection and primary sequence choice to achieve long contiguous primary assembly contigs. We do this by extending the C (Contain) syntax.
When the bubble is a closed bubble as defined in Heng's Fermi paper (doi:10.1093/bioinformatics/bts280) I would like to see a self-contained subgraph also described by GFA within the primary assembly GFA file (though not quite sure how this hierarchical graph would look in its actual implementation). This issue becomes trickier I think when the bubble is not self-contained, that is, when there are links between subgraphs.
Within Bandage (@rrwick), for example, the user could select the contig and zoom in to see the unitigs and collapsed bubbles that compose that contig, also described by GFA.
from gfa-spec.
@aphillippy @cschin @jts @lh3 @pb-jchin @pmelsted @richarddurbin @rrwick @thegenemyers
from gfa-spec.
And as a side note, I'm happy to see GFA getting more attention lately and want to see it become useful to as many people as possible. A couple times I've talked to somebody working with sequence graphs (in their own custom format) and I tell them, 'Use GFA!'
from gfa-spec.
Here are some thoughts about "piece" and "path". In my experience, it is useful to have a way to store a set of "links". For example, I would like to be able to run queries like "give me all links that are 'associated' with a contig". The set of "links" associated with a contig may be more than a path. For example, some spurs or bubbles removed during graph-to-contig layout may be thrown away, but they would be useful for diploid assembly or for assessing the remaining ambiguities related to the contig. It would be useful to have some way to store such subsets, although we can discuss whether the information about such sets needs to be stored in the same GFA file.
To track the discussion related to this topic, I will open a new issue.
from gfa-spec.
Hi all. I think the horse is out of the barn a bit early. I'm putting the finishing touches on the proposal and maybe it would be a good idea for all to see the proposal rather than Richard's preliminary list of features which was only intended to tell you what we were up to. I've been reading the discussion and have already made some adjustments to the doc I am working with based on what I've read. I'll try to turn my word doc into a suitable .md doc and put it in a repository that we can then work versions on. Sound OK?
from gfa-spec.
I agree that it would be good to see the big picture roadmap, although Richard's feature list hinted at that, before splitting it up into discrete tasks.
Also a google doc can be easier to view/share/modify than in a repo.
from gfa-spec.
There was an open discussion about this all in the Assembly discussion group here at Dagstuhl this morning. As can be expected, there is tension between conservative approaches keeping as consistent as possible with GFA version 1, and more radical approaches that change more.
Some comments are:
- the XY:: syntax is a mistake in the text format (fine in binary). Instead we should infer the type from the string for the value, as modern programming languages do, such as requiring a decimal point for float, double quotes for a text string, single quotes for a char, 0x prefix for hex, some other prefix for JSON etc. Without this you risk (and get) inconsistency. [Pascal Costanza - Intel]. Personally I feel that this does not really solve inconsistencies - you can still open a double quote and not close it, etc.
- related to this, in trying to implement a C parser I myself have a problem with array type B. How do I know whether the elements are ints or floats? Can there be different symbols for these?
- how to represent phased haplotypes? It was suggested that these can be segments, in which case we would like to indicate where they align.
- this moved to a discussion to extend the Contain/Branch join record even further, to allow an arbitrary local alignment of a subsegment of s1 to a subsegment of s2. That could also subsume Link. There was to and fro argument about whether to keep Link special-cased.
- subsequently Gene proposed to add a scaffolding record, i.e. a pair of segments with orientations and some gap size estimate. This should be different from a Link, even a zero gap end-butting link (which should be a Link).
from gfa-spec.
I will get the doc up as soon as possible. While a Google Doc seems reasonable, it won't allow version control. I think the way to go is to create a fork and then deposit the proposal as an .md file in the fork. Please give me a chance to put the doc up so y'all can really see what the proposal is.
from gfa-spec.
- introduce a new optional Piece record, that says how Segments are made from Pieces with external ids. e.g. how reads align to a unitig, or how unitigs align to later longer contigs/scaffolds. This sorts traceability through an assembly process.
What is the typical way of storing alignments in GFA? I'm thinking of both sequences embedded in the graph (pieces/parts/paths) and external sets of alignments that describe differences from the graph (akin to BAM). Could these be represented the same way?
from gfa-spec.
OK, I've created a fork and you can find a first version of the GFA2 proposal here:
https://github.com/thegenemyers/GFA-spec
The new spec is GFA2-spec.md.
Please read it before commenting further. Perhaps we could also start another issue thread in direct response to the document at my fork?
Cheers, Gene
from gfa-spec.
the XY:: syntax is a mistake in the text format (fine in binary). Instead we should infer the type from the string for the value, as modern programming languages do, such as requiring a decimal point for float, double quotes for a text string, single quotes for a char, 0x prefix for hex, some other prefix for JSON etc. Without this you risk (and get) inconsistency. [Pascal Costanza - Intel]. Personally I feel that this does not really solve inconsistencies - you can still open a double quote and not close it, etc.
I proposed exactly this concept for the type system, and the proposal was voted down. I'm reasonably happy with either system.
I believe this discussion took place in private communication before this GitHub issue existed. It's a good example though of why each proposal needs its own issue, so that we can easily refer back to previous proposals that have been made, voted on, and decided. The audience of GFA is clearly larger now, so it is reasonable to revisit decided issues, but there needs to be an overwhelming argument to reopen them. There are enough implementations of GFA now that breaking backward compatibility should be avoided if at all possible.
from gfa-spec.
related to this, in trying to implement a C parser I myself have a problem with array type B. How do I know whether the elements are ints or floats?
The array type B is an array of bytes, that is uint8_t, an unsigned 8-bit integer.
Edit: My mistake! I was going on memory and was recalling the H type, which is an array of bytes. The B type is indeed an array of integers or floats. This type is taken directly from the SAM spec.
Can there be different symbols for these?
This issue has been discussed and decided. The solution was to use JSON for complex types.
Edit: The first character indicates whether it's an array of integers or floats. Arrays of other types can be represented using JSON. See #15 and #18.
from gfa-spec.
subsequently Gene proposed to add a scaffolding record, i.e. a pair of segments with orientations and some gap size estimate. This should be different from a Link, even a zero gap end-butting link (which should be a Link).
👍 I proposed adding distance estimates to the link L record in #9. The issue is open and undecided.
from gfa-spec.
@rrwick wrote…
So if the GFA format was to go with the segment-and-link-only approach, then I think people need to be encouraged to use path lines to describe long contigs.
👍 If useful, containment records C could show how the atomic segments S align to the path P.
from gfa-spec.
OK, I've created a fork and you can find a first version of the GFA2 proposal here:
https://github.com/thegenemyers/GFA-spec
@thegenemyers Hi, Gene. Could you please open a pull request? That'll create a new issue on this repository where we can discuss your proposal. We can then comment on individual lines of the pull request.
There's lots of great text in this proposal describing the concepts and motivations. Thank you! I'd like to see whether we can maintain backward compatibility and still express the relationships that you describe by adding optional attributes to the existing GFA records. The question that needs to be answered is whether these relationships cannot be represented at all in the current GFA spec, or whether they can be represented but only inconveniently or inefficiently.
@lh3 Heng has opened issue #33 to discuss the E and G records.
from gfa-spec.
Ok. Makes sense. But that is not what the GFA spec says. It says "integer or numeric array" and allows decimal points in its regexp. What is the correct code for an array of ints? Maybe I should look at the SAM or VCF spec.
Richard
from gfa-spec.
I agree there is not enough case to change this. I was just reporting that it was raised.
from gfa-spec.
My mistake! I was going on memory and was recalling the H type, which is an array of bytes. The B type is indeed an array of integers or floats. This type is taken directly from the SAM spec. The first character indicates whether it's an integer type or float type.
For an integer or numeric array (type ‘B’), the first letter indicates the type of numbers in the following comma separated array. The letter can be one of ‘cCsSiIf’, corresponding to int8_t (signed 8-bit integer), uint8_t (unsigned 8-bit integer), int16_t, uint16_t, int32_t, uint32_t and float, respectively. During import/export, the element type may be changed if the new type is also compatible with the array.
I've added this text to the GFA spec in 8554074
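For anyone writing a parser, a minimal Python sketch of reading the value of a B field using that first letter. It takes the value after the `TAG:B:` prefix, does not range-check integers against the declared width, and the name `parse_b_array` is hypothetical.

```python
INTEGER_CODES = set("cCsSiI")  # int8, uint8, int16, uint16, int32, uint32


def parse_b_array(value: str) -> list:
    """Parse the value part of a B field, e.g. 'i,1,2,3' or 'f,0.5,1.5'."""
    code, *items = value.split(",")
    if code in INTEGER_CODES:
        return [int(item) for item in items]
    if code == "f":
        return [float(item) for item in items]
    raise ValueError(f"unknown array element type {code!r}")


print(parse_b_array("i,1,2,3"))    # [1, 2, 3]
print(parse_b_array("f,0.5,1.5"))  # [0.5, 1.5]
```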
from gfa-spec.
I've clarified that segment sequence and link CIGAR are optional fields in e70a22d
from gfa-spec.
I think that if people are going to report the long contigs as contigs then they had better be represented as contigs in the final GFA file, that is in Segment records.
Richard
from gfa-spec.
A Path P record could have an accompanying segment S record giving its sequence for this use case.
Incidentally, it's possible to create a FASTA index (.fai) file of a GFA file (the GFA file itself, not an accompanying FASTA file) such that samtools faidx (and I presume other tools) can extract sequences from the GFA file! Maybe just a parlour trick, but I think that's pretty cool.
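One way this trick might be done, sketched in Python under stated assumptions: each .fai line holds name, length, byte offset of the first base, bases per line and bytes per line, and each S line stores its sequence on a single line, so the offset points just past the second tab. The function name `gfa_to_fai` is hypothetical, and whether a particular samtools version accepts the result is not verified here.

```python
def gfa_to_fai(gfa_bytes: bytes) -> str:
    """Build FASTA-index lines that point at the sequences of GFA S lines."""
    entries = []
    offset = 0  # byte offset of the start of the current line
    for line in gfa_bytes.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(b"\t")
        if len(fields) >= 3 and fields[0] == b"S" and fields[2] != b"*":
            name, seq = fields[1], fields[2]
            # The sequence starts after "S", a tab, the name and another tab.
            seq_offset = offset + 2 + len(name) + 1
            # One line per sequence: bases per line equals the length, and
            # bytes per line is set to one more (an assumption that works
            # because every base sits on the first and only line).
            entries.append(
                f"{name.decode()}\t{len(seq)}\t{seq_offset}\t{len(seq)}\t{len(seq) + 1}"
            )
        offset += len(line)
    return "".join(entry + "\n" for entry in entries)


gfa = b"H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\t*\tLN:i:5\nS\t3\tGG\tRC:i:3\n"
print(gfa_to_fai(gfa), end="")
```

Segments whose sequence is `*` are skipped, since there is nothing in the file to index.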
from gfa-spec.
After rereading my earlier responses, some seemed terse verging on rude. Sorry if I put anyone off! It wasn't my intent. I'm very excited to see such interest in GFA.
I'll be out of town at a friend's wedding from Friday until Tuesday. I look forward to catching up on the discussion when I get back.
from gfa-spec.
Ping Giorgio Gonnella (@ggonnella) of the Ruby library for handling GFA files:
https://github.com/ggonnella/rgfa
from gfa-spec.