Hello all, I am at Schloss Dagstuhl with Gene Myers, Jason Chin, Ada


GFA version 2 with branches, external ids, and non-mandatory CIGAR (gfa-spec/gfa-spec)

Comments (30)

sjackman commented on September 24, 2024

One other, unrelated request if we're considering a new GFA version: an optional depth tag for segments. The current RC and KC tags count reads and k-mers, but that's not quite the same thing as read depth or k-mer depth.

I propose adding NC for nucleotide count, meaning the number of nucleotides that align to this segment. Dividing NC by the number of nucleotides in the segment gives the depth of coverage.
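A minimal Python sketch of that arithmetic, for illustration only. NC is just a proposal in this thread, `depth_of_coverage` is a hypothetical helper, and the read lengths are made-up numbers chosen to show two segments with the same read count but different depths.

```python
def depth_of_coverage(nucleotide_count: int, segment_length: int) -> float:
    """Depth = nucleotides aligned to the segment / segment length.

    nucleotide_count stands in for the proposed NC tag (an assumption,
    not part of the current spec).
    """
    if segment_length <= 0:
        raise ValueError("segment length must be positive")
    return nucleotide_count / segment_length


segment_length = 10_000

# Both segments have a read count (RC) of 3, but the aligned lengths differ.
short_read_alignments = [150, 150, 150]
long_read_alignments = [150, 4_000, 9_000]

depth_short = depth_of_coverage(sum(short_read_alignments), segment_length)
depth_long = depth_of_coverage(sum(long_read_alignments), segment_length)
```

With equal read counts the two depths differ by more than an order of magnitude, which is the gap between RC and a true depth measure.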

from gfa-spec.

rrwick commented on September 24, 2024

@sjackman's idea for Bandage sounds very cool, but I cringe at the thought of implementing it! :)

Regarding number 4, Shaun's comments and the general topic of encoding variants in GFA, I'd be curious to hear people's thoughts on the other approach: encoding all variation explicitly with segments and links. For example, this hypothetical graph has a bubble in the first segment and another segment which joins in the middle (as discussed above in item 4).

S   1   C[TG]GGATGT
S   2   AGCA
# Some line here indicating that segment 2 joins into the middle of segment 1.

Whereas this describes the same arrangement using only segments and links:

S   1   C
S   2   T
S   3   G
S   4   GG
S   5   ATGT
S   6   AGCA
L   1   +   2   +
L   1   +   3   +
L   2   +   4   +
L   3   +   4   +
L   4   +   5   +
L   6   +   5   +

The second one is definitely more verbose, but (IMHO) conceptually simpler. It does, however, break up long contigs wherever variation occurs. Richard said (in email), 'This is one of the barriers to people adopting GFA, because many assemblers want to present long primary contigs', and I sympathise with that view.

So if the GFA format was to go with the segment-and-link-only approach, then I think people need to be encouraged to use path lines to describe long contigs. Perhaps we could also allow sequences in path lines. This would be somewhat redundant, but it would allow long contigs to be presented in their entirety:

P   1   1+,2+,4+,5+ CTGGATGT
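As a rough sketch of how a tool could spell out a contig from such a path line, here is a small Python example using the six segments from the graph above. It assumes blunt (zero-overlap) links, and `spell_path` and `reverse_complement` are hypothetical helper names, not part of any GFA library.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a nucleotide sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def spell_path(segments: dict[str, str], path: str) -> str:
    """Concatenate oriented segments along a path such as '1+,2+,4+,5+'.

    Assumes every link on the path is blunt (no overlap to trim).
    """
    pieces = []
    for step in path.split(","):
        name, orientation = step[:-1], step[-1]
        if orientation not in "+-":
            raise ValueError(f"bad orientation in step {step!r}")
        seq = segments[name]
        pieces.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(pieces)


segments = {"1": "C", "2": "T", "3": "G", "4": "GG", "5": "ATGT", "6": "AGCA"}

primary = spell_path(segments, "1+,2+,4+,5+")    # the path line above
alternate = spell_path(segments, "1+,3+,4+,5+")  # the other side of the bubble
```

The primary path reproduces the sequence given in the P line, so the sequence column there is indeed redundant with the segments.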

My task with Bandage is then to visualise graph paths in a useful manner. That has actually long been on my to-do list and it will (I promise!) happen someday :)

Thoughts on this? My personal preference is for the segment-and-link-only approach, though I'm willing to be swayed. What use-cases do people have that would be awkward with only segments, links and paths?


rrwick commented on September 24, 2024

One other, unrelated request if we're considering a new GFA version: an optional depth tag for segments. The current RC and KC tags count reads and k-mers, but that's not quite the same thing as read depth or k-mer depth. In an Illumina set, the reads may all mostly have the same length, in which case read count and read depth would correlate tightly. But for long reads where there is a lot more variation in length, that's not true. As Richard said at the start, GFA should be updated to work well with long reads, and I think a depth tag (something like DP:f) would be a nice option to have.


sjackman commented on September 24, 2024

move CIGAR strings from required fields to CG:Z: optional fields, replacing them by required lengths in the two segments in the Link lines and equivalents elsewhere. CIGAR is not optimal or necessary for long single molecule technology reads.

CIGAR strings are optional, and I agree they are not the best representation for read-read overlaps. To omit one, replace the CIGAR string with an asterisk *. I suggest that we change the field description "CIGAR string describing overlap" to "optional CIGAR string describing overlap".
See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#required-fields-1
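For parser authors, handling the optional overlap could look roughly like the Python sketch below. `parse_overlap` is a hypothetical helper, and the set of accepted operation letters is taken from SAM-style CIGAR as an assumption; the GFA spec may restrict it further.

```python
import re
from typing import Optional

# SAM-style CIGAR operations; the exact allowed set is an assumption here.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHPX=])")


def parse_overlap(field: str) -> Optional[list[tuple[int, str]]]:
    """Return None when the overlap is '*', else a list of (length, op) pairs."""
    if field == "*":
        return None
    ops = CIGAR_OP.findall(field)
    # Reject strings with leftover characters that the regex skipped over.
    if not ops or "".join(length + op for length, op in ops) != field:
        raise ValueError(f"malformed CIGAR string: {field!r}")
    return [(int(length), op) for length, op in ops]
```

A caller can then treat `None` as "overlap not specified" rather than as a zero-length overlap.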

Edit: I've clarified that segment sequence and link CIGAR are optional in e70a22d


sjackman commented on September 24, 2024

require a length for Segment fields, before the (optional) string for the sequence

Parsing dynamic fields is not onerous, as argued by Heng Li (@lh3) in private communication. The sequence column is optional. Replace it with an asterisk * as desired. I suggest that we change the field description "The nucleotide sequence" to "optional nucleotide sequence".

When the optional sequence field is not present, the optional tags should include the sequence length tag: LN:i. ABySS (@sjackman) for example stores the nucleotide sequences in an indexed FASTA file (my preference) and stores only their lengths in the GFA file. Bandage (@rrwick) happily reads this format.
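A minimal Python sketch of reading a segment's length under this convention. `segment_length` is a hypothetical helper, and the example lines are made up; it assumes tab-separated fields as in the spec.

```python
def segment_length(line: str) -> int:
    """Length of an S record: from the sequence if present, else from LN:i."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "S" or len(fields) < 3:
        raise ValueError("not a segment line")
    sequence, tags = fields[2], fields[3:]
    if sequence != "*":
        return len(sequence)
    # Sequence omitted: fall back to the LN:i optional tag.
    for tag in tags:
        if tag.startswith("LN:i:"):
            return int(tag[len("LN:i:"):])
    raise ValueError("sequence is '*' and no LN:i tag is present")


with_sequence = segment_length("S\tctg2\tACGT")
without_sequence = segment_length("S\tctg1\t*\tLN:i:1234")
```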

See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#required-fields
and https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#optional-fields-2

Edit: I've clarified that segment sequence and link CIGAR are optional in e70a22d


sjackman commented on September 24, 2024

introduce a new optional Piece record, that says how Segments are made from Pieces with external ids. e.g. how reads align to a unitig, or how unitigs align to later longer contigs/scaffolds. This sorts traceability through an assembly process.

I would prefer to use the widely accepted SAM/BAM format to express the alignment of reads to segments. My preference is to store the SAM/BAM file separately, but if the motivation is to include all the relevant information in a single file, we can discuss how to encapsulate SAM records in a GFA file with a SAM record type within the GFA format.


sjackman commented on September 24, 2024

support branches where a Segment end inserts into the middle of another Segment. This supports alternates of bubbles, so allowing long primary called contigs with variants. As in ALTs on the human reference, or the results of bubble detection and primary sequence choice to achieve long contiguous primary assembly contigs. We do this by extending the C (Contain) syntax.

When the bubble is a closed bubble as defined in Heng's Fermi paper (doi:10.1093/bioinformatics/bts280) I would like to see a self-contained subgraph also described by GFA within the primary assembly GFA file (though not quite sure how this hierarchical graph would look in its actual implementation). This issue becomes trickier I think when the bubble is not self-contained, that is, when there are links between subgraphs.

Within Bandage (@rrwick), for example, the user could select the contig and zoom in to see the unitigs and collapsed bubbles that compose that contig, also described by GFA.


sjackman commented on September 24, 2024

@aphillippy @cschin @jts @lh3 @pb-jchin @pmelsted @richarddurbin @rrwick @thegenemyers


rrwick commented on September 24, 2024

And as a side note, I'm happy to see GFA getting more attention lately and want to see it become useful to as many people as possible. A couple times I've talked to somebody working with sequence graphs (in their own custom format) and I tell them, 'Use GFA!'


pb-jchin commented on September 24, 2024

Here are some thoughts about "piece" and "path". Basically, in my experience, I find that it is useful to have a way to store a set of "links". For example, I would like to be able to do queries like "give me all links that are 'associated' with a contig". The set of "links" associated with a contig may be more than a path. For example, some spurs or bubbles removed during graph-to-contig layout may be thrown away, but they would be useful for diploid assembly or for assessing the remaining ambiguities related to the contig. It would be useful if we had some way to store such subsets, although I think we can discuss whether we need to store the information of such sets in the same GFA file.

To track the discussion related to this, I will open a new issue.


thegenemyers commented on September 24, 2024

Hi all. I think the horse is out of the barn a bit early. I'm putting the finishing touches on the proposal and maybe it would be a good idea for all to see the proposal rather than Richard's preliminary list of features which was only intended to tell you what we were up to. I've been reading the discussion and have already made some adjustments to the doc I am working with based on what I've read. I'll try to turn my word doc into a suitable .md doc and put it in a repository that we can then work versions on. Sound OK?


pmelsted commented on September 24, 2024

I agree that it would be good to see the big picture roadmap, although Richard's feature list hinted at that, before splitting it up into discrete tasks.

Also a google doc can be easier to view/share/modify than in a repo.


richarddurbin commented on September 24, 2024

There was an open discussion about this all in the Assembly discussion group here at Dagstuhl this morning. As can be expected, there is tension between conservative approaches keeping as consistent as possible with GFA version 1, and more radical approaches that change more.

Some comments are:
- the XY:: syntax is a mistake in the text format (fine in binary). Instead we should infer the type from the string for the value, as modern programming languages do, such as requiring a decimal point for float, double quotes for a text string, single quotes for a char, 0x prefix for hex, some other prefix for JSON etc. Without this you risk (and get) inconsistency. [Pascal Costanza - Intel]. Personally I feel that this does not really solve inconsistencies - you can still open a double quote and not close it, etc.

     - related to this, in trying to implement a C parser I myself have a problem with array type B.  How do I know whether the elements are ints or floats?  Can there be different symbols for these?

- how to represent phased haplotypes? It was suggested that these can be segments, in which case we would like to indicate where they align.
- this moved to a discussion to extend the Contain/Branch join record even further, to allow an arbitrary local alignment of a subsegment of s1 to a subsegment of s2. That could also subsume Link. There was to and fro argument about whether to keep Link special-cased.
- subsequently Gene proposed to add a scaffolding record, i.e. a pair of segments with orientations and some gap size estimate. This should be different from a Link, even a zero gap end-butting link (which should be a Link).
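To make the type-inference suggestion from the first bullet concrete, here is a rough Python sketch. It is only one possible reading of the idea: `infer_type` is a hypothetical function, the returned type names are made up, and the JSON prefix is left out because the thread does not say what it would be.

```python
def infer_type(value: str):
    """Guess a tag's type from how its value is written.

    Returns a (type_name, parsed_value) pair. Raises ValueError for values
    that fit none of the rules, e.g. an unterminated quote.
    """
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return ("string", value[1:-1])
    if len(value) == 3 and value[0] == "'" and value[-1] == "'":
        return ("char", value[1])
    if value.lower().startswith("0x"):
        return ("hex", int(value, 16))
    if "." in value:
        return ("float", float(value))
    return ("int", int(value))
```

An unterminated quote such as `"abc` falls through to the integer rule and raises `ValueError`, so malformed values are still possible under this scheme; they are merely detected at parse time.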


thegenemyers commented on September 24, 2024

I will get the doc up as soon as possible. While a Google Doc seems reasonable, it won't allow version control. I think the way to go is to create a fork and then deposit the proposal as an .md file in the fork. Please give me a chance to put the doc up so y'all can really see what the proposal is.


ekg commented on September 24, 2024

introduce a new optional Piece record, that says how Segments are made from Pieces with external ids. e.g. how reads align to a unitig, or how unitigs align to later longer contigs/scaffolds. This sorts traceability through an assembly process.

What is the typical way of storing alignments in GFA? I'm thinking of both sequences embedded in the graph (pieces/parts/paths) and external sets of alignments that describe differences from the graph (akin to BAM). Could these be represented the same way?


thegenemyers commented on September 24, 2024

OK, I've created a fork and you can find a first version of the GFA2 proposal here:

https://github.com/thegenemyers/GFA-spec

The new spec is GFA2-spec.md.

Please read it before commenting further. Perhaps we could also start another issue thread in direct response to the document at my fork?

Cheers, Gene


sjackman commented on September 24, 2024

the XY:: syntax is a mistake in the text format (fine in binary). Instead we should infer the type from the string for the value, as modern programming languages do, such as requiring a decimal point for float, double quotes for a text string, single quotes for a char, 0x prefix for hex, some other prefix for JSON etc. Without this you risk (and get) inconsistency. [Pascal Costanza - Intel]. Personally I feel that this does not really solve inconsistencies - you can still open a double quote and not close it, etc.

I proposed exactly this concept for the type system, and the proposal was voted down. I'm reasonably happy with either system.

I believe this discussion took place in private communication before this GitHub issue existed. It's a good example though of why each proposal needs its own issue, so that we can easily refer back to earlier proposals that have been discussed, voted on, and decided. The audience of GFA is clearly larger now, so it is reasonable to revisit decided issues, but there needs to be an overwhelming argument to reopen them. There are enough implementations of GFA now that breaking backward compatibility should be avoided if at all possible.


sjackman commented on September 24, 2024

related to this, in trying to implement a C parser I myself have a problem with array type B. How do I know whether the elements are ints or floats?

~~The array type B is an array of bytes, that is uint8_t, an unsigned 8-bit integer.~~

Edit: My mistake! I was going on memory and was recalling the H type, which is an array of bytes. The B type is indeed an array of integer or float. This type is taken directly from the SAM spec.

Can there be different symbols for these?

~~This issue has been discussed and decided. The solution was to use JSON for complex types.~~

Edit: The first character indicates whether it's an array of integers or floats. Arrays of other types can be represented using JSON. See #15 and #18


sjackman commented on September 24, 2024

subsequently Gene proposed to add a scaffolding record, i.e. a pair of segments with orientations and some gap size estimate. This should be different from a Link, even a zero gap end-butting link (which should be a Link).

👍 I proposed adding distance estimates to the link L record in #9. The issue is open and undecided.


sjackman commented on September 24, 2024

@rrwick wrote…

So if the GFA format was to go with the segment-and-link-only approach, then I think people need to be encouraged to use path lines to describe long contigs.

👍 If useful, containment records C could show how the atomic segments S align to the path P.


sjackman commented on September 24, 2024

OK, I've created a fork and you can find a first version of the GFA2 proposal here:
https://github.com/thegenemyers/GFA-spec

@thegenemyers Hi, Gene. Could you please open a pull request? That'll create a new issue on this repository where we can discuss your proposal. We can then comment on individual lines of the pull request.

There's lots of great text in this proposal describing the concepts and motivations. Thank you! I'd like to see whether we can maintain backward compatibility and achieve the ability to express the relationships that you describe by adding optional attributes to the existing GFA records. The question to answer is whether these relationships cannot be represented in the current GFA spec at all, or whether they can be represented but only inconveniently or inefficiently.

@lh3 Heng has opened issue #33 to discuss the E and G records.


richarddurbin commented on September 24, 2024

Ok. Makes sense. But that is not what the GFA spec says. It says "integer or numeric array" and allows decimal points in its regexp. What is the correct code for an array of ints? Maybe I should look at the SAM or VCF spec.

Richard


richarddurbin commented on September 24, 2024

I agree there is not enough case to change this. I was just reporting that it was raised.


sjackman commented on September 24, 2024

My mistake! I was going on memory and was recalling the H type, which is an array of bytes. The B type is indeed an array of integer or float. This type is taken directly from the SAM spec. The first character indicates whether it's an integer type or float type.

For an integer or numeric array (type ‘B’), the first letter indicates the type of numbers in the following comma separated array. The letter can be one of ‘cCsSiIf’, corresponding to int8_t (signed 8-bit integer), uint8_t (unsigned 8-bit integer), int16_t, uint16_t, int32_t, uint32_t and float, respectively. During import/export, the element type may be changed if the new type is also compatible with the array.
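A small Python sketch of a parser for such a value, meaning the part after the `B:` type code. `parse_b_array` and `INT_RANGES` are hypothetical names, and the range checks are my own addition based on the widths listed above.

```python
# Inclusive value ranges for each integer subtype letter.
INT_RANGES = {
    "c": (-2**7, 2**7 - 1),   "C": (0, 2**8 - 1),
    "s": (-2**15, 2**15 - 1), "S": (0, 2**16 - 1),
    "i": (-2**31, 2**31 - 1), "I": (0, 2**32 - 1),
}


def parse_b_array(value: str):
    """Parse a B-type value such as 'i,1,-2,3' into (subtype, list of numbers)."""
    subtype, *items = value.split(",")
    if subtype == "f":
        return subtype, [float(item) for item in items]
    if subtype not in INT_RANGES:
        raise ValueError(f"unknown array subtype: {subtype!r}")
    low, high = INT_RANGES[subtype]
    numbers = [int(item) for item in items]
    for number in numbers:
        if not low <= number <= high:
            raise ValueError(f"{number} does not fit subtype {subtype!r}")
    return subtype, numbers
```

So the first letter alone tells a C parser which element type to allocate before it reads any numbers.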

I've added this text to the GFA spec in 8554074


sjackman commented on September 24, 2024

I've clarified that segment sequence and link CIGAR are optional fields in e70a22d


richarddurbin commented on September 24, 2024

I think that if people are going to report the long contigs as contigs then they had better be represented as contigs in the final GFA file, that is in Segment records.

Richard


sjackman commented on September 24, 2024

A Path P record could have an accompanying segment S record giving its sequence for this use case.

Incidentally, it's possible to create a FASTA index (.fai) file of a GFA file (the GFA file itself, not an accompanying FASTA file) such that samtools faidx (and I presume other tools) can extract sequences from the GFA file! Maybe just a parlour trick, but I think that's pretty cool.
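For the curious, one way such an index might be built is sketched below in Python. This is a guess at the trick, not the actual method used: `gfa_to_fai` is a hypothetical helper, it relies on the five .fai columns (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH), and treating each sequence as a single line with LINEWIDTH = LENGTH + 1 is an assumption that I have not checked against samtools.

```python
def gfa_to_fai(gfa_bytes: bytes) -> list[str]:
    """Build .fai-style lines for the S records of a GFA file held in memory.

    OFFSET is the byte position of the first base of the sequence column.
    Segments whose sequence is '*' are skipped.
    """
    entries = []
    offset = 0  # byte offset of the start of the current line
    for line in gfa_bytes.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(b"\t")
        if len(fields) >= 3 and fields[0] == b"S" and fields[2] != b"*":
            name, seq = fields[1], fields[2]
            # Skip 'S', a tab, the segment name, and another tab.
            seq_offset = offset + len(b"S\t") + len(name) + len(b"\t")
            entries.append(
                f"{name.decode()}\t{len(seq)}\t{seq_offset}\t{len(seq)}\t{len(seq) + 1}"
            )
        offset += len(line)
    return entries


example_gfa = b"H\tVN:Z:1.0\nS\t1\tACGT\tRC:i:3\nS\t2\t*\tLN:i:5\nS\tctg3\tGG\n"
fai_lines = gfa_to_fai(example_gfa)
```

Slicing the file at each recorded offset for the recorded length gives back the segment's sequence, which is all an index-based extractor needs for single-line records.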


sjackman commented on September 24, 2024

After rereading my earlier responses, some seemed terse verging on rude. Sorry if I put anyone off! It wasn't my intent. I'm very excited to see such interest in GFA.

I'll be out of town at a friend's wedding from Friday until Tuesday. I look forward to catching up on the discussion when I get back.


sjackman commented on September 24, 2024

Ping Giorgio Gonnella (@ggonnella), author of the Ruby library for handling GFA files: https://github.com/ggonnella/rgfa


stale commented on September 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
