CC <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

minor re-edit, <a class="user-mention notranslate" data-hovercard-ty

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I've created PR <a class="issue-link js-issue-link" data-error-text="Failed to load ti

The future of the GFA format,about gfa-spec/gfa-spec

Comments (20)

thegenemyers commented on September 24, 2024

Your list of additional features in GFA2 is incomplete:

These is no ability in GFA to code the multi-alignment that gives rise
to a segment.
Any end-user of an assembly will want to see said. It is an important
and obvious
use case to anybody in the field.

Paths also allow the encoding of subgraphs, Jason has given use cases in
the discussion.

Local alignments cannot be modeled in GFA and that's what daligner and
in fact all long
read overlappers output initially.

I don't agree with you.

On 9/23/16, 7:58 PM, Heng Li wrote:

CC @thegenemyers https://github.com/thegenemyers, @richarddurbin
https://github.com/richarddurbin, @ekg https://github.com/ekg,
@rrwick https://github.com/rrwick, @sjackman
https://github.com/sjackman, @jts https://github.com/jts,
@pb-jchin https://github.com/pb-jchin, @skoren
https://github.com/skoren, @aphillippy
https://github.com/aphillippy, @MihaiPop
https://github.com/MihaiPop, @ggonnella
https://github.com/ggonnella, @pmelsted
https://github.com/pmelsted, @edawson https://github.com/edawson
The current status of GFA1
There were two general-purpose assembly formats: FASTG
http://fastg.sourceforge.net/ and GFA1. With David Jaffe et al's
SuperNova assembler apparently moving away from FASTG
http://support.10xgenomics.com/de-novo-assembly/software/pipelines/1.1/output/graphs,
GFA1 is practically the only generic assembly format. I have written
converters https://github.com/lh3/gfatools for ABySS, SGA, Velvet,
Spades, SOAPdenovo and fermi short-read assemblers. @jts
https://github.com/jts's SGA and @sjackman
https://github.com/sjackman's ABySS natively support GFA1 output, I
believe. @jts https://github.com/jts's fork of DALIGNER
https://github.com/jts/DALIGNER can emit GFA1. @pb-jchin
https://github.com/pb-jchin has written a converter
https://github.com/PacificBiosciences/FALCON/wiki/Convert-FALCON-assembly-graph-to-GFA-format
for FALCON. My miniasm and fermi-lite assemblers output GFA. I believe
the vast majority of mainstream assemblers are compatible with GFA1,
too. For tools working with variations, @ekg
https://github.com/ekg's vg supports GFA output and has an internal
data representation conceptually equivalent to GFA1 (vg effectively
implements a bidirected graph). SuperNova graph can be converted to
GFA1 (not implemented yet). DISCOVAR
https://software.broadinstitute.org/software/discovar/blog/ outputs
FASTG which can also be converted to GFA in principle (not
implemented, either). As to tools consuming GFA, @rrwick
https://github.com/rrwick's Bandage can visualize GFA graphs. I have
written gfaview https://github.com/lh3/gfaview to perform graph
transformation (e.g. transitive reduction, tip trimming and bubble
popping) for long-read graphs. @sjackman https://github.com/sjackman
has implemented similar transformations for short-read graphs in
ABySS. There are already a few libraries in C++
https://github.com/edawson/gfakluge and Ruby
https://github.com/ggonnella/RGFA to read GFA1. In conclusion, GFA1
is getting used. It is fairly simple yet general enough for all the
tools and applications mentioned above.
About GFA2


  The necessity
GFA2 was proposed because GFA1 does not work when we choose a path at
a fork to merge. This leaves an end-to-internal match, which can't be
described by GFA1. A few other hypothetical use cases (e.g. alignment
between two long haplotypes) have also been raised. So far, I am not
sure which implementations output and, more importantly, consume
end-to-internal or internal-to-internal matches (NB: containment is a
special case of end-to-internal alignment, but it can be described
with GFA1). I am happy to update this post if there are any.
  The GFA2 graph representation
While GFA1 models a directed skew-symmetric graph
https://en.wikipedia.org/wiki/Skew-symmetric_graph that is
topologically equivalent to bidirected graphs, overlap graphs and
string graphs, GFA2 models an undirected multi-graph where mapping
coordinates are playing a central role. They represent fundamentally
different types of graphs. Although we can see GFA1 as a special case
of GFA2, how to understand the graph will be distinct. GFA2 will also
have more complex syntax and implementations for what GFA1 is really
good at.
  My take
SAM/BAM is popular not only because they can store alignments, but
more because they enable us to do things to the alignments that would
be complex otherwise. Similarly, I see GFA is not just a storage
format; it should be a format that helps our analysis. Due to the lack
of clear downstream use cases and implementations of GFA2 (e.g. what
information do we want to extract from GFA2? how to?), I am unable to
evaluate the necessity of the added complexity, especially given that
vg and SuperNova can already achieve part of the GFA2 goal with the
GFA1 representation only. I am reluctant to add unproven features too
early.
The future of GFA
As the creator of the initial GFA1, I do not see GFA2 is ready to
replace GFA1. I foresee the coexistence of GFA1 and GFA2 for a period
of time. During this period, developers have to make a choice between
GFA1 and GFA2. We may re-evaluate the necessity of GFA2 yearly. I will
be happy to phase out GFA1 if GFA2 is proven to be useful with
concrete and practical applications. I understand the split is
unfortunate, but this seems an unavoidable cost when we explore the
unknowns.
  The future of GFA1
The current GFA1 spec was modified from a blog post. I would like to
replace it with something a little more formal like the one here
https://github.com/lh3/gfaview/blob/master/gfa2-ideas.md, with
further improvements of course.

I also want to ask developers on the CC list: how much do CIGAR on
L-lines and the lack of segment length on S-lines hurt? Personally, I
would really like to change the format, but if you all think you can
live with these issues, I am ok to keep them as they are. Once we
reach a consensus on this issue, we will try best to maintain the
compatibility of GFA1 going forward. We may have new line types, but
we don't break existing lines.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#49, or mute the thread
https://github.com/notifications/unsubscribe-auth/AGkkNpMTSf3w74tsyMmOxdGOpiA-wB73ks5qtBNigaJpZM4KFPE8.

from gfa-spec.

lh3 commented on September 24, 2024

Thanks a lot for the additions, Gene. All fair points.

These is no ability in GFA to code the multi-alignment that gives rise to a segment. Any end-user of an assembly will want to see said. It is an important and obvious use case to anybody in the field.

Yes, I agree. I am currently using a custom line for this information. We need to standardize that. We could expect GFA1.1 to have this along with Gap lines.

Local alignments cannot be modeled in GFA and that's what daligner and in fact all long read overlappers output initially.

Once GFA1.x has Match-lines, which I still think a good idea overall, we could choose to encode generic local alignments in GFA. Nonetheless, personally, I prefer to use another format for read overlappers instead of overloading GFA.

from gfa-spec.

pb-jchin commented on September 24, 2024

minor re-edit,

@lh3 I think I might get your idea on only considering the SSG. However, I fail to see that is indeed more consistent than the general cases proposed by Gene.

For example,

(1) to make SSG graph, in most pragmatic cases, it is necessary to trim the reads or sequences.

(2) there are many different SSGs from the same set of overlaps, we all know reduction/filtering is necessary.

I am worry about the push to have a "pure" SSG representation is going to harm the generality of this format for larger and more complicated genome assembly as we might not foresee all possible user cases.

from gfa-spec.

ggonnella commented on September 24, 2024

@lh3: you speak of GFA1.1 with M lines and gaps. I thought that would be 2.0.

What will then be the difference from 1.1 to 2.0? Adding a mandatory sequence length? Writing alignments as CIGAR or Tracepoints and in tags instead as positional fields? F lines? (The latter can be included in GFA1 easily as well, as it doesn't break the compatibility or SSG assumption).

from gfa-spec.

sjackman commented on September 24, 2024

@thegenemyers

These is no ability in GFA to code the multi-alignment that gives rise to a segment.

@lh3

Yes, I agree. I am currently using a custom line for this information. We need to standardize that.

When the reads are segments, and the assembled contig is a segment, the multi-alignment of the reads to the contigs can be stored in containment C records. An optional segment field could be used to indicate which segments are reads and which are contigs. Since you both agree, I feel that I must be missing something.

For short-read assembly, I'd probably put the read-to-contig alignments in a BAM file. For long-read assembly, I can see that you'd want this information in the GFA file.

from gfa-spec.

sjackman commented on September 24, 2024

@lh3

We could expect GFA1.1 to have this along with Gap lines.

@ggonnella

you speak of GFA1.1 with M lines and gaps. I thought that would be 2.0. What will then be the difference from 1.1 to 2.0?

I'd love to have gap G records in GFA 1.1. It would be very useful to ABySS. I had proposed #9 "Add link attributes DI and DE to specify the estimated distance between two sequences" to address this need, but I'm quite happy instead adding a dedicated gap G record.

I'd like to add all the new records and optional fields that don't change existing records to GFA 1.1. GFA 2.0 can then include the changes that do change existing record types.

from gfa-spec.

ggonnella commented on September 24, 2024

@sjackman a difference between F and C would be that F does not require both its components to be S lines; ie. one would be able to keep track of what reads lead to the contigs and align where, without including the reads in S lines or elsewhere in the GFA file.

from gfa-spec.

sjackman commented on September 24, 2024

Got it. Thanks for the explanation, Giorgio. Edge E records also have an edge ID, but fragment F records do not have a fragment ID.

from gfa-spec.

sjackman commented on September 24, 2024

I've created PR #50 to add the gap G record, inspired by @thegenemyers and @lh3, to GFA 1.0.

from gfa-spec.

richarddurbin commented on September 24, 2024

Hello Heng and others,

Thanks for your comments. I think the proposal to allow GFA1 and GFA2 to run in parallel is fine. I see this like GRCh37 and GRCh38, and hope that in time people realise the benefits of GFA2 and switch to it, in particular for new developments, just as I see people coming round to using GRCh38. I hope it doesn’t take so long!

Of course you are right that there is no public software yet for GFA2, but that is not surprising since it is new, only just officially formulated a couple of days ago. I am expecting that Gene and Jason and my group and I hope others will soon start making available code that uses it, and hence makes the case for using it.

I have a few comments about the claims you made in your post.

Skew-symmetry

First, you claim that GFA1 is a mathematically a skew-symmetric graph, while GFA2 is not. In fact my understanding is that both are skew-symmetric multigraphs. Here is the definition of a skew-symmetric graph from the wikipedia page https://en.wikipedia.org/wiki/Skew-symmetric_graph that you point to:

a skew-symmetric graph G is a directed graph, together with a function σ mapping vertices of G to other vertices of G, satisfying the following properties:
• For every vertex v, σ(v) ≠ v,
• For every vertex v, σ(σ(v)) = v,
• For every edge (u,v), (σ(v),σ(u)) must also be an edge.
One may use the third property to extend σ to an orientation-reversing function on the edges of G.

Clearly for both GFA1 and GFA2 the vertices v are segments, and the function σ(v) maps a segment to its reverse complement, which statisfies properties 1 and 2. For GFA1 the edges are defined as links, and in GFA2 we just call them edges. For both GFA1 and GFA2 we allow multiple directed edges from node u to node v. For GFA1 we can have 4 types of link between u and v, from the + or - end of u to the + or - end of v. More than one of these types can exist within the same graph, which makes GFA1 mathematically a multigraph. (This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers, in which case there would be many more possible edges.) For GFA2 we allow a wider range of edges between u and v, parameterised by an interval in u, an interval in v, and a direction of alignment. GFA2 is a multigraph like GFA1, but with a richer description language over its edges. Neither is a simple graph.

There is a secondary point about directionality. It is possible to define directionality as specified by the order on the line, but consensus seems to be that this implies the reverse edge, which if true would mean that edges in either form of GFA, while being labelled as to type, are not in fact directed. However, having said that, it does not seem that the directedness of a skew-symmetric graph is essential for many of its properties.

What matters for the automorphism is the associated set of rules on the links and edges. For GFA1 we have that if [u, x, v, y] is a link, with x and y denoting ends either + or -, and we write +’ = - and -' = +, then so are
[u, x, σ(v), y’]
[σ(u), x’, v, y]
[σ(u), x’, σ(v), y’].
I have left out the overlap numbers because these don’t transform. Note that these couple the explicitly defined graph and its sigma graph, so they are not disjoint. Also this does not include the reverse edges [v, y, u, x] and its sigma versions.

For GFA2, a canonical edge is [u, x, v, a, b, c, d] where a, b are coordinates in u and c, d are coordinates in v, and x is the relative direction + or -. Then we also implicitly have edges
[u, x’, σ(v), a, b, $d, $c]
[σ(u), x’, v, $b, $a, c, d]
[σ(u), x, σ(v), $b, $a, $d, $c]
where the $ operation is as defined in the GFA2 syntax, and of course $$a = a. (Note that for a GFA2 edge we always have a <= b, and c <=d, and if a < b then by definition $b < $a.)

For graphs in mathematical graph theory there are no sequences and internal coordinates for vertices, but both GFA1 and GFA2 require these to represent DNA sequences. Both GFA1 and GFA2 have a concept of alignment associated with their links/edges, as evidenced by the fact that both define end coordinates for those alignments and permit more detailed alignment description by CIGAR or (in the case of GFA2) a tracepoint array. Of course it has to be true that any skew-symmetric property of GFA1 has an exact equivalent for GFA2, because both represent in a well-defined fashion ways of connecting DNA sequences, which are inherently skew-symmetric, and for both any path/walk needs to have an implicit reverse complement. I hope you accept that this lays to rest the claim that one is more mathematically correct than the other. I can not see any basis to that claim.

Representational power

Second there are statements about representational power. I summarize my understanding of the differences:

GFA2 has a generalised concept of edge, as discussed above. I believe that the value of this will be apparent as soon as you explicitly deal with diploid assemblies or ones with higher ploidy or from multiple individuals (including metagenomics). Arguably this is already important for segmental duplications in a haploid assembly, for which it is incorrect to collapse bubbles in repeats. There is an alternative to keep links and the bubbles, and represent inferred contiguous sections of sequence in the assembly as paths/walks, as in vg and Erik’s proposal for walks, but this means that such inferred assembled sequences are not available as explicit segments and sequences, which is not my sense of what end users expect or want. (For internal representation purposes I am sympathetic to the vg representation with short nodes that join without overlaps at their ends, and longer sequences represented by walks, though actually I prefer single base uni-directional internal representation, but GFA1/2 are exchange formats, not internal formats.)
Gap records. These seem necessary to represent scaffolds, which are the end product of a number of the assemblers that you mention. Of course it is easy to add Gap records to GFA1, as I think Heng proposes.
Fragment records, which are syntactically parallel to its Edge records, but show the relationship between segments and external sequences, rather than to other sequences. These allow two things: connection to external well-known objects, such as annotation, and also the relationship to segments in a previous GFA file. I strongly would like to use GFA for modular operations on assemblies, including bringing additional information to bear on scaffolding and phasing, e.g. combining 10X Genomics or Dovetail data with PacBio or Oxford Nanopore data, plus also bringing in genetic mapping and other
information. To support a series of GFA files it would make sense to represent how the objects in each file in the sequence relate to/are derived from ones in
the previous file.
Subgraphs. This supports the ability to identify subgraphs, for example bubbles and more complex subperbubbles or equivalents, or repeat hairballs, in one operation, then resolve them in a second operation. I believe the representation of subgraphs and paths in GFA2 is clean. An elegant feature is to use a single id space for all components (nodes, edges, paths/subgraphs), which allows hierarchical construction of the path/subgraph objects. Edge ids can be needed to disambiguate edges in paths in the general case, given the more flexible edge definition - in GFA1 that was done with +/- symbols.

Syntactic differences

A) GFA2 added a requirement to specify the length of a segment. We think this is good, but clearly it is syntax not semantics.

B) GFA2 allows Tracepoint arrays as an alternative to cigar strings. This introduction from Gene gives control over the tradeoff between representation space and compute time to recompute alignments, which we think is valuable for long inexact read overlaps/matches.

C) GFA2 uses space as the separator for paths and subgraphs. This is clean given that it allows segment identifiers to contain any printable non-whitespace character, as in the original GFA1 definition. I would argue that once you allow arbitrary alphanumerics you would like to allow segment identifiers to use at least the minus ‘-‘ character, since this is used regularly in representations of intervals and so definitions of sequence fragments.

D) GFA2 allows multiple header records. This makes it easier to support richer information, including adding header records progressively as a series of operations are carried out on GFA files, recording what was done and perhaps additional relevant information.)

Conclusion

Overall I think GFA2 is more powerful in ways that we have good cause to want, and is cleaner. My sense is that the real disagreement is about whether the generalisation of Link to Edge is a good thing, and I accept that is the biggest semantic change. Of course it is possible for GFA1 to adopt any of these features, and I am seeing a tendency in this direction, with even Heng’s proposal for Matches allowing representation of Edges. It would be a bit disappointing for the GFA1 community to reject GFA2 but then piecemeal adopt all its features, with more clumsy syntax.

Richard

from gfa-spec.

richarddurbin commented on September 24, 2024

Ah. I now see the mapping that Heng and others were seeing to the skew-symmetric graph, which in fact I had seen before, but missed when I was writing my last message. Consider a node u as the forward strand, with σ(u) as the implied reverse complement strand, and a directed edge as succession in the 5' to 3' direction.

Then you map link L u + v + to edge (u,v) and implicitly (σ(v),σ(u))
L u + v - to edge (u, σ(v)) and (v, σ(u))
L u - v + to edge (σ(u), v) and (σ(v), u)
and L u - v - to edge (σ(u), σ(v)) and (v, u)

So I concede that this is a clean skew-symmetric graph as defined. Apologies.

There are still corresponding skew-symmetric mappings for the forward and reverse strands of the GFA2 graph, but it is true that in this case it is a multigraph not a simple graph.
An E u + v a b c d definition creates edges (u,v;a,b,c,d) and (σ(v),σ(u),$d,$c,$b,$a),
while an E u - v a b c d definition creates edges (u,σ(v);a,b,$d,$c) and (v,σ(u);c,d,$b,$a).

from gfa-spec.

pb-jchin commented on September 24, 2024

Here is an prototype script converting FALCON internal sg_edges_list to the proposed GFA2 format. https://gist.github.com/pb-jchin/031af0dfc8b0b33ef1c5e039c142c194

Note that I use a somehow non-so-inituive convention. The generated coordinate pairs are counted from different directions as the faith information in sg_edges_list is the overhangs not the overlaps. The sg_edges_list file along does not contain all length information of all segments. However, because the introduction of a coordinate system that allows us to count from both ends, the overlap information can be represented by the overhang information faithful without the knowledge of the segment lengths.

Here is a short list of example of the E records:

jchin@flash:/pbi/pure/jchin/zebra_fish/2-asm-falcon_LA4F1_l18000
$ less tmp.gfa2 | awk -F"\t" '$1 == "E"'  | head -20
E       *       007400352       +       007412148       7783    $0      0       $11558  T:Z:TR
E       *       005894898       +       003236163       0       $17642  6888    $0      T:Z:TR
E       *       005772945       +       001021964       0       $26812  1915    $0      T:Z:G
E       *       003160281       +       005219341       0       $8994   17336   $0      T:Z:TR
E       *       001291379       -       004608073       5620    $0      1585    $0      T:Z:TR
E       *       002265113       -       000960439       0       $15624  0       $18101  T:Z:TR
E       *       000343915       -       002629794       8573    $0      8305    $0      T:Z:TR
E       *       001615147       +       006128718       0       $3332   36068   $0      T:Z:TR
E       *       000019758       +       006972566       17766   $0      0       $6237   T:Z:TR
E       *       007336673       +       004486597       0       $17833  3636    $0      T:Z:S
E       *       001991281       +       007408577       20842   $0      0       $2584   T:Z:TR
E       *       006361879       -       002328635       10904   $0      6394    $0      T:Z:TR
E       *       000514330       -       006646516       18839   $0      33601   $0      T:Z:R
E       *       007015038       -       000241901       19910   $0      16613   $0      T:Z:TR
E       *       005891381       -       004113339       0       $4156   0       $16210  T:Z:G
E       *       000387088       -       005507551       0       $20115  0       $26149  T:Z:TR
E       *       003872285       +       003629580       10855   $0      0       $11663  T:Z:TR
E       *       005658068       +       001248572       0       $7623   10933   $0      T:Z:TR
E       *       001544068       -       006537270       0       $3601   0       $11285  T:Z:TR
E       *       007240756       -       006527774       1051    $0      11264   $0      T:Z:G

EDITED: in my initial code, I forget the eid field.

from gfa-spec.

lh3 commented on September 24, 2024

I mainly focus on the mathematical aspects of GFA1 and GFA2 below.

@richarddurbin Your representation is similar to the one I used in the miniasm paper. It is correct, but in my view is a little confusing in the context of GFA1. This is because GFA1 does not keep the complement segments (SPAdes and velvet keep both strands, so your representation is cleaner for them). For GFA1, I usually take a (segment,strand) or (segment,end) tuple as a vertex. An L-line

L  segment1  strand1  segment2  strand2

explicitly denotes a directed SSG edge (segment1,strand1)->(segment2,strand2). Similarly, GFA1 path A+,B+,C- denotes a standard graph path (A,+)->(B,+)->(C,-) in case of simple graphs.

In GFA1, CIGAR or overlap lengths are not indispensable attributes, in the sense that graph visualization and many graph operations (e.g. graph traversal and bubble popping etc) often ignore overlap CIGARs. The graph topology itself is the most important. In contrast, in a GFA2 graph, even the most basic operations require to read coordinates. The topology itself is not very useful in general.

There are still corresponding skew-symmetric mappings for the forward and reverse strands of the GFA2 graph, but it is true that in this case it is a multigraph not a simple graph.

Yes, there is a kind of skew-symmetry in GFA2 in that every edge has a complement edge. However, this doesn't make GFA2 an SSG. Suppose there is a local alignment between [10,20) on A and [15,25) on B. The alignment leads to two edges e1=(A,+,B,10,20,15,25) and e2=(B,+,A,15,25,10,20), and four possible ways to walk through the twp edges: 1) A:0->e1->B:$; 2) B:0->e2->A:$; 3) A:$->e2->B:0 (could we write e1 here?); 4) B:$->e1->A:0. This suggests general GFA2 edges have no directionality -- there should be only two ways to walk through two directed edges. Thus GFA2 can't be a SSG because SSG is a directed graph.

from gfa-spec.

lh3 commented on September 24, 2024

an E u - v a b c d definition creates edges (u,σ(v);a,b,$d,$c) and (v,σ(u);c,d,$b,$a).

To find the complement edge is more complicated in this case. Because the current GFA2 uses coordinates on the forward strand, you need to use "length - b" etc to derive the complement coordinates. This is where I like @pb-jchin's syntax better. It is straightforward to derive the complement in his syntax, and thus also makes it obvious that every edge has a complement.

@pb-jchin, thanks for your falcon output example. My falcon2gfa (option -S) converter currently writes such GFA1.x:

L       000222720       +       000070307       -       :13733  L2:i:10378
L       000070307       +       000222720       -       :13792  L2:i:12661

13733 is the corrected overlap length of 000070307; 10378 is the overhang length of the same segment. There are no S-lines as segment lengths can all be inferred. "gfaview" is like a formatting tool. It combines the two lines into

L       000222720       +       000070307       -       13792:13733     L1:i:12661      L2:i:10378

In a future version of GFA1, I will replace CIGAR with an overlap field, which can be a CIGAR or two overlap lengths separated by colons. It is just so awkward to work with CIGARs for long reads.

from gfa-spec.

sjackman commented on September 24, 2024

Hi, Richard.

Clearly for both GFA1 and GFA2 the vertices v are segments, and the function σ(v) maps a segment to its reverse complement, which statisfies properties 1 and 2.

A vertex in GFA1 is not a segment, but a two-tuple of a segment ID and an orientation, for example, (A, +), which is shortened to simply A+ for convenience. σ(A+) = A-, and σ(A-) = A+ . Each segment is associated with two vertices in the SSG.

In a Boolean logic implication graph, a type of skew-symmetric graph, each Boolean variable has two vertices associated with it: "This graph has a vertex for each variable or negated variable" from https://en.wikipedia.org/wiki/Skew-symmetric_graph#Satisfiability In GFA1, each segment has two vertices associated with it.

For GFA1 the edges are defined as links, and in GFA2 we just call them edges. For both GFA1 and GFA2 we allow multiple directed edges from node u to node v. For GFA1 we can have 4 types of link between u and v, from the + or - end of u to the + or - end of v.

Where each vertex is a tuple of a segment and orientation, only two edges between two vertices u and v are possible.

u -> v
v -> u

Very import is that u -> v and u -> σ(v) are different edges between different vertices. If this were not true, GFA1 would be a multigraph.

More than one of these types can exist within the same graph, which makes GFA1 mathematically a multigraph.

GFA1 is a strict directed graph, not a multigraph. A+ and A- are distinct vertices. σ(u) maps u to a different vertex. σ(A+) = A-, and A+ and A- are different vertices, that happen to be associated with the same segment A.

A+ -> R+
A+ -> R-

is not a multigraph, because R+ and R- are different vertices.

(This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers, in which case there would be many more possible edges.)

I'm making the same assumption.

Cheers,
Shaun

from gfa-spec.

pb-jchin commented on September 24, 2024

Likely a TL;DR; post for a record

I think it is useful to think where the "symmetry" coming from. (side note: As an ex-physicist, symmetry in nature was one of the most important subjects in my training. It is always fun to think about it.)

Let's simplify the problem a bit first. Let's assuming (1) for any two segments, there is only one unique dovetail overlap between them and (2) there is no reverse complement relationship between the strings. Under such conditions, an assembly graph should be pure directed graph without any skew-symmetry as there is only one direction that one can read the sequence out. However, while there is only one overlap but there are two ways to describe it: sid1 -> sid2 or sid2 <- sid1. Note that these are just two equivalent description of the same overlaps. The order of the sequence ids is artificial.

In such case, if we use string graph vertex / edge definition in Gene's string graph paper, we create 4 vertices and two edges:

edge 1:  sid1:E -> sid2:E   # L sid1 + sid2 +
edge 2:  sid2:B -> sid1:B   # L sid2 - sid2 -

However, the edge 2 is totally useless in such assembly problem as we don't read the sequence backward. We don't need the skew-symmetry for assembling such "genome". (In fact, the - has no meaning here as we don't have reverse complement sequence in the current assumptions.)

Let's get back to the real DNA case. For real DNA, the assumptions are (1) for any two segments, there is only one unique dovetail overlap between them and (2) a reverse complement (RC) operation is defined and we can traverse the sequence in it's normal direction or the reverse complemented one. In such case, edge 2 becomes necessary to satisfy (2) for RC traversal. However, (1) is still the fundamental invariant (and where the symmetry coming from) between the two sequences. There is still only one overlap between two segments no matter which direction that we want to read the sequences. As long as the set of overlaps is recorded, you can derive related symmetry generated by the RC operations. In my opinion, GFA.2 proposal focuses (1) but Heng's GFA.1 and current FALCON graph (= directed string graph) have more focus on express (2) explicitly.

Here is an interesting thought experiment. Given a set of fragments to be assembled, we create the RC segments but we just forget about the relation between each pair. (We effectively double the coverage this way.) But, in the overlap stage, we do not allow reverse complement overlaps. While the resulting assembly graph is skew-symmetrical, we simply don't know about it. However, we can still traverse the graph to get assembly. There will be just some caveats that I can think of now: (1) one will get assembly of both the normal direction sequences and RC sequences at once (assembly size = twice of the size of the genome) and (2) palindrome or inverted repeats will make the graph tangled in some way that can cause incomplete assembly. Anyway, while skew-symmetry is a nice property for getting a good assembly, it is not necessary required to get an reasonable assembly.

Ultimately, the fundamental information is from the dovetail overlaps and removing containment related overlaps. All other properties can be derived from the overlaps. One can easily define an operator acting on the GFA.2 graph to satisfy a skew-symmetry condition with additional vertices representing the reverse complemented segments. However, one can argue that it is not necessary as the L record in GFA.1 can be derived easily from the E records of GFA.2. However, there will be other cases the one can encoded in an E record that is not possible to be expressed in a L record.

from gfa-spec.

ggonnella commented on September 24, 2024

The thread is indeed very long. I tend to agree with @lh3 that the topology in a GFA1 graph is its most important information, while for GFA2 this is not true anymore. I also think that both formats can co-exist, at least for a while.

However, one could work, to make the formats as similar to each other as possible, without breaking the compatibility with existing GFA1 implementations, without requiring changing its syntax, and without interfering with its topology coded in S/L records.

In particular, one could support in GFA1:

G lines for gaps (see #50)
F lines for relating S sequences to external sequences
other alignment formats for the L/C/P overlap fields: @lh3 mentioned a future version in which he would add alignments as pairs of lengths on the two S. I think also tracepoints alignment could be actually supported in GFA1 without changing the syntax (maybe using another separator rather than comma, as this would create problems in the syntax of the P lines Overlaps field).

from gfa-spec.

lh3 commented on September 24, 2024

I have a different subjective interpretation: the skew symmetry is originated from DNA strands. The begin-end symmetry in a string graph is the result of the strand symmetry; without the strand symmetry, there would be no string graph.

General GFA2 has two symmetries: the skew symmetry caused by DNA strands, and the "undirectional" symmetry: if there is v->w, there is w->v. As a result, a general local alignment below implies four edges.

They are:

v1→v2: l1→aligned→r2
σ(v2)→σ(v1): r2→aligned→l1
v2→v1: l2->aligned→r1
σ(v1)→σ(v2): r1→aligned→l2

In fact, we can define undirected skew-symmetric graph (uSSG) as follows:

For every vertex v, σ(v) ≠ v,
For every vertex v, σ(σ(v)) = v,
For every undirected edge <u,v>, <σ(v),σ(u)> must also be an undirected edge.

Then general GFA2 is a uSSG, where a vertex is (segment,strand) and an edge is defined by an Edge-line. In case of dovetail overlaps, l2=r1=0. We don't regard v2→v1 and σ(v1)→σ(v2) as edges any more. This breaks the "undirectional" symmetry and leads to a directed SSG, which is GFA1.

I think the uSSG interpretation of GFA2 helps to understand general graph traversals, but this does not really make our life easier as we still have to deal with coordinates.

from gfa-spec.

sjackman commented on September 24, 2024

Here is an interesting thought experiment. Given a set of fragments to be assembled, we create the RC segments but we just forget about the relation between each pair.

This idea is the inspiration for the skew-symmetric overlap graph of ABySS. It has two vertices for each segment (one for each strand) and two edges for each overlap (ditto). The algorithms that navigate that group and do not modify it (topological sort, for example) consider only the directed graph and have no idea that it's SSG.

When the graph is modified, a helper class maintains skew-symmetry so that whenever vertex A+ is removed, vertex A- is also removed. Ditto for edges, when A+ -> B+ is removed, B- -> A- is also removed.

As an implementation detail, the complementary vertices share the same sequence string in memory with a flag to indicate whether it's reverse complemented so that the sequence is not duplicated in memory.

from gfa-spec.

sjackman commented on September 24, 2024

The GFA 2 PR #48 has been merged.

from gfa-spec.

The future of the GFA format about gfa-spec HOT 20 CLOSED

Comments (20)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent