Hi everyone, if people aren't yet sick of discussing the finer points of GFA implement

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Pall <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url

<a class="issue-link js-issue-link" data-error-text="Failed to load title" data-id="16

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Paths in GFA v2,about gfa-spec/gfa-spec

Comments (79)

MihaiPop commented on September 24, 2024 2

I don't really understand the hang-up about a "true" graph as far as graph theory is concerned. A "true" graph has nodes and edges - what these represent is at some level irrelevant to graph theory. GFA2 is as much a graph as GFA, as is a multi-graph where two nodes could have two or more edges between them.

I think part of the problem here is that there is an implicit assumption of the the true graph that we want represented is the "graph at different stages in the execution of an assembler". I don't think the latter is a well defined concept, but the underlying thread implies that the true graph is what you might see when assembling a haploid genome, using unpaired reads, and using either a de Bruijn or string graph paradigm.

Part of the reason why Gene proposed GFA2 (if I infer the reasoning correctly) is that most genomes we work with today are not haploid.

The reason I care about how the format evolves is because in my world the data being assembled is (highly) poly-ploid. All the unusal things that one has the luxury to ignore when assembling human genomes occur in metagenomic data, and are likely to occur in tumor samples as well.

If we stick to a format that works "well" for the current narrow use-cases, we'll have another discussion about GFA3 and 4 and 5 in the next couple of years, as we start figuring out there are all these other interesting questions we may be able to ask.

I think that what Gene proposes is fairly future-proofed, and I'd like to make sure that GFA2 will be able to encode a rich set of situations that may or may not occur in today's data but will be critical for making sense of upcoming data sets. Can we try to focus on this bigger picture rather than more nitty gritty details of implementation/efficiency, etc.?

I'm not sure we've fully agreed on all the relationships we will need to encode yet, and whether GFA2 is sufficiently expressive to encode all of them. We can then discuss whether additional bells and whistles would make the format easier to parse for specific things like dovetails, or for converting to a graphical interface.

from gfa-spec.

MihaiPop commented on September 24, 2024 1

My impression of what GFA2 is trying to achieve is to create one graph format that can be used to encode a broad range of situations that arise in assembly. Gene- can you pitch in here?

I'm concerned about making things too specific - the history of genome assembly is littered with file formats that were specific to a particular application or software tools. For what it's worth, variants, mate-pairs, optical/nano-coding maps, etc. are a key part of the information that comes out and goes into the assembly process and the format we choose should be rich enough to allow all such information to be encoded, rather than pushing the ball down the line (one format for basic assembly, another one for diploid, another one for metagenomics, another one for scaffolding, etc... )

from gfa-spec.

lh3 commented on September 24, 2024 1

@sjackman Your message reminded me of my earlier proposal, where I called E-lines as M-lines. The advantage of my proposal is we don't take "E" as edges, a concept very specific to graph. Instead, we take "E" as general relationships. Then I would not worry about encoding two different types of graphs in GFA – because GFA only encodes SSG. A revision to my earlier proposal would be to rename "M (mapping)" to "R (relationship)". I agree with Gene that "mapping" often implies other things. This way I would be happy to have general relationships in GFA, as GFA is still mathematically consistent.

Accordingly, "P (path)" is reserved for paths through L-lines only. We may introduce "T (thread)" to describe "paths" through both L- and R-lines. In implementation, I expect the supposed libgfa to only store R and T lines but do not support advanced queries for the time being.

from gfa-spec.

sjackman commented on September 24, 2024 1

Segment `S` record

It would help I think to separate the concepts of fixed-position fields and tagged fields, from mandatory fields and optional fields.

A fixed field can be mandatory or optional. An optional fixed field that is omitted is replaced with *.
A tagged field can be mandatory or optional. An optional tagged field is simply omitted.
Currently we have only optional tagged fields, but there's no reason a tagged field couldn't be mandatory.

I suggest and in fact wholly support that the length LN tagged field of the segment S record should be a mandatory field when the sequence is omitted, or alternatively, simply always mandatory. ABySS for example always outputs the length LN field.

The current text is

When the sequence is not stored in the GFA file, its length may be specified using the LN tag, and the sequence may be stored in an external FASTA file.

I suggest changing this to either

When the sequence is not stored in the GFA file, its length must be specified using the LN tag, and the sequence may be stored in an external FASTA file.

The length must be specified using the LN tag. When the sequence is not stored in the GFA file, the sequence may be stored in an external FASTA file.

from gfa-spec.

sjackman commented on September 24, 2024 1

Pall @pmelsted has suggested that we move this repository from its current location
https://github.com/pmelsted/GFA-spec
to
https://github.com/GFA-spec/GFA-spec
The current location will redirect to the new location. Is everyone happy with that?
Give a 👍 if so.

from gfa-spec.

iminkin commented on September 24, 2024 1

#30 ?

from gfa-spec.

rrwick commented on September 24, 2024

@lh3 @sjackman @thegenemyers @ekg @pmelsted @pb-jchin

(Sorry if I've notified people twice on this one - GitHub's spam filters briefly labelled me as a robot and deactivated my account, so I don't think the first one got through.)

from gfa-spec.

sjackman commented on September 24, 2024

2 - CIGAR-free paths

The CIGAR field of the path P record is optional and can be replaced with an asterisk *, indicated by its regex \*|([0-9]+[MIDNSHPX=])+. I've committed 4fb2dd0 to make it more clear that its optional.

from gfa-spec.

sjackman commented on September 24, 2024

1 - Start/end positions

Do you have a plan to actually use this feature in an implementation, or are you anticipating a need?

from gfa-spec.

sjackman commented on September 24, 2024

3 - inexact matches in overlaps

Many more than two segments will overlap at each base of the path when the path is overlapping reads rather than overlapping contigs, so determining the consensus sequence of a path is not trivial. The path itself is not enough to define the sequence of the resulting merged sequence. To me, the path P records are the result of the layout stage of OLC, and the consensus sequence has not yet been decided. The consensus tool would determine the sequence of each path. I propose that the consensus tool would output a corresponding segment S record for each path with the same name as the path to define its sequence.

P path1 A+,B+ *
S path1 CATGCTT

from gfa-spec.

lh3 commented on September 24, 2024

CCing @ekg.

Another option to express paths is to take the idea from AGP, a golden path format. Basically, a golden path is a sequence composed of sub-segments. A golden path line D could look like:

S  A  150  *
S  B  160  *
L  A  +  B  +  50  55  # there is an overlap involving last 50bp of A and first 55bp of B
D  path1  A+,0,150  B+,55,160

which concatenate sub-segments together. This way, we don't need to care about CIGAR and inexact matches. Another format for D is something similar to BLAT's PSL:

D  path1  A+,B+  0,55  150,160

I don't have a strong opinion towards either format of D.

EDIT: perhaps the second is better, because we can allow default starts and ends like:

D  path1  A+,B+  *  *

from gfa-spec.

pb-jchin commented on September 24, 2024

@lh3 again, I just like to point out something, if we think A+ and B+ are the vertices in the SSG, then there is only one edge(one sequence) in between. However, when you layout the example above, you have two sequences. There is some interesting fact. Whether you need to include sequence A+,0,150 will depend on whether there are other edges connect to A+. If you take the SSG literally, the edge should only have sequence from segment B. This is quite subtle, and it took me some time to think it through.

from gfa-spec.

sjackman commented on September 24, 2024

If the goldenpath idea is popular, I'd suggest adding optional start and end attributes to the path P record:

P path1 A+,B+ * ST:B:I,0,55 EN:B:I,150,160

The type B:I is an array of unsigned 32-bit integers (uint32_t[]).
See https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#optional-fields

from gfa-spec.

sjackman commented on September 24, 2024

@rrwick

So should the path sequence be CATGCTT or CATACTT? Here are a few brainstormed solutions:

Another option is to use an ambiguity code, CATRCTT, when the GFA file doesn't specify the consensus sequence of the path.

from gfa-spec.

ekg commented on September 24, 2024

@lh3 how would this work in the case of multiple overlaps on the same node
end?

from gfa-spec.

lh3 commented on September 24, 2024

@ekg, you will have different starts and ends. Example:

12345678901234    # coordinate
ACGTGATGAT        # read1
       GATGCTA    # read2, 3bp overlap
    GATGcTA       # read2, 6bp overlap, with 1 mismatch

If the golden path goes through the 3bp overlap, you get:

D  A+,B+  0,3  10,7

If goes through the 6bp overlap, you get one of them:

D  A+,B+  0,6  10,7
D  A+,B+  0,0  4,7

These are two different golden paths with different sequences. The bad thing about golden paths is there are multiple paths to spell the same sequence; the good thing is the sequence spelled from the path is clear – unlike CIGAR which does not specify how to resolve mismatches/gaps in the overlap.

I am just throwing ideas. I am not very certain "golden path" is a good thing to have in GFA.

from gfa-spec.

ekg commented on September 24, 2024

@lh3 I am considering graphs with many paths. This would seem to be difficult for the case of multiple paths entering the same overlap. In a way, the overlap feature hides the fact that we haven't fully resolved the sequence space of our graph, which can make it difficult to describe how things walk through the overlap.

I would prefer to separate the graphical model from the description of how reads walk through it. I think of the graphical model (S and L namespaces in GFA) as a precise regular language that is a superset of the sequences from which we have built the assembly, which ideally could be represented as paths (P namespace).

An extension to this would then allow the paths to include edits against the graph, so that they would not need to be "fully embedded" in the graph. In vg, this feature allows us to describe alignments and paths embedded in the graph using the same data model.

I am not sure this is the right place for this discussion, given the ongoing conversation about GFA2. My major concern now with the developments here is that GFA and other graphical formats may focus too strongly on the overlap graph use case, at the expense of simpler and more precise models. I understand that overlap graphs can be much easier to obtain than the full resolved string graph (with blunt ends rather than overlaps).

The situation here reminds me a lot of how VCF was designed to provide features useful when calling SNPs and indels. It embedded basic assumptions about the use case which have prevented its generalization and continuously caused many conceptual problems when we started to think about larger variation. Given that these issues are one motivation towards our exploration of graph based formats, I think we should be careful that we don't replicate the previous pattern here and over-emphasize use cases at the expense of simplicity and generality of our interchange model.

A concrete example related to the GFA2 discussions: If we recorded contig/read structures in the graph as embedded paths, the overlap information provided either in L (dovetail/end only overlaps) or E (side graph like joining + L) would be precisely defined by the relationship between paths in the graph.

This drawing would describe a possible case where three sequences are overlapping in two L records, as well as how we would represent this as paths over a fully-determined graph. I can describe how the same pattern would work for E records (side graph) but I think it should be clear from this:

The first graph is meant to represent an overlap graph. The blue, magenta, and yellow bits are parts of the overlap that are distributed in the overlap-free graph. The other colored pieces are the sequences (red, blue, green) in both the overlap and overlap-free graph.

So I am aware that it is unlikely that we would move radically to a model like this, but given the discussion I thought I should mention what seems to be the "easiest" object for us to work with in resequencing analysis so that this use case isn't forgotten.

from gfa-spec.

sjackman commented on September 24, 2024

Both graphs that you show can be represented in GFA—GFA allows for segments that are abut. It's possible to convert the overlap graph (which would be output by overlap alignment tools) to the corresponding string graph. For your application, that string graph may be more convenient than the overlap graph for representing paths of the original sequences (reads) threaded through the graph.

from gfa-spec.

lh3 commented on September 24, 2024

@ekg, first of all, I would like to clarify the word graph. In graph theory, a graph is a 2-tuple G=(V,E). You can clearly identify vertices and edges in a graph. String graph, overlap graph and skew-symmetric graph (SSG) are real graphs. Bidirected graph, as it is equivalent to SSG, is also a graph. Jouni's base-pair graph is a graph, too. The original GFA exactly models a SSG. It is truly a graph format. On the other hand, a lot of "graphs" we talk about in other context are not really (mathematical) graphs. For example, side "graph", insert "graph" and "E"-connected "graph" are not graphs. You can't clearly identify vertices and edges making sense in graph theory. To make distinctions, I will call these "graphs" as diagrams in the following, and reserve graph for the graph described in graph theory. A graph is a diagram, but a diagram is not necessarily a graph.

My major concern now with the developments here is that GFA and other graphical formats may focus too strongly on the overlap graph use case, at the expense of simpler and more precise models.

The initial purpose of GFA is to model an assembly graph for different stages of assembly wherever the graph is equivalent to an SSG. Your concern arises mostly because we are deviating from the initial purpose: we are now pushing GFA to describe variant diagrams. In fact, my personal preference is to separate assembly graph and variant diagrams –– we design a new format specifically for variant diagrams. In the new format, we are free to trash features important to assembly. I would virtually prefer to get rid of overlap in this variant diagram format as well. Note that common operations on GFA and the new format will also be distinct. For GFA, my initial purpose is to have common modules for transitive reduction, bubble popping, etc. For variant diagram, we are more interested in queries such as "give me branching points or variants in [1000,2000) on contig xyz". This makes the data structures to effectively store the graphs different, too.

This drawing would describe a possible case where three sequences are overlapping in two L records

Your top figure is not an overlap graph, as the red and green don't have a dovetail overlap.

So I am aware that it is unlikely that we would move radically to a model like this, but given the discussion I thought I should mention what seems to be the "easiest" object for us to work with in resequencing analysis so that this use case isn't forgotten.

Regardless of your top figure, your bottom figure is a GFA graph. If there is an effective way to create blunt-end graphs (that is another topic), most assemblers produce such a graph. I haven't dealt with this myself –– my understanding is Gene and Jason prefer not to output such a graph as the final assembly. They apparently want some segments to explicitly represent full-length contigs. This pushes the GFA graph to a diagram.

This would seem to be difficult for the case of multiple paths entering the same overlap. In a way, the overlap feature hides the fact that we haven't fully resolved the sequence space of our graph, which can make it difficult to describe how things walk through the overlap.

I finally see why you need CIGARs on paths. I have a different view. Mismatches and gaps in an overlap are errors rather than variants. A perfectly clean string/overlap graph should only have exact overlaps. Then you would not need CIGARs. To push this even further, David Jaffe argued a truly final assembly graph should have no overlaps. I gradually come to agree with him.

from gfa-spec.

thegenemyers commented on September 24, 2024

Heng,

I showed you some time ago that the GFA2 spec with E-edges did indeed code a multi-graph.
I even gave you an example with a picture of the graph. Why are you now saying it is not a graph?

Furthermore, I spent just 10 minutes thinking about how to draw a GFA2 graph with segment lengths and saw how to do it and that it is not complex. Limiting the situation to just L-edges is really not as critical as y'all seem to think, but like I said, I'll acquiesce and keep it in the standard for those that can't understand the generalization or how to draw it.

from gfa-spec.

pb-jchin commented on September 24, 2024

@thegenemyers @lh3 As far as I can see, Heng Li's point is the L records encode the SSG (which is equivalent bidirected string graph). However, we should accept that other kind of graphs might be useful too.

from gfa-spec.

lh3 commented on September 24, 2024

I have said "yes" to general edges, which means I can live with other kind of "graph" in GFA. I know you think these other representations are useful.

But from a purely theoretical angle, I am not sure GFA2 is encoding a graph in the classical sense any more. Very basic questions: what are vertices in GFA2? Are they segments, the (segment,strand) 2-tuple like in GFA1, or each base? In the generic case:

what is an edge? Can we write this edge into a form of v --> w? How is a path defined? Is v --> w && w --> x still implying a potential path v --> w --> x?

I reiterate that I can live with other "graphs" in GFA, but it would still be good to understand whether these other "graphs" are really graphs as defined in graph theory.

from gfa-spec.

thegenemyers commented on September 24, 2024

Heng, very sorry, the example I though you saw was on an issue thread at my fork which only Ryan and Shaun apparently participated in. What I wrote at the time:

Hi Ryan, Shaun,

GFA2 still encodes a graph, its just an undirected multigraph with E-type edges and G-type edges. Your issue arises when you want to interpret a segment as a line segment. If you think of a segment as a dot as in a traditional graph, then E-edges are just edges between the dots, and 2 dots may have several edges between them e.g. the After picture is the graph:

So the question isn't "is it a graph", as it is, but rather, "what do I draw" when I want to represent each segment as a line that has length (versus a zero-width dot). A simple proposal would be that one could simply ignore all E edges that are not dovetails. I have a suggestion to the format for E edges that will make it trivial to identify which represent dovetails, containments, or general edges. I will propose this in issue #33 shortly. The collection of dovetails, when considering segments to have length, can be given direction as your prefer.

Shaun wants a different specification for dovetail, containment, or general edges. I prefer one consolidated, general concept in which the cases that would be distinguished by a different letter in the first column are just as easily distinguished from looking at the E-line. Please see my forthcoming proposal in pmelsted#33 (comment)

from gfa-spec.

lh3 commented on September 24, 2024

@thegenemyers Sorry, I missed that thread. I can see you are defining a segment as a vertex. This, however, leads to two inconsistencies. First, in GFA1 graph, a vertex is (seg,strand). GFA2 starts to encode two types of graphs. Second, in your graph, B,1,A,3,C is a legitimate path, but if you look at the coordinates, this "path" is invalid in that it does not spell a sequence (right?). In general, your graph will have many invalid paths.

@MihaiPop My major concern is that we are mixing two types of information in one format. To an inaccurate analogy, GFA1 is like SAM, keeping the "raw" assembly that contains all the information we need but in a hidden way; GFA2 is like VCF, which is derived from the raw assembly and makes polymorphisms explicit (PS: there is not really a concept of "polymorphism" or "haplotype" in the raw assembly). GFA1 and GFA2 have different graph representations, different applications, different query patterns and will probably have separate implementations. I think it would be theoretically cleaner to separate E-lines into a new sister format (e.g. Graphical Ploidy/Polymorphisms Assembly format?). In the new format, I would entirely agree with Gene that L is redundant and useless to encode variants.

Can we try to focus on this bigger picture rather than more nitty gritty details of implementation/efficiency, etc.?

My experience is that a successful format should be accompanied by some implementations during design. Something trivial on the paper may be hard to implement; something hard initially may turn out to be a small trick in implementation. The feedback from implementation is hugely valuable. My view is in format and data model design, implementation is almost everything. I have learned both positive and negative lessons on this topic.

It would be good if people favoring one format only could give C structs and APIs about how to query (e.g. give me all variants in contig123:1000-2000? find all paths through two positions?) and manipulate the GFA2 graph. I tried but failed to achieve anything clean to make myself happy. That may be due to my lack of experiences, though.

from gfa-spec.

sjackman commented on September 24, 2024

First, in GFA1 graph, a vertex is (seg,strand).

The vertices of E edges can still be (seg,strand) tuples. The big difference is that E edges are undirected. For example,

E U 1 5 + V 2 4
E W 1 5 - X 2 4

The first line encodes the complementary undirected edges
(U+,V+) and (U-,V-)
and the second line encodes the complementary undirected edges
(W+,X-) and (W-,X+)

from gfa-spec.

pb-jchin commented on September 24, 2024

First, in GFA1 graph, a vertex is (seg,strand).

just to point it out, as far as I know, a vertex in the skew-symmtrical string graph is not (seg, strand) but (seg, end_point_of_the_seg). So, technically, GFA1 does not really encode the SSSG. Does it really matter? Not really. It is easy enough to covert one to the others and they are mathematically isomorphic or homomorphic. Same for converting dovetail E records to the SSSG vertices and edges.

from gfa-spec.

lh3 commented on September 24, 2024

@sjackman You can think that way, but edges are still different. Also, you still have E-paths that do not spell any sequences. E-edges encode a very different type of graph from SSG.

Same for convert dovetail E record to the SSSG vertices and edges.

Allowing dovetail E greatly hurts user experiences, but this is relatively a small one in comparison to the fact that non-dovetail E encodes a very different graph from SSG.

(EDIT: I see what you mean by SSSG. Removed the irrelevant part in my comments. Nonetheless, I need to point out that an SSSG is an SSG. As GFA1 encodes every SSG, it naturally encodes SSSG.)

from gfa-spec.

sjackman commented on September 24, 2024

I prototyped a two-step overlap pipeline once. The first step used minhash to estimate Jaccard index, so it knew if two reads shared k-mer similarity, but didn't know anything about the nature of the alignment. The second step determined whether the match was a dovetail, containment, or internal (repeat-induced) match. Using GFA, the first step could output edge E records (with * for coordinates and the estimated Jaccard index in an optional tag), and the second step could determine which edges corresponds to dovetail edges and output L records.

I see E records as a precursor to L records—you may disagree with me here, Gene. The point is that I can see a use case for E records, that is, records that indicate "these two segments share sequence similarity". I'm not sure that this specification needs to define in any more detail what that sequence similarity implies.

(It was actually a four step pipeline: 1. estimate Jaccard index 2. determine dovetail, containment or internal 3. estimate amount of overlap 4. pairwise alignment)

from gfa-spec.

pb-jchin commented on September 24, 2024

Accordingly, "P (path)" is reserved for paths through L-lines only. We may introduce "T (thread)" to describe "paths" through both L- and R-lines. In implementation, I expect the supposed libgfa to only store R and T lines but do not support advanced queries for the time being.

I think they are all too specific. I think it is fine to have "P" as a specially category of subgraph. After that, we should just have a generation notion of subgraph. (I do use the concept of "bundles" in the code and writing but they are just subgraphs too.)

from gfa-spec.

sjackman commented on September 24, 2024

I like M for match (or map), but that's really a bike-shedding argument. Once we agree on what the content of the record should be, I'm happy with whichever character is most popular. E is fine by me too. I'm less fond of R.

I propose

M <seg1> <start1> <end1> <orientation> <seg2> <start2> <end2> <score> <cigar> <optional_tags>

[start, end) is a half-open 0-based interval. The size of the match is end - start.
Score is an implementation-dependent score. It could be the alignment score. It may be *.
CIGAR is optional (may be *). Any 👍 votes or counterproposals? Let's wrap this up!

from gfa-spec.

sjackman commented on September 24, 2024

This format is very similar to the Exonerate SUGAR (Simple UnGapped Alignment Report), VULGAR (Verbose Useful Labelled Gapped Alignment Report), and CIGAR (Compact Idiosyncratic Gapped Alignment Report) formats.
See http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate-manual

query_id Query identifier
query_start Query position at alignment start
query_end Query position alignment end
query_strand Strand of query matched
target_id The same 4 fields for the target sequence
target_start
target_end
target_strand
score The raw alignment score
cigar Compact Idiosyncratic Gapped Alignment Report

from gfa-spec.

ekg commented on September 24, 2024

@lh3 sorry to take a few days to get back around to this.

Your top figure is not an overlap graph, as the red and green don't have a dovetail overlap.

First off, sorry for the error in the drawing. Here's a corrected version:

To make distinctions, I will call these "graphs" as diagrams in the following, and reserve graph for the graph described in graph theory. A graph is a diagram, but a diagram is not necessarily a graph.

You can call them whatever you want, but I'm going to assume when you say "variant diagram" that you mean "variation graph", unless it's been rendered as an image! I don't understand how the data model we're working with isn't a graph.

The initial purpose of GFA is to model an assembly graph for different stages of assembly wherever the graph is equivalent to an SSG. Your concern arises mostly because we are deviating from the initial purpose: we are now pushing GFA to describe variant diagrams.

I'm pointing out that maybe we don't need overlaps or E records if we are willing to embed sequences which we are interested in as walks through blunt-ended graphs. I don't think this will be taken up by the group, but think it's worthwhile mentioning this perspective as it will continue to persist as long as we continue developing vg.

Variation graphs as implemented in our schema are compatible with GFA. We need to model all of GFA or GFA2 or whatever assembly format because we use assembly graphs as input and output, and we can't be sure that assemblers will reduce their representations into something equivalent to the models that we've found easiest to work with.

We don't need to push GFA to describe variation graphs (diagrams?) because it already does! However, we can't use GFA for everything we do because its semantics for dealing with paths (or walks, or alignments) are poorly developed. Paths are central to many things we do in vg. We use paths to represent genomes, contigs, alleles, variable loci, alignments, and graph morphisms. When carefully defined, they are super useful for supporting graph based genomics!

In fact, my personal preference is to separate assembly graph and variant diagrams –– we design a new format specifically for variant diagrams. In the new format, we are free to trash features important to assembly. I would virtually prefer to get rid of overlap in this variant diagram format as well.

I agree that a new format is useful. We've made one... but I'd like to point out that it includes features which are only useful for interfacing with assemblers, through which we've ensured compatibility with core features in GFA. For instance, the data model allows us to record overlaps on edges so that we can read in GFA assemblies and then collapse overlaps and normalize them. Also, we use the same model of edges (there are four possible types) and nodes (they have sequence labels on them). We could also have made a DAG-only model, but it would be much less useful for numerous reasons.

So I also disagree. I think that conceptual compatibility is crucial, and I don't think we need to break away or make things incompatible. We've got nothing to gain by doing that. A central objective of variation graph (PRG/ "graph reference"/ pan-genomes etc.) methods is to expand the prior that is used in resequencing (the reference) to include many genomes/isoforms/assemblies rather than just one. This goal is perfectly served by assembly algorithms, and we're going to do whatever is needed to interface with them.

Note that common operations on GFA and the new format will also be distinct. For GFA, my initial purpose is to have common modules for transitive reduction, bubble popping, etc.

I don't think this is necessarily the case either. For instance, vg implements a kind of bubble popping. However, we call it "genotyping". Transitive reduction of overlaps might not be something we specifically need to implement, but normalizing the graph is, and normalization procedures we implement could be used in assembly pipelines provided we maintain conceptual compatibility with the interfaces used by such methods.

In some sense, you could think of vg as an attempt to maximally break down and distribute process of generating an assembly out of a collection of genomes. But that's a topic for another day.

For variant diagram, we are more interested in queries such as "give me branching points or variants in [1000,2000) on contig xyz". This makes the data structures to effectively store the graphs different, too.

This could also be useful in assembly contexts. We support it using data structures which I would seriously not want to see released into the wild as an interchange format! I also wouldn't want to adjust the schema of the interchange format to somehow make such queries easier, as this would require defining all manner of secondary indexes or duplicating aspects of the data. Let's just define the data model, and leave querying it efficiently up to the community.

If there is an effective way to create blunt-end graphs (that is another topic), most assemblers produce such a graph. I haven't dealt with this myself –– my understanding is Gene and Jason prefer not to output such a graph as the final assembly. They apparently want some segments to explicitly represent full-length contigs. This pushes the GFA graph to a diagram.

This of course makes such outputs harder to work with in our case. We will probably find a way to convert or reform such assemblies into something we can work with. vg implements overlap reduction, and I assume it could be generalized to work on the side graph /diagram things that the E records in GFA2 suggest.

I finally see why you need CIGARs on paths. I have a different view. Mismatches and gaps in an overlap are errors rather than variants. A perfectly clean string/overlap graph should only have exact overlaps. Then you would not need CIGARs. To push this even further, David Jaffe argued a truly final assembly graph should have no overlaps. I gradually come to agree with him.

For vg at least, we need CIGAR like things (we use a series of Edit operations) for the same reason they are useful in resequencing. A common reference assembly can be accessed in a read-only manner such as when running read alignment on a cluster. Then, we can collect evidence and decide if we believe the things in the CIGARs using an error detection model like genotyping or count-based filtering. These aren't necessary for serialized graphs, but a lot of thought and code can be saved if the path "embedded" in the graph is represented with the same semantics as the path of an alignment through the graph.

I'm uneasy that GFA will develop an alignment format that is unrelated to paths, and to which we will be unable to interface directly. This is another conversation though.

In general, I'd like to ask for review of the vg schema by people in this group. It is 200 lines including self-describing comments, so hopefully this isn't much to ask. I would like it to be able to support most graphical models used by assemblers that read and write GFA like formats. I'd be very interested to know if anyone sees something that would need to change to ensure model compatibility with their software.

from gfa-spec.

sjackman commented on September 24, 2024

One way to interpret GFA as a graph is for each vertex to be a single nucleotide represented by the tuple (segment, orientation, position) and for each directed edge to connect two nucleotides. For example:

Alignment

   0123456789
U: AAAAACGTAA
       || |
V:   CCACCTCCCC
     0123456789

GFA

S U AAAAACGTAA
S V CCACCTCCCC
M U 4 8 + V 2 6 + 3 4M

Graph

Graphviz

digraph match {
graph [rankdir="LR"]

"U+0" [label="A"]
"U+1" [label="A"]
"U+2" [label="A"]
"U+3" [label="A"]
"U+4" [label="A"]
"U+5" [label="C"]
"U+6" [label="G"]
"U+7" [label="T"]
"U+8" [label="A"]
"U+9" [label="A"]

"V+0" [label="C"]
"V+1" [label="C"]
"V+2" [label="A"]
"V+3" [label="C"]
"V+4" [label="C"]
"V+5" [label="T"]
"V+6" [label="C"]
"V+7" [label="C"]
"V+8" [label="C"]
"V+9" [label="C"]

"U+0" -> "U+1" -> "U+2" -> "U+3" -> "U+4" -> "U+5" -> "U+6" -> "U+7" -> "U+8" -> "U+9"
"V+0" -> "V+1" -> "V+2" -> "V+3" -> "V+4" -> "V+5" -> "V+6" -> "V+7" -> "V+8" -> "V+9"
"U+3" -> "V+2"
"V+1" -> "U+4"
"U+7" -> "V+6"
"V+5" -> "U+8"
}

Paths

Eight paths are possible through this graph from a source to a sink. Two of those paths represent the original reads. Six represent possible contigs composed of these two reads.

I don't believe the GFA spec needs to define this interpretation. I wanted to show one way that match M records can be interpreted as edges in a way that is consistent with meaningful paths through the directed graph.

from gfa-spec.

lh3 commented on September 24, 2024

I think they are all too specific. I think it is fine to have "P" as a specially category of subgraph. After that, we should just have a generation notion of subgraph. (I do use the concept of "bundles" in the code and writing but they are just subgraphs too.)

Interesting. However, subgraph is a subset, but path is an ordered list. How do you introduce "order" to a subgraph?

I like M for match

I like this: "M" for match.

One way to interpret GFA as a graph is for each vertex to be a single nucleotide represented by the tuple (segment, orientation, position) and for each directed edge to connect two nucleotides.

In GA4GH, we call this "base graph". This is a real graph.

@ekg Sorry that my wording is ambiguous. By "variant diagram", I mean insert "graph", side "graph", and "graph" requiring unusual edge properties. The vg variation graph has GFA1 at the heart. It is indeed a mathematical graph, not a diagram. Jouni's base graph is a real graph and can encode variations. FASTG is equivalent to GFA1, so the discovar variation graph is a real graph, too. I agree with the rest of your comments: GFA1 alone can be an effective way to encode variations, though not necessarily the best way (I don't know what is the best way, so I don't want GFA to be locked into a particular representation too soon).

from gfa-spec.

lh3 commented on September 24, 2024

The following is a unfinished draft on the GFA format. It largely copies the overall structure of the current spec. A version controlled draft is here.

Introduction
Terminologies
GFA: Mandatory Fields
TODO

Introduction

GFA stands for Graphical Fragment Assembly format. It is a TAB-delimited text format to describe the relationships between sequences. Initially designed for sequence assembly, GFA may also represent variations in genomes and splice graphs in genes.

For an example, in the picture below, each line is a nucleotide sequence with the arrow indicating its orientation:

Assuming each sequence is 200bp in length and each overlap is 20bp in length, we can encode this graph in the GFA format as:

S  A  200  *
S  B  200  *
S  C  200  *
S  D  200  *
L  A  +  B  +  20  20
L  A  +  C  -  20  20
L  B  +  D  +  20  20
L  C  -  D  +  20  20

where each S-line, or segment line, gives the property of a sequence, including its length and actual nucleotide sequence; each L-line, or link line, describes the relationship between two segments. There are two common ways to understand an L-line. We take L A + C - as an example. First, in the overlap graph view, the L-line indicates sequence A on the forward strand is ahead of C on the reverse strand. Second, in the string graph view, the end of A transits to the start of C (i.e. + for the end of a sequence and - for the start). In the following, we often take the overlap graph view for convenience.

Notably, if L A + C - is a link, L C + A - is also a link because the two are equivalent:

Terminologies

Segment: a sequence. An oriented segment is a 2-tuple (segment,strand). Each oriented segment has a complement oriented segment (segment,¬strand), where operator ¬ gives the opposite strand.
Link: a (full) dovetail overlap between two oriented segments. Each link has a complement link derived by swapping the order of segments and flipping the orientations. For the definition of dovetail overlap, see documentations from GRC or from wgs-assembler.
Gap: an unknown sequence connecting the ends of two oriented segments.
Match: a local alignment between two oriented segments.

Mathematically, GFA models a skew-symmetric graph, where each vertex is an oriented segment, and each edge is a link in GFA.

GFA: Mandatory Fields

In GFA, each line is TAB-delimited and describes only one type of data. On each line, the leading letter indicates the data type and defines the mandatory fields on that line. The following table gives an overview of different line types in GFA:

Line	Type	col1	col2	col3	col4	col5	col6	col7
`H`	Header
`S`	Segment	sid	slen	seq
`L`	Link	sid1	ori1	sid2	ori2	olen1	olen2
`G`	Gap	sid1	ori1	sid2	ori2	dist
`M`	Match	sid1	ori	sid2	beg1	end1	beg2	end2

In the table, sid* are strings, slen, olen, and dist are 32-bit non-negative integers, and ori take values of + or -; beg* may be an integer or ^ for the start of a segment; end* may be an integer or $ for the end of a segment.

Segment line

A segment line or S-line takes the following format:

S   <sid>   <slen>  <seq>

where sid is the segment name, slen is the length of the segment and seq is the sequence which can be * if not available.

Link line

A link line or L-line is

L   <sid1>  <ori1>  <sid2>  <ori2>  <olen1> <olen2>

where sid1/sid2 are the names of segments involved in the link, ori1/ori2 are orientations (either + or -), and olen1/olen2 are the lengths in the overlap as is shown in the following (o1 and o2 in the figure correspond to olen1 and olen2, respectively):

In GFA, each L-line has a complement L-line, which is

L   <sid2>  ¬<ori2>    <sid1>  ¬<ori1>    <olen2> <olen1>

Gap line

A gap line or G-line is

G   <sid1>  <ori1>  <sid2>  <ori2>  <dist>

where dist is the best estimate between the ends of two segments. A G-line is effectively an L-line with negative overlap length.

Match line

A match line or M-line is

M   <sid1>  <ori>   <sid2>  <beg1>  <end1>  <beg2>  <end2>

where sid1/sid2 are the names of segments, [beg1,end1) gives the interval on the forward strand of sid1 and [beg2,end2) gives the interval on the ori strand of sid2.

Each M-line also has a complement M-line.

TODO

Optional fields
Paths
To store both link and complement link or not
Nail down the format of M-line

from gfa-spec.

pb-jchin commented on September 24, 2024

Interesting. However, subgraph is a subset, but path is an ordered list. How do you introduce "order" to a subgraph?

You don't, as many useful structure in an assembly graph can not have strict order. For example, how do you order a non transitive reduced graph? I think the GFA as described is too specific. It may be only useful for something.

Mathematically, GFA models a skew-symmetric graph, where each vertex is an oriented segment, and each edge is a link in GFA.

While it might be mathematically isomorphic, I still don't think this is the original definition. You can change the definition of the vertex, but you should mention that too.

from gfa-spec.

pb-jchin commented on September 24, 2024

also,

Link: a (full) dovetail overlap between two oriented segments. Each link has a complement link derived by swapping the order of segments and flipping the orientations. For the definition of dovetail overlap, see documentations from GRC or from wgs-assembler.

Is this an directed or undirected edge in the graph? or it actually represents two directed edges from one record? Do you also need the "complement link" record in GFA? Or, is it ok to have just one?

from gfa-spec.

lh3 commented on September 24, 2024

You don't, as many useful structure in an assembly graph can not have strict order. For example, how do you order a non transitive reduced graph? I think the GFA as described is too specific. It may be only useful for something.

Let's talk about this when we nail down paths. I am fairly open, as I don't have a good idea myself.

While it might be mathematically isomorphic, I still don't think this is the original definition. You can change the definition of the vertex, but you should mention that too.

Changed to "Mathematically, GFA models a skew-symmetric graph, where each vertex is an oriented segment (in the overlap graph view) or the 5'- or 3'-end of a segment (in the string graph view), and each directed edge is a link in GFA."

Is this an directed or undirected edge in the graph? or it actually represents two directed edges from one record?

I have added "A link is directed" in this paragraph and said that each directed edge in SSG is a link in GFA. For all changes, see here.

Do you also need the "complement link" record in GFA? Or, is it ok to have just one?

See TODO at the end of the draft. It is pending (see also #35). My mild preference is to have one link stored by default, but a header tag may indicate both links are stored. Note that this duality is not only applied to L-lines, but also to G- and M/E-lines.

Also, I know you like to keep the overhang length rather than the overlap length at L-lines. Do you have a strong preference or think overlap length is acceptable? My mild preference is overlap length as it is easier to see if there is an overlap. Another possible way is to encoder zero-length overlap only at Gap lines as @MihaiPop suggested. I am open.

Thanks a lot for the improvements!

from gfa-spec.

pb-jchin commented on September 24, 2024

I have added "A link is directed" in this paragraph and said that each directed edge in SSG is a link in GFA. For all changes, see here.

If so, one will need two edges (two L records) for the full graph. Simple.

See TODO at the end of the draft. It is pending (see also #35). My mild preference is to have one link stored by default, but a header tag may indicate both links are stored. Note that this duality is not only applied to L-lines, but also to G- and M/E-lines.

See above. If you think the "link" is directed, you will have to have both links for both edges. If you want to treat the record as "Link" to represent the overlap, you will have to define the mapping of the record L to the dual edges for SSG. This will be equivalent to Gene's dovetail E records anyway. (If we ignore the orientation, there is only ONE overlap. Such "one overlap -> represented in both directions -> dual edges" is the source of many confusion here.) Such L record as different representations of overlaps and edges in a "string graph" confused me quite a while, see #8

In fact, mathematically, in Gene's original paper, sequences (from the overhangs) are associated with the "dual" edges. If one considers the sequences as the edge labels, we don't have a pure skew-symmetry. (One will have to add an extra mapping definition to map the edge labeling sequences to restore such symmetry.)

EDIT:
by the way, I use "B" and "E" to specify the "ends" of a segment and concatenate sequence id with the "B" or "E" without space as I do treat them as real vertices. In the mean time, "+" is also an characters of common regular expression. I felt that using "+" or "-" were not clear to show they actually represented the ends of the read segments, so I used "B" and "E". Anyway, I don't really care about such detail. I do care about whether some situations can be represented by a format. If not, I will have to add metadata anyway or use other richer formats that can express all scenarios.

from gfa-spec.

pb-jchin commented on September 24, 2024

Also, I know you like to keep the overhang length rather than the overlap length at L-lines. Do you have a strong preference or think overlap length is acceptable? My mild preference is overlap length as it is easier to see if there is an overlap. Another possible way is to encoder zero-length overlap only at Gap lines as @MihaiPop suggested. I am open.

I like @MihaiPop's proposal, but if segments lengths are available, it is a simple conversion anyway. (I just like to make sure the coordinate system is relative to the original sequences not the complements regardless the overlap orientations.

from gfa-spec.

sjackman commented on September 24, 2024

I'm generally in favour of Heng's proposal. I do not however want to break compatibility with the existing spec and the existing implementations of GFA. I feel very strongly about this. Breaking compatibility is a big deal. It requires that

authors update their implementations, if the software is maintained
if they're not maintained (because students have graduated, for example), it requires that users convert between GFAv1 and GFAv2 formats, when using software that support different versions of the spec, which is annoying to users, and in my opinion defeats the purpose of having a standard
authors of software that want to support both formats (like perhaps Bandage) for compatibility with old and new software and old and new versions of the same software will have to maintain code to read both GFAv1 and GFAv2 forevermore
it betrays the trust of the community that incompatible changes to the spec will only happen when absolutely necessary, and when there is no other way to accomplish the same features in a way that does not break compatibility with existing implementations, and that the features are sufficiently beneficial and necessary to warrant the incompatible change (perhaps for example to support sorting and indexing for random access)

Note that the proposed changes importantly fail the fourth test. It is possible to add these features to the existing standard by adding optional fields in a manner that does not break backward compatibility. We can achieve a lot by adding optional tags and new record types, deprecating existing record types, but not modifying the fixed fields of existing record types.

One big reason I think that SAM/BAM was so successful was that it never broke compatibility. If I want to compare the results of the first release of BWA to those of the last release of BWA, I can do that easily because the SAM/BAM format has not changed the core fields.

I propose we keep segment S as is, add olen1 O1 and olen2 O2 optional attributes to the link L record, and add new gap G and match M records.

S   <sid>   <seq>  LN:i:<slen>
L   <sid1>  <ori1>  <sid2>  <ori2>  <cigar> O1:i:<olen1> O2:i:<olen2>
G   <sid1>  <ori1>  <sid2>  <ori2>  <dist>
M   <sid1>  <ori>   <sid2>  <beg1>  <end1>  <beg2>  <end2>

from gfa-spec.

sjackman commented on September 24, 2024

beg may be an integer or ^ for the start of a segment; end may be an integer or $ for the end of a segment.

I would prefer not to use ^ and $ so that these columns have a consistent integer type to make parsing easier and to make the format usable with standard tools like awk and sort.

from gfa-spec.

sjackman commented on September 24, 2024

Also, I know you like to keep the overhang length rather than the overlap length at L-lines. Do you have a strong preference or think overlap length is acceptable? My mild preference is overlap length as it is easier to see if there is an overlap.

I prefer overlap length.

from gfa-spec.

pb-jchin commented on September 24, 2024

I prefer overlap length.

again, this is the dilemma, are the L records for "overlaps" or the string graph "edges" (SSSG edges)? If they are for overlaps, I agree with you. If they are for the "edges", then overhang lengths are proper. (just math. as I said before, as long the conversion is easy, who cares. In the meantime, if we want to advocate that it follows some math structure, then it is better to be consistent.)

from gfa-spec.

sjackman commented on September 24, 2024

In my mind, GFA represents a string overlap skew-symmetric graph of DNA sequences. The vertex U+ is one string, the complementary vertex U- is the reverse complement string, and the link L edges represent the overlaps of strings.

from gfa-spec.

pb-jchin commented on September 24, 2024

In my mind, GFA represents a string overlap skew-symmetric graph of DNA sequences. The vertex U+ is one string, the complementary vertex U- is the reverse complement string, and the link L edges represent the overlaps of strings.

That is fine but just point out it that it is different from other definition. (You can provide the information how to map one to the other.) Also, in this case, I am wondering why we need two fields for one vertex. (BTW, don't get wrong. I am not objecting the format. As long as one can encode the information, I don't care too much. But I do like to point out what I see as inconsistence .)

from gfa-spec.

sjackman commented on September 24, 2024

Also, in this case, I am wondering why we need two fields for one vertex.

I'm not sure that I follow. Can you elaborate?

from gfa-spec.

pb-jchin commented on September 24, 2024

Also, in this case, I am wondering why we need two fields for one vertex.

I'm not sure that I follow. Can you elaborate?

In you early comment, you use U+ and U-. Each of them represent different element in a set and your description in the comment shows that clearly. In the spec, there are U and + or - as separate fields. In such case, it seems to me U is the fundamental "element" in a set but U is not in the SSG technically.

from gfa-spec.

sjackman commented on September 24, 2024

The ABySS implementation adds two vertices to the graph for each segment S record and two edges for each link L record. For example

S U *
S V *
L U + V + *

Results in the following skew-symmetric graph of four vertices and two edges.

digraph {
"U+"
"U-"
"V+"
"V-"
"U+" -> "V+"
"V-" -> "U-"
}

from gfa-spec.

pb-jchin commented on September 24, 2024

The ABySS implementation adds two vertices to the graph for each segment S record and two edges for each link L record.

Yes. In FALCON internal format, a vertex is a vertex derived from an overlap. An overlap (e.g. Gene's dovetail E will create two vertices in the code) is an overlap. If you take this point of view to convert one L record to two vertices, I will call L an overlap. If we force to have two records for two edges, each L encode an edge, in such case, we should just have two fields of the two vertiecs rather than 4 fields. (Again, I don't really care what convention used in the final spec, it encodes the same information. This is not a request for modifying the format, but some opinion which probably does not matter at the end.)

EDIT:
basically, the GFA wants to have a link L to represent both overlaps and directed edges in SSG, but the behavior is not defined within the spec. For directed edges representation, it will be nice to have simple label (one field) than two fields for each single vertex. However, not really a big deal. In fact, I constantly need to separate the vertex label back to two fields for computing the dual edges.

from gfa-spec.

lh3 commented on September 24, 2024

If so, one will need two edges (two L records) for the full graph. Simple.

In GFA1 and my latest proposal, a pair of complement/dual edges on a string graph is preferentially given on only one link line. That link encodes the information of both edges. Nonetheless, I am open to changing an L-line to:

L  <sid1>  <ori1>  <sid2>  <ori2>  <overhangLen1>  <overlapLen1>

Notably, we only keep the length information for one of the complement/dual edges. This breaks the symmetry and forces users to use two link lines, one for each edge. It also resolves your concern about whether to keep overhang length or overlap length – we keep both, though we allow one of them to be missing. The downside is we don't know the overhand/overlap length of sid2. I actually slightly prefer this new L now. What do you think, @pb-jchin?

Also, in this case, I am wondering why we need two fields for one vertex.

I thought about that before. As there is no "vertex line", I feel it is more natural to have explicit segment names on L-lines, too.

@sjackman I really care about compatibility, but in this case, I think it is the best time to break it. There are a few quite significant flaws in GFA1 I have been wanting to fix for a while.

I also have an incomplete prototype implementation of GFA2 here. It allows some corner cases that can be inferred from other information, mostly related to dual edges (see test.gfa in that directory), but this complicates the code quite a bit. If we adopt the new L-line as above, we could avoid this complication. Nonetheless, beware that GFA-Match/DAS-Edge lines still have this complication and are probably worse.

from gfa-spec.

sjackman commented on September 24, 2024

Edge `E` record

I'd prefer that segment records precede the edge E records that reference them. I argued for this ordering for S and L records but lost that debate. Perhaps we could discuss the ordering requirement once again. ABySS for example outputs all the S records before all the L records.

I'd prefer that we use 0 and the numerical length of the sequence rather than special characters or special numerical values like -1. Although -1 is an interesting proposal, it leaves me a bit uneasy. I have occasionally found it useful to refer to the nucleotides immediately before the start of a segment, and use negative to coordinates to refer to this region to the left of the origin. My opinion is that exceptions unnecessarily complicates a storage format for no expressive benefit. The end of the segment can be easily represented by the numerical value of the length of the segment.

Importantly though, my preference is only that, my preference. If the popular preference is for ^ and $ as special markers for the start and end, I'm entirely happy going with the popular vote.

from gfa-spec.

sjackman commented on September 24, 2024

ABySS has a few tools for graph manipulation.

abyss-todot converts between graph file formats: Graphviz, GFA, ASQG, ADJ (deprecated ABySS 1.0 format) and SAM (overlap alignments of contigs, useful for IGV, which doesn't yet read GFA).
abyss-filtegraph implements tip trimming (-t), coverage threshold (-c), compacting non-branching edges (--assemble)
PopBubbles, well, I imagine you can guess. =)

@rchikhi Rayan Chikhi's BCALM2 doesn't implement graph simplification routines other than removing k-mers with coverage below a specified threshold. He had expressed interest in using abyss-filtegraph for simplifying his assembly graph. I'm excited for a time when ABySS, Miniasm, BCALM2 and other assemblers can all share a single implementation of gfatools to eliminate (or at least reduce) this duplication of effort, just as aligner authors don't have to reimplement samtools sort, samtools mpileup, bcftools call, and so forth.

Each of these tools (abyss-todot, abyss-filtergraph, PopBubbles) can read and write GFA as well as Graphviz, ASQG (SGA) and ADJ (ABySS 1.0). Reducing duplication of effort will be great, but I am even more excited for the ability to mix and match components between assemblers, for example using the unitigger from BCALM, sga overlap from SGA, PopBubbles from ABySS, and so on.

from gfa-spec.

richarddurbin commented on September 24, 2024

I have tried to follow all this, but it is very long. I am giving my points with identifiers so people can refer to them separately.

R1) It seems to me that paths need optional start and end coordinates within their initial and terminal segments. vg absolutely requires that, and we clearly might want to start within a long contig or haplotig. But I don't understand @sjackman's proposed tags ST:B:I,0,55 EN:B:I,150,160. Surely we just need a single integer for each, as I interpret @rrwick's suggestions. So ST:I:0 EN:I:160.

R2) I think we need optional cigar for the reasons that vg allows cigar, to support inexact alignments. I would just have a CG:B: optional field at the end of the line as for Links and Matches, which would be a cigar against the whole path as defined. This is different from vg which allows cigar for each component of the path, but they are clearly interconvertible and I prefer the idea that the cigar alignment information is a property of the path, not its components. This also requires being specific about the sequence in a link or edge which itself has a cigar - I propose that the sequence is that of the segment you are coming from while traversing the path.

R3) Heng keeps saying that internal match edges are nothing to do with assembly or graphs, but I think he is clearly wrong about assembly. The graph with links only is perhaps relevant at one point of the assembly process, which is what he has focused on. But following that, when one is trying to make as long scaffolds/haplotigs as possible (which must the point of genome assembly), then you naturally need the other edge types. In particular to support alternative paths as in simple and complex bubbles, and potentially things that are not bubbles at all. And also in the end when you have a diploid assembly it is essential to indicate which sections of the two haplotigs should be identified, so that paths can cross over there, and these can be identified with other sequences that one might have, such as reference sequences. Even if not diploid, I may not be able to resolve long repeats and may want to represent two or more copies of them in long paths, indicating shared segments. We have to support uses all through the assembly process, and also for variant graphs, which they inevitably blend into. I am happy to adopt Match rather than Edge for these, but they are no doubt a form of edge (lower case, meaning a graph edge between vertices that are segments).

R4) Given we have such Matches and hence a multi-graph, then to specify a path we need in the general case to specify which link or match is used. This is why Gene introduced match id in the same namespace as segment id, so that the sequence of ids defining a path can specify matches where necessary as well as segments. This seems clean to me. I presume the default behaviour would be that if there is a link and no match is specified then by default you use the link.

R5) Gene's proposal had both paths and subgraphs using the same record type, with a required field to distinguish them. I see the need for subgraphs - Jason in particular requires them as an intermediate in FALCON, but I can also imagine someone writing a bubble detector that does not pop anything, but just creates subgraph objects to identify them. Then phasing software that is population or pedigree aware could phase them as a separate step.

R6) We should make length mandatory for Segment fields. Mandatory fields should be fixed fields. Consistency in this principle is more important than consistency in the syntax. As Heng has already implemented in his released gfaview (and Gene and I in code we have written) it is no problem for a parser to read both GFA1 without fixed field length and GFA2 with fixed field length. Starting to use optional field for required information is much more likely to create confusion, because it is hard to enforce that people provide the required data. So I am sure, from lots of experience with formats of this type with required and optional fields, that the consistent position is that, once we decide that length is mandatory, we should make it a fixed field. Given that, it should clearly come between the id and the sequence.

from gfa-spec.

lh3 commented on September 24, 2024

R3) Well, I understood why you need internal matches after a day or two into the discussion. I have added Match lines to GFA2 for a while and think it is a great addition. I am not questioning the usefulness of M-lines at all. My major concern is that you are trying to change the semantics and the mathematical foundation of GFA. GFA1 is a directed skew-symmetric graph where most information is conferred by the graph topology itself. You want to change it to an undirected graph where the graph topology is often not meaningful without coordinates. I am strongly in favor of SSG for a few reasons. First, GFA1 is an SSG. Changing the math foundation leads to a new format other than GFA. Second, SSG has a solid root in graph theory. It is almost the only meaningful graph representation of assemblies before you start to merge at forks (de Bruijn graph can be thought as SSG as well). Third, SSG can encode variants. @ekg has said this enough. David Jaffe is using an SSG-equivalent representation as well. Your E/M-based representation is just another one. While I buy yours is nice and effective, I am not seeing it convincingly better than Erik and David's SSGs. I am cautious of locking GFA into a graph representation that has not been repeatedly proven to be the best.

I (of course) think GFA2 proposed by me earlier achieves a good balance between different parties: it is an SSG while allowing flexible internal matches in your style. I really have one condition only: GFA is SSG.

R4) I am open to match/link names. I prefer tags, but can live with new fixed fields.

R5) I am also open to defining "subgraphs" (I would call them subsets) in principle, but this needs more discussions.

R6) I strongly support breaking the forward compatibility of GFA1 (GFA1 parser not reading GFA2), as long as we keep the backward compatibility (GFA2 parser reading GFA1). There are a few important things in GFA1 that are not right.

from gfa-spec.

lh3 commented on September 24, 2024

SSG can encode variants. @ekg has said this enough. David Jaffe is using an SSG-equivalent representation as well.

I will explain how @ekg and David Jaffe would encode variants, for a record myself. @ekg, correct me if I say something wrong about vg.

In my understanding, after initial assembly, we get an SSG. Segments are unitigs. In the vg view, a primary contig is a path, not a segment. We need to follow the path to get the contig sequence. As to haplotigs, there are a few options. One way is just to use one path for one haplotig. Another way is to have a path spelling the other haplotype with phasing switch points encoded somewhere. In addition, vg supports graph alignment on SSG. I believe it translates between the segment and the path coordinates (right, @ekg?). We are yet to have a graph aligner that works on graphs with internal matches. Of course, converting such a graph to SSG definitely works.

David's approach is similar in that in the final assembly graph, which is also an SSG, long-range information is broken into much smaller segments. However, it seems that David is not storing paths. He traverses the graph on the fly to emit long FASTA sequences in different styles (e.g. phased blocks only; long contigs with occasional phasing switch errors; etc). I guess he can do this efficiently because the nested bubble structure of FASTG greatly simplifies such traversal. (PS: @ekg, have you tried vg on 10X's SuperNova graph? The graph can be converted to GFA, but it may take some efforts.)

from gfa-spec.

sjackman commented on September 24, 2024

R6) I strongly support breaking the forward compatibility of GFA1 (GFA1 parser not reading GFA2), as long as we keep the backward compatibility (GFA2 parser reading GFA1)

@lh3 Are you saying that we require that GFA2 implementations must be able to parse both GFA1 and GFA2 segment S record formats to be considered a conforming implementation of GFA2?

S <id> <seq> LN:i:<length>
S <id> <length> <seq>

from gfa-spec.

MihaiPop commented on September 24, 2024

Heng - I don't agree with the requirement that GFA v.X stay a skew symmetric graph and I'm in favor of the more "messy" but more expressive proposal made by Gene and Richard. My world is a lot messier than what you encounter when assembling well behaved genomes with high depths of coverage. A metagenomic assembly is a very messy mixture of genomes at widely varying coverages, and with mobile elements mixing the genomes around in unpredictable ways. The usual view of variants as "pretty bubbles" formed by two haplotypes mixed together simply doesn't exist in my world, and I'd like whatever format gets created to be able to model the complexity of our data.

I'm also concerned that many of our assumptions now are simply due to what technology is available today, and newer technologies will throw additional wrenches in the beautiful mathematical formalisms we may come up with. This has happened again and again as we moved from Sanger to short read next gen data, to long read next gen data, and now to noisy PacBio. Likewise, mate-pairs, optical maps, HI-C, TLSR&10X, etc. are adding more information that is very useful for assembly and which we want to be able to model in the assembly graphs rather than relying on many different ad hoc formats simply because the assemblers subscribe to an idealized view of the world. That's how XML became a "data format" - let's not make that mistake again.

Funky edges (or matchings), paths, groups, and subgroups are all necessary for getting a graph representation that has a chance to survive changes in technology. If we can't achieve that, we'll continue to spend our efforts writing converters, coming up with new file formats, and having debates such as that here instead of actually trying to figure out what the data tell us about biology.

Sorry for the long diatribe.

from gfa-spec.

lh3 commented on September 24, 2024

Sorry for the long diatribe.

No need to say sorry at all. I have made more noises than many of you combined :) Thanks for your opinions.

I don't agree with the requirement that GFA v.X stay a skew symmetric graph and I'm in favor of the more "messy" but more expressive proposal made by Gene and Richard.

My proposal is as expressive as Gene and Richard's. I said GFA2 will have more general "paths" etc. If you don't care what a graph GFA represents, you can ignore the SSG property.

The usual view of variants as "pretty bubbles" formed by two haplotypes mixed together simply doesn't exist in my world

How do you store your assembly graphs currently? Do you have concrete examples? Thanks.

Are you saying that we require that GFA2 implementations must be able to parse both GFA1 and GFA2 segment S record formats to be considered a conforming implementation of GFA2?

No. I wanted to say the reference implementation should be backward compatible.

from gfa-spec.

lh3 commented on September 24, 2024

R4) Given we have such Matches and hence a multi-graph, then to specify a path we need in the general case to specify which link or match is used.

The current Path definition in DAS is:

<group>    <- P[UO] <name:id> <item>([ ]<item>)*
<item>     <- <sid:id> | <eid:id>

How does this work? Do we allow a Path like the following?

P  foo  A  B  C
L  A  +  B  +
L  B  +  C  +
L  A  +  B  -
L  B  -  C  +

Note that we can't tell the orientation of B in path "foo". Or by Path, we really mean:

<group>  <- P  <name:id>  <sid:id>(  <eid:id>  <sid:id>)*

like the definition of walk/path in graph theory?

We know each M/E-line has a complement M/E-line. Do we require both to be present in GFA2 or just one of them? I guess we require one? Do we allow M-loops in GFA like:

M  A  -  A  10  20  30  40

If we allow M-loops, I think we can't use one link/match ID because there will be ambiguity about what orientation is in use. In addition, on an ordered M-path, we can use neighbors to tell which orientation of link/match is being used, except M-loops. On an unordered M-subgraph, it is often impossible to tell the orientation of a Match. Perhaps we need the following format instead:

<group>  <- P  <name:id>  <sid:id>[+-](  <eid:id>[+-]  <sid:id>[+-])*

where <eid:id>+ is the Match stored in GFA2, while <eid:id>- is its complement.

Furthermore, in the following figure

does "B+,f+,A+,e+,B+" spell a valid sequence? Do we forbid such M-paths? Also, when we specify a M-subgraph with {B+,A+}, do we literally mean the M-subgraph without any Matches/Edges, or mean {B+,e+,A+,f+} (i.e. including the associated Matches/Edges)? If we think it is {B+,A+} without Matches/Edges, we may have such a M-subgraph {B+,A+,g+}. Is it a valid M-subgraph?

from gfa-spec.

sjackman commented on September 24, 2024

I have tidied up the formatting of the current GFA 1.0 spec and tagged a release of 1.0 in preparation for the upcoming pull requests for GFA 2.0.

from gfa-spec.

richarddurbin commented on September 24, 2024

I support that. Richard

from gfa-spec.

richarddurbin commented on September 24, 2024

Some comments below.

On 15 Sep 2016, at 07:03, Heng Li [email protected] wrote:

R4) Given we have such Matches and hence a multi-graph, then to specify a path we need in the general case to specify which link or match is used.

The current Path definition in DAS is:

<- P[UO] name:id ([ ])*
<- sid:id | eid:id
How does this work? Do we allow a Path like the following?

P foo A B C
L A + B +
L B + C +
L A + B -
L B - C +
Note that we can't tell the orientation of B in path "foo". Or by Path, we really mean:

<- P name:id sid:id( eid:id sid:id)*
like the definition of walk/path in graph theory?

My idea was that the eids could be dropped if unambiguous.
We know each M/E-line has a complement M/E-line. Do we require both to be present in GFA2 or just one of them? I guess we require one? Do we allow M-loops in GFA like:

M A - A 10 20 30 40
If we allow M-loops, I think we can't use one link/match ID because there will be ambiguity about what orientation is in use.

Just as with links, I have the impression we allow to give either one or both. The semantics if they disagree are different though. With links my understanding is that given the segment ids and orientation they are unique, so disagreement is an error. We allow multiple M/E lines between the same oriented pair of segments, so disagreement in the fixed data creates a separate object. For M-loops, we have a choice. We could either require creation of a Segment, so

S B 10 *
M M1 A + B 10 20 0 10 // NB I have given a match id here.
M M2 B - A 0 10 30 40 // apologies if I have not quite got the syntax right for the coordinates - I hope the idea is clear

or we could require that you give the reverse edge with a different eid if you want to refer to it. Personally I think we should ban self matches, and require use of an intermediate Segment.

In addition, on an ordered M-path, we can use neighbors to tell which orientation of link/match is being used, except M-loops. On an unordered M-subgraph, it is often impossible to tell the orientation of a Match. Perhaps we need the following format instead:

<- P name:id sid:id[+-](eid:id>[+-] sid:id[+-])*
where eid:id+ is the Match stored in GFA2, while eid:id- is its complement.

For an unordered subgraph I don’t see that we need orientations. We simply take the set of segments and edges listed,
plus (implicitly) any links between the listed segments, as the defined subgraph. There is no ordering, just as there is not for the graph itself.

For a path, then if we ban self-matches then we are OK. If not, then we follow the orientation of the match-id that we specify.

Furthermore, in the following figure

https://cloud.githubusercontent.com/assets/480346/18538935/fdda1f9c-7ae0-11e6-8792-81db930e67f1.png
does "B+,f+,A+,e+,B+" spell a valid sequence? Do we forbid such M-paths? Also, when we specify a M-subgraph with {B+,A+}, do we literally mean the M-subgraph without any Matches/Edges, or mean {B+,e+,A+,f+} (i.e. including the associated Matches/Edges)? If we think it is {B+,A+} without Matches/Edges, we may have such a M-subgraph {B+,A+,g+}. Is it a valid M-subgraph?

I don’t see what B+,f+,A+,e+,B+ specifies. The junction is directional. You can have B+,f+,A+,g+,C+, but you don’t need the

and -. BfAgC is fine. In fact in this case BAC is unambiguous. I am not sure you ever need the +/-. I think not.
You are receiving this because you commented.
Reply to this email directly, view it on GitHub #36 (comment), or mute the thread https://github.com/notifications/unsubscribe-auth/ADRb5icsoHqmR9PHqIPBh9PGMdPxd5J_ks5qqN-hgaJpZM4J0xVt.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

from gfa-spec.

lh3 commented on September 24, 2024

I don’t see what B+,f+,A+,e+,B+ specifies.

Just to confirm – ChAeB and BfAeB are invalid M-path as they do not spell sequences. The parser needs to check invalid M-Paths. It is in fact a bit tricky to check ChAeB, in that both ChA and AeB are Matches, but when you put them together, the M-Path is invalid. This is one of the reasons why I think a general Match is not qualified as a graph edge.

We could either require creation of a Segment ... or we could require that you give the reverse edge with a different eid if you want to refer to it. Personally I think we should ban self matches, and require use of an intermediate Segment.

It is not rare to see an L-loop:

L  A  +  B  +
L  B  +  B  +
L  B  +  C  +

where B represents a tandom repeat longer than the read length. I am not so sure we should ban loops altogether. Assemblers may generate them anyway. As to your second idea of having different eids for L eidBplus B + B + and L eidBminus B - B -, it works in this case. However, for non-loops, we would demand a Match and its complement to have the same eid; otherwise M-Paths and subgraphs may have weird structures. We have to set an exception for loops. The breaking symmetry may lead to troubles on subgraphs: we have to require both eidBplus and eidBminus to be present in a subgraph?

In addition, not having orientations results in another minor inconsistency. When there are two or more different segments, we can specify a Path in two ways: forward or backward. When there is only one segment in a Path, we can only specify the Path in the forward way. In general, in SSG, it is natural to specify a Path with A+B+B+C+ because A+ etc are vertices in the graph. I would still prefer to have orientations, like the GFA1 Paths, to avoid making exceptions in loops and single-segment Paths.

My idea was that the eids could be dropped if unambiguous.

When you allow eids to be missing on P-lines, you are imposing a constraint on GFA2: there should not be duplicates between eids and sids; otherwise you can't tell if an ID is an eid or an sid. In addition, this causes some troubles in parsing. Because sids and eids may be defined later in the file, when we come to a P-line, we may not tell eids from sids. We have to keep the strings in memory and resolve them when the whole file is read, or we do two-pass parsing. This seems clumsy.

We simply take the set of segments and edges listed, plus (implicitly) any links between the listed segments, as the defined subgraph.

So U-line really represents a subgraph induced from a set of segments and L/M-lines. This is fine if we induce a graph only from segments, but may lead to a chain reaction if we start to induce from Matches. For example, {g} will induce {A,g,C}. Now as A and C are in the set, they will induce {A,g,C,h}. ~~This can be arbitrarily complex. If we disallow inducing subgraphs from Matches, {A,B,g} is an invalid subgraph, which the parser needs to check.~~ In addition, do we always want an induced subgraph? For example, if A and D are two long haplotypes and have many Matches between them. Do we always want to include all these matches in a subgraph? Also note that when we allow to induce a graph from segments, Matches/Edges become redundant. (EDIT: in addition, do we allow disconnected subgraphs?)

For an unordered subgraph I don’t see that we need orientations.

If we specify a M-subgraph without orientation, a U-line is specifying two subgraphs: the subgraph and its complement. This is fine to me, but it leads to a new question: how are subgraphs getting used? Paths spell sequences, so they are useful. I am not sure how we should use subgraphs. You and Jason mentioned bubbles as subgraphs, but we can't easily tell if a subgraph is a bubble or not and we are unable to connect a bubble subgraph as a whole to the rest of graph. Note that FASTG only has bubble subgraphs. Their special syntax makes their connections to the larger graph fairly obvious. In the current definition, GFA subgraphs lack this nice property. I think this is an area we should learn from FASTG.

from gfa-spec.

lh3 commented on September 24, 2024

Just to confirm – ChAeB and BfAeB are invalid M-path as they do not spell sequences. The parser needs to check invalid M-Paths. It is in fact a bit tricky to check ChAeB, in that both ChA and AeB are Matches, but when you put them together, the M-Path is invalid. This is one of the reasons why I think a general Match is not qualified as a graph edge.

Thinking about this, I realize omitting edges on Paths may lead to one more complication. Although ChAeB is not a valid Path, CgAfB is valid. This means CAB is a valid and unambiguous M-Path. This is kind of nice, until we need the parser to understand M-traversals. When there are many Matches between C-A and between A-B, which easily happens when A/B/C represent long haplotypes, it may take non-trivial efforts to determine if CAB is unambiguous because the parser needs to read coordinates. An efficient implementation of this check is also interesting. A naive one would be quadratic. To avoid this complication and the ones mentioned earlier, the parser may choose not to validate the semantics. Garbage in, garbage out.

EDIT: we may require in the CAB case, g/f Matches must be present, but implementing general M-traversals remains complex.

from gfa-spec.

sjackman commented on September 24, 2024

An unordered subgraph induced from a set of vertices should include both segment IDs and orientations. A vertex in GFA is a two-tuple of segment-id and orientation. Consider the simple bubble below. The relative orientations of the four segments is necessary information.

digraph {
rankdir = "LR"
"X+" -> "A+" -> "Y+"
"X+" -> "B-" -> "Y+"
}

from gfa-spec.

lh3 commented on September 24, 2024

@sjackman if we only specify segment IDs, we induce two often disconnected "dual" subgraphs. It is ok to define subgraphs this way, IMO. The question is what to do with them. People use SAM/BAM not only because it stores alignments, but more because SAM/BAM enables them to do things to alignments.

from gfa-spec.

sjackman commented on September 24, 2024

Once a bubble popper has identified the orientations of vertices of a complex bubble, that information ought to be recorded in the GFA file. It could also be the relative orientations of contigs in a chromosome represented as a GFA subgraph.

An oriented subgraph
S1 = (X+, A+, B-, Y+)
naturally has a complementary subgraph
complement(S1) = (X-, A-, B+, Y-)
just as the vertices and edges of a SSG have complements.

from gfa-spec.

richarddurbin commented on September 24, 2024

I understand your point @sjackman that you could want the oriented subgraph, and @lh3’s point that we could also always induce the reverse-complement graph, which is what I had been thinking of. I agree with @lh3 that we need to have a real use case in mind. We should look at @pb-jchin’s requirement for subgraphs,
which is the one that I am aware of. I think he currently just lists segments and edges, but maybe there is an implicit orientation somehow, which we should make
sure is well-defined and covered. I have copied Jason directly because I don’t know whether he is watching this thread (I am now).

Richard

from gfa-spec.

sjackman commented on September 24, 2024

I'd be happy saying that the orientation is optional, so that
U S1 X+,A+,B-,Y+ is possible, and
U S2 X,A,B,Y is shorthand for
U S2 X+,X-,A+,A-,B+,B-,Y+,Y-

from gfa-spec.

lh3 commented on September 24, 2024

Talking about bubbles, here is something similar to FASTG. Note that I am not recommending this in the spec. Just to show why bubble subgraphs are special.

L X + A +
L X + B -
L A + Y +
L B - Y +
B bid X+ Y+ A+,B-
...
L U + bid +
L bid + V +

Then bubble segment "bid" can be treated as a normal segment. For most operations, we can ignore the internal topology of the "bid" bubble segment. This is kind of nice in one way, but may lead to complications in other ways. In addition, not all subgraphs are bubbles.

from gfa-spec.

sjackman commented on September 24, 2024

Pall @pmelsted is going to transfer the repository to the new GFA-spec organization. I'll announce (on Twitter) the final release of GFA 1.0 in preparation for GFA 2.0. Give the GFA 1.0 spec a quick scan if you like to see if there's any minor edits / clarifications you'd like to make before the release.

from gfa-spec.

sjackman commented on September 24, 2024

@lh3 What's the purpose of the RC tag of the containment C record? ~~It doesn't really make sense to me, and I was thinking of removing it.~~

Edit: I think I understand now. Just as RC represents the number of reads supporting a link L record, that segments S1 and S2 overlap, it likewise represents the number of reads supporting the C record, that S2 is contained in S1. It could be the number of reads spanning the segment S2 that place S2 inside S1.

from gfa-spec.

iminkin commented on September 24, 2024

Give the GFA 1.0 spec a quick scan if you like to see if there's any minor edits / clarifications you'd like to make before the release.

Is it released?

from gfa-spec.

sjackman commented on September 24, 2024

Yep, GFA 1.0 is released. We can still merge in minor edits / clarifications if you have some to contribute.

from gfa-spec.

pb-jchin commented on September 24, 2024

@sjackman

An oriented subgraph
S1 = (X+, A+, B-, Y+)
naturally has a complementary subgraph
complement(S1) = (X-, A-, B+, Y-)
just as the vertices and edges of a SSG have complements.

should complement(S1) = (Y-, B+, A-, X-)?

from gfa-spec.

sjackman commented on September 24, 2024

Yes, if S1 is an ordered and oriented subgraph, which I call a path. I should have been more precise. Sorry for the confusion. I use the term subgraph to mean an unordered subgraph, either a set of edges (a subgraph), or the subgraph induced by a set of vertices (a vertex-induced subgraph).

from gfa-spec.

lh3 commented on September 24, 2024

David's approach is similar in that in the final assembly graph, which is also an SSG, long-range information is broken into much smaller segments. However, it seems that David is not storing paths. He traverses the graph on the fly to emit long FASTA sequences in different styles (e.g. phased blocks only; long contigs with occasional phasing switch errors; etc). I guess he can do this efficiently because the nested bubble structure of FASTG greatly simplifies such traversal. (PS: @ekg, have you tried vg on 10X's SuperNova graph? The graph can be converted to GFA, but it may take some efforts.)

I thought SuperNova was using FASTG, until I saw its output today. The SuperNova output is not FASTG. It is a standard FASTA (no stuffed bubbles and the complexity caused by them) with neighbors and paths encoded in the FASTA header. The format is described here. It is conceptually close, though not identical, to the fermi graph format and can be easily converted to GFA1.

from gfa-spec.

stale commented on September 24, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

from gfa-spec.

Comments (79)

Segment S record

Alignment

GFA

Graph

Graphviz

Paths

Table of Contents

Introduction

Terminologies

GFA: Mandatory Fields

Segment line

Link line

Gap line

Match line

TODO

Edge E record

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org

Segment `S` record

Edge `E` record