Graphical Fragment Assembly (GFA) Format Specification

Home Page: http://gfa-spec.github.io/GFA-spec/

gfa-spec's Introduction

GFA: Graphical Fragment Assembly (GFA) Format Specification

We are developing the specification of the Graphical Fragment Assembly (GFA) format. Your contribution is welcome. Please open up issues or submit pull requests.

GFA 2.0 is at GFA2.md
GFA 1.0 is at GFA1.md
GFA 1.1 is at GFA1.md#gfa-11

Implementations

GFA 2

GFA 1

GFA 1.1

Resources

Examples of sequence overlap graphs (assembly graphs) in a variety of formats

GFA 2.0: Graphical Fragment Assembly (GFA2) Format Specification 2.0

Jason Chin, Richard Durbin, and myself (Gene Myers) found ourselves together at a workshop meeting in Dagstuhl, Germany, and hammered out an initial proposal for an assembly format. We started with GFA 1 and proceeded to build a more comprehensive design around it. After extensive revision and discussion on GitHub with the GFA group, including Shaun Jackman, Heng Li, and Giorgio Gonnella, we arrived at GFA 2.0. The standard is an evolving effort, and your contribution is welcome. Please open up issues or submit pull requests.

The basic reason for having a standard format is that we find that in general, different development teams build assemblers, visualizers, and editors because of the complexity and distinct nature of the three tasks. While these tools should certainly use tailored encodings internally for efficiency, the nexus between the three efforts benefits from a standard encoding format that would make them all interoperable.

GFA 1.0

GFA 1 was first suggested in a blog post by Heng Li (@lh3) and further developed in a second post.

GFA 1.1

W-lines were suggested by Heng Li (@lh3) as an extension to GFA 1 for representing haplotype information in pangenome graphs.

gfa-spec's Issues

Strict ordering in GFA (?)

I've been working on a C++ parser for GFA to replace the old parser in vg, and was wondering whether there's an expected order for S/L/P/C/(x/a) lines in GFA.

Currently I print things so that the source sequence of a link or containment always appears immediately before links/containments coming from that source. Sequences are sorted alphanumerically by their name. I've run into a few GFA files that don't follow either convention though, and to be honest these orderings are mostly just so that my output matches the output of the previous parser in vg.

I understand that the lines are independent and we could place the P/L/C lines for a given S line anywhere in the file, but if there's a different convention I'd like to go ahead and follow it.

Start/end positions in paths

Should GFA paths include a way to specify the start and end positions? Perhaps through the use of optional tags? I'm thinking something like ST to give the start position in the first segment and EN to give the end position in the last segment.

For example:

H   VN:Z:1.0
S   1   AGCGTA
S   2   TAACAG
L   1   +   2   +   0M
P   A   1+,2+   0M
P   B   1+,2+   0M  ST:i:4  EN:i:3
P   C   1+      ST:i:5
P   D   2+      ST:i:2  EN:i:5

In this case, path A's sequence is AGCGTATAACAG, which is the entirety of segment 1 followed by the entirety of segment 2. Path B's sequence is GTATAA, starting at position 4 in segment 1 and ending at position 3 in segment 2.

This would also allow for specifying parts of a single segment. Path C's sequence is TA and path D's sequence is AACA.

In those examples I went with 1-based indexing with an inclusive end position. This is in contrast to something like Python which uses 0-based indexing and exclusive range ends. Do people agree with 1-based inclusive ranges?
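
The example above can be checked with a minimal sketch. It assumes the proposed (not standardized) ST/EN tags with 1-based inclusive coordinates, and it only handles zero-length overlaps (0M), so segments are simply concatenated. The function names are hypothetical.

```python
# Sketch only: ST/EN are the tags proposed in this issue, not part of the GFA spec.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def oriented(seq, orient):
    """Return the segment sequence as read in the given orientation."""
    return seq if orient == "+" else seq.translate(COMPLEMENT)[::-1]

def path_sequence(segments, steps, st=None, en=None):
    """Spell a path over zero-overlap links.

    segments: dict of segment name -> sequence
    steps:    list of (segment name, orientation) tuples
    st, en:   optional 1-based inclusive start/end positions
    """
    parts = [oriented(segments[name], orient) for name, orient in steps]
    first_start = st - 1 if st is not None else 0        # convert to 0-based
    last_end = en if en is not None else len(parts[-1])  # inclusive end -> exclusive
    if len(parts) == 1:
        return parts[0][first_start:last_end]
    parts[0] = parts[0][first_start:]
    parts[-1] = parts[-1][:last_end]
    return "".join(parts)

segments = {"1": "AGCGTA", "2": "TAACAG"}
print(path_sequence(segments, [("1", "+"), ("2", "+")]))              # path A
print(path_sequence(segments, [("1", "+"), ("2", "+")], st=4, en=3))  # path B
print(path_sequence(segments, [("1", "+")], st=5))                    # path C
print(path_sequence(segments, [("2", "+")], st=2, en=5))              # path D
```

Switching to 0-based half-open coordinates would only change the two lines that compute `first_start` and `last_end`.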

And it's a bit weird that the paths with only one segment are lacking any CIGAR strings. Since the CIGAR list field is required, I put two tabs between the segment list and the tags.

Thoughts?

alignments to graphs

In vg we use GFA for input and output. It's completely lossless with respect to the graph model, and we need only S, L, and P namespaces to represent the graph. However, there is no textual analog to SAM provided in the GFA format space. We do have cigars applied to edge overlaps and the fragment records in GFAv2. But these seem to be limited to the mapping between a read and a single sequence-containing node in the graph.

We could use Paths in GFAv1 to specify alignments, but there is no standard place to add sequences to these (which is helpful for various processing operations), and it is not clear whether an extended CIGAR can be used to reproduce the sequence of the Path from the node sequences in the graph. Are others using Paths to represent alignments?

Alignments are useful in resequencing, when we consider the graph as a reference system. But there are other, more fundamental uses that should be considered. We can build graphs progressively with the operations align and edit, which map new sequences into the graph and update the graph to include them. Tools that implement these functions can be chained together to make custom assembly pipelines. We could do all of this within a coherent data model in GFA, provided it had a SAM- or PAF-like alignment schema.

Variations against a graph can also be represented as alignments, and genotypes represented as collections of these at an overlapping bubble (or other generalized site-like entity in the graph). Standard models for variants, genotypes, and bubbles would be a nice thing as well, but that's another conversation.

[GFA1] Representing paths with gaps (e.g. scaffolds)

Sorry if this issue has already been discussed somewhere, but is there a good way of representing a path corresponding to a "scaffold" within GFA 1?
Since Paths can only go through linked segments, we currently break a scaffold into subpaths and add suffixes to their names.
But we do not like this solution. We could probably use the Walks extension suggested by @ekg here, but I am not sure whether this is within its intended scope or whether those records will be adopted by the community.

Meaning should not depend on the order of records

I propose to strike the text

U/O-lines with the same name are considered to be concatenated together in the order in which they appear, and

See its context here under the heading Group.

The meaning of a GFA file should not depend on the order of its records. It should be possible to sort or shuffle a GFA file without affecting the meaning of the GFA file. Imagine if the meaning of a SAM file depended on the order of the alignments in the SAM file. Sorting by read name or target position would not be possible.

Consider the following example of O P1 2+ 1+ split across two lines:

O P1 2+
O P1 1+

after a UNIX sort it becomes

O P1 1+
O P1 2+

which has changed the meaning of this path P1 from 2+ 1+ to 1+ 2+.
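
A minimal sketch of the problem, assuming whitespace-separated O-lines and the concatenate-in-file-order rule quoted above (the helper name is hypothetical):

```python
def group_members(lines, name):
    """Concatenate the members of all O-lines named `name`, in file order."""
    members = []
    for line in lines:
        fields = line.split()
        if fields[0] == "O" and fields[1] == name:
            members.extend(fields[2:])
    return members

original = ["O P1 2+", "O P1 1+"]
print(group_members(original, "P1"))          # ['2+', '1+']
print(group_members(sorted(original), "P1"))  # ['1+', '2+']
```

The same records yield a different path once the file is sorted, which is exactly the order dependence this proposal wants to remove.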

Header: same tag on multiple lines?

Hi! I have written a GFA implementation in Ruby (RGFA https://github.com/ggonnella/rgfa). During that process there were some points which were not so clear to me, so I will try to write them down into issues here.

The first thing I found out to be a bit unclear regards the header. The specification mentions the header using plural phrasing, namely "Header lines start with H". So I suppose, one can add multiple header lines. In some examples in the discussions here, there are multiple header lines.

Having one or multiple header lines is not very different, if each header line has one or more tags which are not shared by any other header line. But now the question is: should we allow different header lines to have the same tag?

In other words, should the following be valid GFA?

H VN:Z:1.0
H co:Z:this is my comment tag
H co:Z:this is a further comment tag

The way I handle this in my library currently is to allow the same tag on multiple H lines, and to handle the result as a special case of an array.
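
The following is a rough sketch of that idea, not RGFA's actual code: it collects every value of a tag into a list, so a tag repeated across H lines becomes a multi-element array. It assumes tab-separated fields (the example above is shown with spaces for readability), and the function name is hypothetical.

```python
from collections import defaultdict

def parse_headers(lines):
    """Collect header tags; a tag seen on several H lines keeps all its values."""
    tags = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "H":
            continue
        for field in fields[1:]:
            # Tags have the form NAME:TYPE:VALUE; the value may contain ':'.
            name, _type, value = field.split(":", 2)
            tags[name].append(value)
    return dict(tags)

headers = parse_headers([
    "H\tVN:Z:1.0",
    "H\tco:Z:this is my comment tag",
    "H\tco:Z:this is a further comment tag",
])
print(headers["VN"])  # ['1.0']
print(headers["co"])  # both comment values, in file order
```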

[error]

[issue was posted in the wrong repo, sorry]

[GFA2] Defining edges/links between two segments

In GFA1, we are using CIGAR to describe overlaps between two segments. It was my idea initially, and I admit it is not a good one. @thegenemyers proposed to replace Link and Containment lines with a single Edge line:

E <sid1> [+-] <sid2> [+-] <beg1> <end1> <beg2> <end2>

However, I have some concerns. I will take the "dovetail" case in @thegenemyers' figure as an example. For simplicity, I assume beg1=900, end1=1000, beg2=0 and end2=95. The most straightforward way to describe this edge is: E sid1 + sid2 + 900 1000 0 95. I believe the correct complement of this edge should be (from the tail of sid2 to the tail of sid1): E sid2 - sid1 - 95 0 1000 900. Nonetheless, it also seems legitimate to describe the edge as: E sid2 + sid1 + 0 95 900 1000. When we ignore the coordinates and draw the figure, this will become a problem.

I think this problem is caused by not distinguishing clipped sequences and retained sequences. Given a contig sid1 that has an overlap in region [beg1,end1), we need to know whether [0,beg1) is clipped or [end1,|sid1|) is clipped. Here are two alternative proposals. The first one:

E  sid1  [+-]  sid2  [+-]  lenOvlp1  lenClip1  lenOvlp2 lenClip2

where lenOvlp1 is the length of sid1 in the overlap, lenClip1 is the length that should be clipped out (i.e. not considered in the assembly); lenOvlp2 and lenClip2 are similar. Then the dovetail case is: E sid1 + sid2 + 100 0 95 0, whereas E sid2 + sid1 + 95 0 100 0 represents a distinct overlap (end of sid2 overlapping the start of sid1). This proposal also tells us whether the overlap reaches the end of a read/contig without requiring us to know the read/contig length from the S lines.

The second alternative is a simplified/lossy version of the above:

E  sid1  [+-]  sid2  [+-]  lenOvlp1+lenClip1  lenOvlp2+lenClip2

where we just use one length for each segment. The clipping lengths are written to tags if needed.

What do you think?

Copy text verbatim from Heng Li's blog posts

The text can be edited after a bulk import. I'd like to import the unedited text so that the edits are retained in the git history.

Path support

In vg I am using the P namespace to define paths. See discussion at lh3/gfatools#2 for more details.

This is absolutely essential for working with reference graphs.

IDs namespace in GFA1

The namespace of IDs in GFA2 is clearly identified: S,E,G,U,O line identifiers all belong to the same namespace, while the external sequence identifiers in F lines do not.

However, this problem is not explicitly addressed in GFA1. In most examples I found, the S and P have different IDs.

My proposal is to add a statement to the GFA1 specification, and declare that P and S names (as well as the optional IDs in L and C lines, if #55 is accepted) belong to the same namespace. This is what I assumed in gfapy, I think it is reasonable and helps interconversion to/from GFA2. Would this break any existing implementation?

GFA version 2 with branches, external ids, and non-mandatory CIGAR

Hello all,

I am at Schloss Dagstuhl with Gene Myers, Jason Chin, Adam Phillippy and others. We would like to propose a major revision to GFA that addresses a number of issues raised in particular by long reads with high error rates from PacBio and ONT, on the basis that Gene and Jason and others would adopt the result for their pipelines.

The key issues addressed are:

move CIGAR strings from required fields to CG:Z: optional fields, replacing them by required lengths in the two segments in the Link lines and equivalents elsewhere. CIGAR is not optimal or necessary for long single molecule technology reads.
require a length for Segment fields, before the (optional) string for the sequence
introduce a new optional Piece record, that says how Segments are made from Pieces with external ids. e.g. how reads align to a unitig, or how unitigs align to later longer contigs/scaffolds. This sorts traceability through an assembly process.
support branches where a Segment end inserts into the middle of another Segment. This supports alternates of bubbles, so allowing long primary called contigs with variants. As in ALTs on the human reference, or the results of bubble detection and primary sequence choice to achieve long contiguous primary assembly contigs. We do this by extending the C (Contain) syntax.

Also a number of other standard tags are proposed, both in the header and elsewhere, to facilitate parsing (e.g. number of records of each type specified in the header to enable preallocation) and support for alternate representation of alignments (trace points - see Gene's Dazzler blog).

Are github issues the right way to discuss these proposals? I don't know who monitors this site?
Gene has a document that we have jointly worked on, which will be ready for posting tomorrow I think. I could also create a pull request on the current spec, but am not sure that is the right way to go.

Richard

README URLs need formatting

The URLs at the bottom of the current README, i.e.

GFA 1 was first suggested in a blog post by Heng Li (@lh3) and further developed in a second post.

are currently broken. The hyperlinks send users to

http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/
http://lh3.github.io/2014/07/23/first-update-on-gfa/

This should be changed to

http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format
http://lh3.github.io/2014/07/23/first-update-on-gfa

Thanks

No version in the spec

The current version of the spec is not explicitly stated. I think it should be added in a visible place for both parser and generator developers.

[GFA1] Regexp for Path overlaps

Still with the intention of finalizing the GFA 1.0 specification before moving to GFA 2.0, I noticed that the Regexp for Overlaps in the P lines is incomplete: it covers only a single CIGAR and not the entire content of the field, unlike all other Regexps in the specification. Either we state this in the text or, better, we fix the Regexp.

If I get it right, valid values are a single *, multiple comma-separated * or CIGARs, and also comma-separated combinations of * and CIGARs (necessary e.g. if some of the underlying links have a * overlap and some have a CIGAR).
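
As an illustration only (this is not the regexp from the pull request), a whole-field pattern for that description could look like the sketch below. The CIGAR operator set `MIDNSHPX=` is an assumption borrowed from SAM-style CIGARs; the spec may allow a narrower set.

```python
import re

# Assumed building blocks: one CIGAR, and one list item that is '*' or a CIGAR.
CIGAR = r"(?:[0-9]+[MIDNSHPX=])+"
ITEM = rf"(?:\*|{CIGAR})"
PATH_OVERLAPS = re.compile(rf"{ITEM}(?:,{ITEM})*")

def is_valid_overlaps(field):
    """True if the whole field is a comma-separated list of '*' or CIGARs."""
    return PATH_OVERLAPS.fullmatch(field) is not None

for field in ["*", "*,*", "4M,5M", "*,5M", "4M,", "4"]:
    print(field, is_valid_overlaps(field))
```

Using `fullmatch` makes the pattern describe the entire field, which is the property the other regexps in the specification already have.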

I have written a Pull Request #46 in which I fix this, but please double check the combined regexp again.

gfalint: A syntax validator implemented with lex/yacc

I've created a program gfalint that uses a formal grammar of GFA implemented in lex and yacc to check the syntax validity of a GFA file, both GFA 1 and GFA 2.
See https://github.com/sjackman/gfalint
The grammar is at https://github.com/sjackman/gfalint/blob/master/gfay.y#L52
The regular expressions of the tokens are at https://github.com/sjackman/gfalint/blob/master/gfal.l#L32

Please try it out on your GFA files! I'd like to hear whether it reports valid/invalid syntax correctly on your GFA files.

[GFA2] Typos and wrong example

@stefan-kurtz recently noticed that in the GFA2 specification, the example in the "Edge" section is wrong, as it uses an "S" record instead of "E", as it should be. I fixed this, as well as four typos I found in the spec in a new pull request (#63).

Specify different sequences for two complementary strands?

For a given segment in a graph, there are two corresponding sequences, the forward and the reverse complement. As I understand GFA, only the forward sequence is specified and the reverse complement sequence could be deduced if needed. Is that right?

The reason I ask is that in Velvet graphs, the forward and reverse sequences of a segment are NOT exact complements - they are shifted. I tried to illustrate this on the 'Assembler differences' page of the Bandage wiki: https://github.com/rrwick/Bandage/wiki/Assembler-differences

I couldn't see a way for GFA to support this kind of arrangement. So here's my thought for an addition: have an optional field in segment lines of OR (for orientation) which specifies the strand. E.g.:

S   segment_name    CTGATTG OR:Z:+  LN:i:7
S   segment_name    CAGTCTA OR:Z:-  LN:i:7

With this, you could have two segment lines with the same name if and only if one has a '+' orientation and the other has a '-', and then you could manually specify the sequences for each. Thoughts?

How to extract assembly from gfa file?

I have a simple question:

After I get a gfa file, how can I get the assembly of the genome I want?

Someone said all segment sequences in the S lines are my assembly result. Is it correct?

Paths in GFA v2

Hi everyone, if people aren't yet sick of discussing the finer points of GFA implementation, I've got another one! I like the concept of paths in an assembly graph and use them a lot in Bandage, so I'm keen to have a solid definition of the paths in GFA. I've got three points to discuss, in order of simple-to-hard:

1 - Start/end positions
I brought this one up in a previous issue. My suggestion was to have two optional tags in a path line: ST for the starting position in the first segment of the path, and EN for the ending position in the last segment. People seem to prefer zero-based, half-open indexes (closed start, open end), as do I.

So does anybody have objections or suggestions to this addition?

2 - CIGAR-free paths
If the new implementation of links is going to have CIGARs as an optional tag, then the current GFA path syntax (contains a comma-delimited list of CIGARs) will need to be revised.

Can we just drop the CIGAR part altogether? If there is only one link connecting any two segments, then it doesn't seem necessary to specify anything but the sequence of segments. So I guess that's the big question here: is it valid to have more than one link connecting the same two segments? If we allow this, then the path record is going to have to specify which link is followed in a path.

And on this topic, it should be possible to specify that a path is circular, as was discussed in this issue. I'd be happy to go with the optional flag discussed there: Z:FL:circular.

3 - inexact matches in overlaps
I saved the tough one for last: If a link is not an exact match, how can we determine what sequence the path should take?

An example in the new GFA v2 syntax:

S   A   5   CATGC
S   B   5   TACTT
L   A   +   B   +   3   3
P   path    A+,B+

The sequences overlap like this:

CATGC
  TACTT

So should the path sequence be CATGCTT or CATACTT? Here are a few brainstormed solutions:

Require sequences in path lines: P path A+,B+ CATGCTT
Require comma-delimited overlap sequences in path lines: P path A+,B+ TAC
Instead of commas in the path sequence, use two different characters to indicate which segment provides the overlap sequence: P path A+<B+ vs P path A+>B+
Use a CIGAR-like string for each link which defines exactly how to traverse the two options. Instead of CIGAR's M it could use A for 'take the base from the first sequence' and B for 'take the base from the second sequence', as well as 'I' and 'D' as used in normal CIGARs: P path A+,B+ 1A1B1A
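
To make the ambiguity concrete, here is a small sketch that spells the two candidate path sequences for a single inexact overlap. It assumes both segments are in forward orientation and that the overlap has the same length in both, as in the example above; the function name is hypothetical.

```python
def path_candidates(first, second, overlap):
    """Return (overlap taken from first segment, overlap taken from second segment)."""
    take_first = first + second[overlap:]
    take_second = first[:len(first) - overlap] + second
    return take_first, take_second

from_a, from_b = path_candidates("CATGC", "TACTT", 3)
print(from_a)  # CATGCTT
print(from_b)  # CATACTT
```

Each of the brainstormed options is a different way of recording which of these two spellings (or which per-base mixture of them) the path takes.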

I don't particularly love any of those solutions, so I'm hoping somebody has a better idea :)

CIGARs, reference and query

CIGAR operations such as D and I require a sequence to be the reference, another to be the query.
The specification says that one can use CIGARs to express alignments (e.g. in L lines of GFA1, in E lines of GFA2). But it does not clearly state which segment (or segment overlap) is the reference and which segment is the query. I think we should add this to the spec.

Regex Notation

The Segment line 'Name' regex in GFA-spec.md doesn't make sense. I hope I'm reading it correctly.

[!-)+-<>-~][!-~]* matches "!~" but not "1", "2" etc. which are used in the example at the bottom of the document.

I think the author meant something like [^-)+-<>-~][^-~]*, where ^ at the beginning of a set is used to mean the negation of the set ie. may not start with the special characters given, may not contain either - or ~.

This is the regex syntax used by grep and perl. More info here: http://www.regular-expressions.info/charclass.html

Discussion for storing subsets of links in GFA

Here are some thoughts about "piece" and "path". In my experience, it is useful to have a way to store a set of "links". For example, I would like to be able to run queries like "give me all links that are 'associated' with a contig". The set of "links" associated with a contig may be more than a path. For example, some spurs or bubbles removed during graph-to-contig layout may be thrown away, but they would be useful for diploid assembly or for assessing the remaining ambiguities related to the contig. It would be useful to have some way to store such subsets, although we can discuss whether the information about such sets needs to be stored in the same GFA file.

To track the discussion related to this topic, I am opening a new issue.

Add SPAdes to GFA1 section

SPAdes generates an assembly graph in GFA format, along with the paths that scaffolds traverse over it.

Spec in Markdown

Would you be okay with writing the spec in Markdown rather than LaTeX? I much prefer writing in Markdown than LaTeX. I think we're at the point where more people are comfortable with Markdown than LaTeX, so it's more inclusive to the community. GitHub renders Markdown pretty. It's easy to convert Markdown to LaTeX using pandoc. I'm okay if the answer is no, but I thought I'd ask.

Criteria for merging a pull request

I feel that we should adopt some kind of criteria for accepting merges to the spec. Off the top of my head, I suggest

At least three 👍 to merge
Majority rules. Count 👍 and 👎. One vote per person. Anyone can vote.
Wait at least three days for votes to come in, and at least one day since the most recent vote.

GFA2: clarify in the grammar that '*' in alternations is a terminal

The text of the GFA2 grammar includes (* | <var:int>), * | [!-~]+ and * | <trace> | <CIGAR>, along with the explanatory/confusing note expounding the meanings of a set of symbols: "* zero-or-more".

This confounds the meaning of "*" in the alternations as a production terminal. Maybe there should be a way to mark the difference between "*" and *, as exists in EBNF (or maybe just use EBNF, then the need for the explanatory note goes away).

Has any implementation of GFA implemented the containment C record?

See #33 (comment)

Comment Lines

I propose that lines beginning with # should be comment lines which are ignored.

Comments would be useful when debugging, demonstrating examples and teaching.

They could also be used as a hacky solution to add user defined annotations into a file in the same way that javadoc / doxygen use source code comments.
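
A reader that honours this proposal needs only one extra check. The sketch below assumes tab-separated records and treats `#` as a comment marker only at the start of a line; the function name is hypothetical.

```python
def records(lines):
    """Yield the fields of each GFA record, skipping blank and '#' comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield line.split("\t")

example = [
    "# segments of the toy graph",
    "S\t1\tACGT",
    "S\t2\tGGCA",
    "# the only link",
    "L\t1\t+\t2\t+\t0M",
]
for fields in records(example):
    print(fields)
```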

How to represent Inversions

This was raised in #3: how can the format represent inversions, and is this something we want?

Since there are 3 main use cases for GFA (assembly graphs, long reads and variation graphs), it should be noted that inversions are only explicitly needed in variation graphs.

An assembler would naturally construct two contigs for an inverted segment and long reads would be unaffected.

What is annoying is that the inverted segments are not complemented, so this would mean we would need to come up with a new symbol or mechanism to denote this.

[GFA2] Why is the distance in a Gap line 'pos' instead of 'int'?

Hi!
Could you clarify why the distance field in a G-line should be of type pos rather than int according to the Grammar section of the specification? Is that a mistake?
Thanks!

Document all implementations of GFA

ABySS 1.9.0

Replace "+/-" with "B/E"?

It has been mentioned several times that +/- is sometimes confusing. I think so, too, especially after proposing side graph to GA4GH when I had the same issue. In my view, it is cleaner to just describe the relationship between sequence ends on the L line.

For example, say we have a GFA graph in the current format:

S   contig1  *
S   contig2  *
L   contig1  +  contig2  +  0M

This means the forward strand of contig2 follows the forward strand of contig1. The proposed new format is:

S   contig1  *
S   contig2  *
J   contig1  E  contig2  B  0M

This means the start of contig2 links the end of contig1. In this example, the E/B notation doesn't seem more advantageous, but in other orientations, E/B is cleaner. For example, instead of figuring out the order and orientation of joins for these cases:

L   contig1  -  contig2  +
L   contig3  +  contig4  -
L   contig5  -  contig6  -

we can just think about how contig ends are connected to each other.

J   contig1  B  contig2  B
J   contig3  E  contig4  E
J   contig5  B  contig6  E

Notably, the two ends on such a J line are unordered. The first line is the same as:

J   contig2  B  contig1  B

There is actually also such a symmetry in the current GFA format: L A - B + is the same as L B - A + (we flip both orientations when switching the order of the two sequences on an L line). The proposed new format just makes this more obvious.
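
The mapping between the two notations can be sketched as follows. This is only an illustration of the proposal: J lines are not part of any released GFA version, and the function name and the use of a `frozenset` to model an unordered join are my own choices.

```python
def l_to_j(from_seg, from_orient, to_seg, to_orient):
    """Translate an L line into the unordered pair of segment ends it joins.

    Leaving a segment in '+' orientation exits through its end (E), and in '-'
    through its beginning (B). Entering in '+' arrives at B, and in '-' at E.
    """
    from_end = "E" if from_orient == "+" else "B"
    to_end = "B" if to_orient == "+" else "E"
    # A frozenset models the fact that the two ends are unordered.
    return frozenset([(from_seg, from_end), (to_seg, to_end)])

print(sorted(l_to_j("contig1", "-", "contig2", "+")))  # contig1 B, contig2 B
print(sorted(l_to_j("contig3", "+", "contig4", "-")))  # contig3 E, contig4 E
print(sorted(l_to_j("contig5", "-", "contig6", "-")))  # contig5 B, contig6 E
# The two spellings of the same link map to the same join:
print(l_to_j("A", "-", "B", "+") == l_to_j("B", "-", "A", "+"))  # True
```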

The new and the old formats can coexist in theory. A GFA parser can easily deal with both formats with trivial changes. I think going forward, we should advocate J lines and deprecate L lines.

what's the meaning of tag "a":

for example:
S utg000001l * LN:i:41537
a utg000001l 0 m54136_170926_195115/39518486/15846_38527:2064-21507 + 3960
a utg000001l 3960 m54061_170914_081333/37748951/9612_35354:3-15698 + 7

What does each column represent?

[GFA2] About "dual" edges (was disallowing dual edges)

In GFA, an edge can be described in two ways, e.g. L sid1 + sid2 + vs. L sid2 - sid1 -. We call these dual edges. In #33, @pb-jchin requested that we define whether both edges may be present, as the current GFA1 spec is not very clear on this point. In practice, I believe most existing GFA writers only write one of the dual edges, so we may make it formal: GFA2 only keeps one of the dual edges. Thus L sid1 + sid2 + and L sid2 - sid1 - are considered to be two separate (multiple) edges in GFA2 (though a particular implementation may choose to collapse multiple edges into one edge).
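
For an implementation that does choose to collapse dual edges, one possible approach is sketched below: compute the dual by swapping the segments and flipping both orientations, then keep the lexicographically smaller of the two as a canonical key. The canonical-form rule and the function names are assumptions for illustration, not something the spec prescribes.

```python
FLIP = {"+": "-", "-": "+"}

def dual(link):
    """Return the dual of a link given as (seg1, orient1, seg2, orient2)."""
    seg1, orient1, seg2, orient2 = link
    return (seg2, FLIP[orient2], seg1, FLIP[orient1])

def canonical(link):
    """Pick one representative of a link and its dual (smaller tuple wins)."""
    return min(link, dual(link))

link = ("sid1", "+", "sid2", "+")
print(dual(link))                                  # ('sid2', '-', 'sid1', '-')
print(canonical(link) == canonical(dual(link)))    # True
```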

Describe Link with GFA2 Edges

I'm trying to understand Edge lines that describe GFA1 Links.
While reading the following sentence:

The GFA2 concept of edge generalizes the link and containment lines of GFA. For example a GFA edge which encodes what is called a dovetail overlap (because two ends overlap) is a GFA2 edge where either beg1 = 0 or end1 = x$ and either beg2 = 0 or end2 = y$.

I realized that situations like:
E l12 1+ 2+ 0 3 0 3 *
would be considered as dovetail overlaps. But I think that this situation represents a general
overlap (correctly described by a GFA2 Edge but impossible to be represented by a GFA1 Link).

Could the sentence above be reformulated as follows:

For example a GFA edge which encodes what is called a dovetail overlap (because two ends overlap) is a GFA2 edge where either beg1 = 0 and end2 = y$, or beg2 = 0 and end1 = x$.
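
The difference between the two wordings can be checked with a short sketch. It assumes both segments are in forward orientation, that positions are given as strings, and that a trailing `$` marks a position equal to the segment length, as in the GFA2 position syntax. The function names are hypothetical.

```python
def at_start(pos):
    """True if the position is the first base boundary of the segment."""
    return int(pos.rstrip("$")) == 0

def at_end(pos):
    """True if the position carries the '$' end-of-segment sentinel."""
    return pos.endswith("$")

def dovetail_as_written(beg1, end1, beg2, end2):
    # Wording in the spec sentence quoted in this issue.
    return (at_start(beg1) or at_end(end1)) and (at_start(beg2) or at_end(end2))

def dovetail_as_proposed(beg1, end1, beg2, end2):
    # Reformulation suggested in this issue.
    return (at_start(beg1) and at_end(end2)) or (at_start(beg2) and at_end(end1))

# E l12 1+ 2+ 0 3 0 3 * : both overlaps start at 0 and neither reaches an end.
print(dovetail_as_written("0", "3", "0", "3"))   # True
print(dovetail_as_proposed("0", "3", "0", "3"))  # False
# A suffix of segment 1 overlapping a prefix of segment 2.
print(dovetail_as_proposed("900", "1000$", "0", "95"))  # True
```

Edges involving a reverse-oriented segment would need their own treatment, which this sketch leaves out.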

[GFA1] C line description is ambiguous

In the GFA1 specification, the description of C lines is ambiguous. It says that "A containment line represents an overlap between two segments where one is contained in the other." However, nowhere is it actually stated which segment is contained in which; the two segments are called "from" and "to", which does not make clear which is the container and which is the contained segment. Until recently, I assumed that "from" is the container and "to" the contained segment (either from an old version of the spec or from examples, I suppose), but this is not actually stated in the text.

Example of the graph representation used in Falcon

Just for reference: the basic graph I use in the FALCON assembler (https://github.com/PacificBiosciences/FALCON/) is very straightforward. I decouple the graph information from the sequence information. The sequences used in the graph are just referenced from the text file sg_edges_list. Here is a simple example and a brief description:

$ head -5 sg_edges_list
000017363:B 000007817:E 000007817 10841 28901 10841 99.52 TR
000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR
000006813:B 000000681:E 000000681 7609 23795 7616 99.72 TR
000002258:E 000002505:B 000002505 5850 0 17215 99.62 TR
000013449:B 000012565:B 000012565 3317 0 20570 99.72 G

The first two columns indicate the in and out nodes of the edge. The node notation contains two fields separated by ":". The first field is the read identifier. The second field is either B or E: B is the 5' end of the read and E is the 3' end of the read. The next three fields indicate the sequence corresponding to the edge. In this example, the edge in the first line contains the sequence from read 000007817, bases [10841, 28901). If the second coordinate is smaller than the first one, the corresponding sequence is reverse complemented. The next two columns are the number of overlapping bases and the overlap identity. The final column is the classification. Currently, there are 4 different types: G, TR, R, and S. An edge with type "G" is used for the final string graph. "TR" means the edge is transitively reducible. "R" means the edge was removed during local repeat resolution, and "S" means the edge is likely to be a "spur", with only one end connected.

It won't be too hard to convert such information to @lh3's original proposal for GFA, which labels sequence ends rather than the segments themselves.
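
A minimal sketch of reading one such line, following the column description above. This is not FALCON's own code; the class and field names are made up for illustration.

```python
from typing import NamedTuple

class SgEdge(NamedTuple):
    in_read: str      # read id of the in node
    in_end: str       # 'B' (5' end) or 'E' (3' end)
    out_read: str     # read id of the out node
    out_end: str
    seq_read: str     # read that supplies the edge sequence
    seq_start: int
    seq_end: int
    overlap_len: int  # number of overlapping bases
    identity: float   # overlap identity, in percent
    edge_type: str    # 'G', 'TR', 'R' or 'S'

def parse_sg_edge(line):
    """Split one whitespace-separated sg_edges_list line into typed fields."""
    in_node, out_node, seq_read, start, end, ovl, idt, edge_type = line.split()
    in_read, in_end = in_node.split(":")
    out_read, out_end = out_node.split(":")
    return SgEdge(in_read, in_end, out_read, out_end, seq_read,
                  int(start), int(end), int(ovl), float(idt), edge_type)

def is_reverse_complemented(edge):
    """The edge sequence is reverse complemented when end < start."""
    return edge.seq_end < edge.seq_start

edge = parse_sg_edge("000015379:E 000004331:B 000004331 6891 0 18178 99.35 TR")
print(edge.seq_read, edge.seq_start, edge.seq_end, is_reverse_complemented(edge))
```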

Here is an example to compare the different graphs using three reads as an example:
https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg

Wrong regexp for segment sequence

The regexp of col 3 for segment line appears to be broader than intended. It currently reads \*|[A-Za-z=.]+, but I think the dot is unescaped by mistake, it should be \*|[A-Za-z=\.]+.

The future of the GFA format

CC @thegenemyers, @richarddurbin, @ekg, @rrwick, @sjackman, @jts, @pb-jchin, @skoren, @aphillippy, @MihaiPop, @ggonnella, @pmelsted, @edawson

The current status of GFA1

There were two general-purpose assembly formats: FASTG and GFA1. With David Jaffe et al's SuperNova assembler apparently moving away from FASTG, GFA1 is practically the only generic assembly format. I have written converters for ABySS, SGA, Velvet, Spades, SOAPdenovo and fermi short-read assemblers. @jts's SGA and @sjackman's ABySS natively support GFA1 output, I believe. @jts's fork of DALIGNER can emit GFA1. @pb-jchin has written a converter for FALCON. My miniasm and fermi-lite assemblers output GFA. I believe the vast majority of mainstream assemblers are compatible with GFA1, too. For tools working with variations, @ekg's vg supports GFA output and has an internal data representation conceptually equivalent to GFA1 (vg effectively implements a bidirected graph). SuperNova graph can be converted to GFA1 (not implemented yet). DISCOVAR outputs FASTG which can also be converted to GFA in principle (not implemented, either). As to tools consuming GFA, @rrwick's Bandage can visualize GFA graphs. I have written gfaview to perform graph transformation (e.g. transitive reduction, tip trimming and bubble popping) for long-read graphs. @sjackman has implemented similar transformations for short-read graphs in ABySS. There are already a few libraries in C++ and Ruby to read GFA1. In conclusion, GFA1 is getting used. It is fairly simple yet general enough for all the tools and applications mentioned above.

About GFA2

The necessity

GFA2 was proposed because GFA1 does not work when we choose a path at a fork to merge. This leaves an end-to-internal match, which can't be described by GFA1. A few other hypothetical use cases (e.g. alignment between two long haplotypes) have also been raised. So far, I am not sure which implementations output and, more importantly, consume end-to-internal or internal-to-internal matches (NB: containment is a special case of end-to-internal alignment, but it can be described with GFA1). I am happy to update this post if there are any.

The GFA2 graph representation

While GFA1 models a directed skew-symmetric graph that is topologically equivalent to bidirected graphs, overlap graphs and string graphs, GFA2 models an undirected multi-graph where mapping coordinates play a central role. They represent fundamentally different types of graphs. Although we can see GFA1 as a special case of GFA2, the way we interpret the graph is distinct. GFA2 will also have more complex syntax and implementations for what GFA1 is really good at.
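To make the skew-symmetry concrete, here is a minimal Python sketch. It assumes a link is held as a 4-tuple (from segment, from orientation, to segment, to orientation); the tuple layout and the `complement_link` name are illustrative and not part of any GFA library. Each GFA1 link implies a second edge that runs between the reverse complements in the opposite direction.

```python
FLIP = {"+": "-", "-": "+"}

def complement_link(link):
    """Return the edge implied by skew-symmetry for a GFA1-style link tuple."""
    from_seg, from_orient, to_seg, to_orient = link
    # Swap the endpoints and flip both orientations.
    return (to_seg, FLIP[to_orient], from_seg, FLIP[from_orient])

link = ("1", "+", "2", "-")    # corresponds to a line such as: L 1 + 2 - ...
print(complement_link(link))
```

Applying `complement_link` twice gives back the original tuple, which is why a file only needs to list one edge of each pair.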

My take

SAM/BAM is popular not only because it can store alignments, but more because it enables us to do things to the alignments that would be complex otherwise. Similarly, I see GFA as not just a storage format; it should be a format that helps our analysis. Due to the lack of clear downstream use cases and implementations of GFA2 (e.g. what information do we want to extract from GFA2, and how?), I am unable to evaluate the necessity of the added complexity, especially given that vg and SuperNova can already achieve part of the GFA2 goal with the GFA1 representation only. I am reluctant to add unproven features too early.

The future of GFA

As the creator of the initial GFA1, I do not think GFA2 is ready to replace GFA1. I foresee the coexistence of GFA1 and GFA2 for a period of time. During this period, developers have to make a choice between GFA1 and GFA2. We may re-evaluate the necessity of GFA2 yearly. I will be happy to phase out GFA1 if GFA2 is proven to be useful with concrete and practical applications. I understand the split is unfortunate, but this seems an unavoidable cost when we explore the unknowns.

The future of GFA1

The current GFA1 spec was modified from a blog post. I would like to replace it with something a little more formal like the one here, with further improvements of course.

I also want to ask developers on the CC list: how much do CIGAR on L-lines and the lack of segment length on S-lines hurt? Personally, I would really like to change the format, but if you all think you can live with these issues, I am OK with keeping them as they are. Once we reach a consensus on this issue, we will try our best to maintain the compatibility of GFA1 going forward. We may add new line types, but we won't break existing lines.

Change GitHub description

Hi, Pall. Would you please change the GitHub title to:
Graphical Fragment Assembly (GFA) Format Specification
It's more descriptive than the current GFA spec.

Pan-Genome Graphs

There is some recent work on using de Bruijn graphs for pan-genome analysis, e.g. see:

So far, there is no common format for such graphs. Does it make sense to generalize GFA to handle both pan-genome and assembly graphs?

Paths must be specified in one record

For clarity, I think it is necessary to specify paths in a single record. If we use node ids to do so, we'd want something like:

P    name    n1,n2,n4,n7,n8...

The problem is that it's not possible to differentiate cycles if paths are specified as I have suggested previously, describing one node/path relationship per line. Then the order of the records would be required to disambiguate things, and I'm not sure that's ideal for GFA.
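A small Python sketch of the cycle problem, under the assumption that the per-line alternative stores one (path, node) record per visit with no position field. The node names and the `membership_records` helper are made up for illustration.

```python
from collections import Counter

def membership_records(path_name, nodes):
    """One (path, node) record per visit, with no information about position."""
    return Counter((path_name, node) for node in nodes)

# Two different walks that pass through n1 twice.
walk_a = ["n1", "n2", "n1", "n3"]
walk_b = ["n1", "n3", "n1", "n2"]

# As unordered records the two walks are indistinguishable ...
print(membership_records("p", walk_a) == membership_records("p", walk_b))

# ... while a single record keeps the order in the field itself.
record = "P\tp\t" + ",".join(walk_a)
print(record.split("\t")[2].split(",") == walk_a)
```

Both comparisons print `True`: the unordered records collide, and the single-record form round-trips the walk without relying on the order of lines in the file.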

Add link attributes DI and DE to specify the estimated distance between two sequences

In #8 @pb-jchin wrote…

In reality, we only need the length of the overlap and some summary information, e.g., estimated identity.

I would like an optional field to describe an estimate of the amount of overlap/distance between two reads. I'd suggest the following:

DI:i is an estimate of the distance between the end of sequence 1 and the start of sequence 2. If DI is negative, it is estimated that the two sequences overlap. If DI is positive, it is estimated that there is a gap between them. DE:f is the estimated error of DI.

Such an attribute would be useful both for estimating the amount of overlap of two sequences, as well as representing the estimated distance between two reads that is based on paired-end reads linking the two sequences. ABySS uses the d and e attributes for this purpose in its GraphViz graph.
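As a rough sketch of how a reader might consume these attributes, the Python below parses optional TAG:TYPE:VALUE fields and interprets the sign of DI. DI and DE are only proposed here; the example line, the helper names, and the handling of DI equal to zero are assumptions made for illustration.

```python
def parse_tags(fields):
    """Turn optional TAG:TYPE:VALUE fields into a dict, converting i and f types."""
    converters = {"i": int, "f": float}
    tags = {}
    for field in fields:
        tag, type_char, value = field.split(":", 2)
        tags[tag] = converters.get(type_char, str)(value)
    return tags

def describe_distance(di):
    """Interpret the sign of DI as described in the proposal."""
    if di < 0:
        return f"overlap of about {-di} bp"
    if di > 0:
        return f"gap of about {di} bp"
    # Assumption: the proposal does not say what DI == 0 means.
    return "adjacent"

# Hypothetical link line carrying the proposed tags after the overlap column.
line = "L\t1\t+\t2\t+\t*\tDI:i:-12\tDE:f:3.5"
tags = parse_tags(line.split("\t")[6:])
print(describe_distance(tags["DI"]), "+/-", tags["DE"])
```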

Add JSON type

@noporpoise proposes in #7 (comment) that the last column of every record be JSON. JSON would replace the current TAG:TYPE:VALUE system.

Fieldnames in paths

In the current single-line path specification there is an inconsistency in the naming of some fields. The SegmentName field actually contains a list, so I think SegmentNames would be more appropriate (or OrientedSegmentNames, as the entries also contain an orientation).

Furthermore, CIGAR is also singular, and it names the format rather than the content, unlike the L lines that the CIGAR strings actually refer to. Therefore I would use overlap instead of CIGAR, and pluralize this too, so "Overlaps".

I have created a pull request which fixes these issues (#43).

Header line first?

As there might soon be different versions of GFA around, it would be useful to mandate that the header line with the VN tag be the first line in the file, so that a parser can immediately start parsing the file with the right specification.

This is almost always the case in examples given here, and the header is at the beginning in the proposed GFA2 whitepaper, but currently it is not mandatory for GFA1 to have it before other tags.
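If the header were required to come first, version detection could be as simple as the Python sketch below. The function name and the choice to return `None` when the first line is not a header with a VN tag are assumptions for illustration, not behaviour defined by the spec.

```python
def version_from_first_line(lines):
    """Return the VN value if the first line is a header that carries it, else None."""
    first = next(iter(lines), "")
    fields = first.rstrip("\n").split("\t")
    if fields[0] != "H":
        return None
    for field in fields[1:]:
        if field.startswith("VN:Z:"):
            return field[len("VN:Z:"):]
    return None

print(version_from_first_line(["H\tVN:Z:1.0", "S\t1\tACGT"]))
print(version_from_first_line(["S\t1\tACGT", "H\tVN:Z:1.0"]))
```

Without the ordering rule, the second call illustrates the problem: a parser has to scan the whole file, or guess a version, before it can interpret the first record.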

Path line syntax ambiguous ?

The path line specifies a comma-separated list of segment IDs, each followed immediately by an orientation symbol (+ or -). Since commas as well as + and - can appear in a segment ID, e.g. "S P1+,P2 acgt" defines P1+,P2 as a segment ID, I don't see how it is possible to unambiguously parse the path list. Please advise.
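The ambiguity can be reproduced with a few lines of Python. This sketch enumerates every way to split a path field into (name, orientation) steps given the set of declared segment names; the `parses` helper and the segment set are made up for illustration.

```python
def parses(path, names):
    """Return every split of `path` into (name, orientation) steps over known names."""
    results = []
    for name in sorted(names):
        for sign in "+-":
            step = name + sign
            if path == step:
                results.append([(name, sign)])
            elif path.startswith(step + ","):
                tail = path[len(step) + 1:]
                results.extend([(name, sign)] + rest for rest in parses(tail, names))
    return results

# Segments P1 and P2, plus a third segment whose ID is literally "P1+,P2".
names = {"P1", "P2", "P1+,P2"}
for parse in parses("P1+,P2+", names):
    print(parse)
```

The field `P1+,P2+` yields two readings: the two-step path over P1 and P2, and a one-step path over the segment named `P1+,P2`.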

Circular paths

I'm working on adding GFA path support to Bandage, and I'm wondering how to deal with the concept of circular paths. In bacterial genomics, the actual DNA is usually circular, so important graph paths (e.g. a plasmid sequence in an assembly graph) will also be circular. As I see it, the difference between a linear and a circular path is whether the overlap between the first and last node is considered.

So I propose that in a linear path:

The segment list does not end in a comma.
The list of CIGAR strings has one fewer part than the segment list.

In a circular path:

The segment list ends in a comma (to indicate the link between last and first segments).
The list of CIGAR strings has the same number of parts as the segment list.

Here's an example GFA to illustrate:

H   VN:Z:1.0
S   1   AGCGTA
S   2   TAACAG
L   1   +   2   +   2M
L   2   +   1   +   2M
P   linear  1+,2+   2M
P   circular    1+,2+,  2M,2M

So in this example, the linear path would have this sequence:
AGCGTAACAG

And the circular path would have this sequence:
AGCGTAAC
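The two sequences above can be checked with a short Python sketch of the proposed convention. It assumes forward-orientation steps and pure-match CIGARs such as 2M only; `spell_path` and `match_length` are illustrative names, not functions from Bandage or any GFA library.

```python
def match_length(cigar):
    """Length of a pure-match CIGAR such as '2M' (assumption: no other operations)."""
    if not cigar.endswith("M"):
        raise ValueError(f"only pure-match CIGARs are handled here: {cigar!r}")
    return int(cigar[:-1])

def spell_path(segments, segment_list, cigar_list):
    """Spell a path; a trailing comma in segment_list marks it as circular."""
    circular = segment_list.endswith(",")
    steps = segment_list.rstrip(",").split(",")
    overlaps = [match_length(c) for c in cigar_list.split(",")] if cigar_list else []
    expected = len(steps) if circular else len(steps) - 1
    if len(overlaps) != expected:
        raise ValueError("number of CIGAR strings does not match the path type")
    names = []
    for step in steps:
        if not step.endswith("+"):
            raise ValueError("this sketch only handles forward (+) steps")
        names.append(step[:-1])
    sequence = segments[names[0]]
    for name, overlap in zip(names[1:], overlaps):
        sequence += segments[name][overlap:]
    if circular:
        # Drop the bases that overlap the start of the first segment.
        sequence = sequence[:len(sequence) - overlaps[-1]]
    return sequence

segments = {"1": "AGCGTA", "2": "TAACAG"}
print(spell_path(segments, "1+,2+", "2M"))
print(spell_path(segments, "1+,2+,", "2M,2M"))
```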

Thoughts?

GFA1.0 Path line segment/orientation order is unclear

I've run into some GFA files that my parser cannot parse because the orientation comes before the segment name, and isn't specified at all if it's in the forward direction. This made me realize that the spec itself does not specify whether orientations come before or after segment names. It simply says:

SegmentNames String [!-)+-<>-~][!-~]*  A comma-separated list of segment names and orientations

"Comma-separated list" is ambiguous as to order. Can we clarify if this is required to be as in the example?

P	14	11+,12-,13+	4M,5M

And whether they must be present, regardless of orientation? Perhaps the description should become "A comma-separated list of concatenated segmentName / orientation pairs, one for each segment. Both elements are required."
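For reference, a strict reader following the suggested wording might look like the Python sketch below. It assumes the orientation is a mandatory trailing + or - and that segment names contain no commas; the function name is hypothetical.

```python
def parse_oriented_segments(field):
    """Split a path field such as '11+,12-,13+' into (name, orientation) pairs."""
    steps = []
    for item in field.split(","):
        # Require a non-empty name followed by exactly one trailing orientation.
        if len(item) < 2 or item[-1] not in "+-":
            raise ValueError(f"missing trailing orientation in {item!r}")
        steps.append((item[:-1], item[-1]))
    return steps

print(parse_oriented_segments("11+,12-,13+"))
```

Files that put the orientation first or omit it for forward segments, such as `11,-12,13`, would be rejected by this reader rather than silently misread.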

I am happy to make a PR once this has been discussed.
