Comments (24)
My recommendation is for GFA1, we test regex pattern /[+-],/
at an S-line. If we find a match, we throw an error, saying the current spec does not allow it. I think this treatment is working most of time in practice. I have seen commas in sequence names, but I am yet to see a /[+-],/
pattern – it is very rare.
from gfa-spec.
The only way I can see to separate segment/edge ids in a list is with spaces. All other printable characters are legitimate in ids.
So I advocate that the lists in paths or subgraphs are space separated. This is clearly distinguishable by a parser from a tab, but not so clear to
the human eye. So this list should be the last of the fixed fields, only followed by optional SAM tag:type:data fields, which are clearly distinctive.
I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).
Richard
from gfa-spec.
Richard,
My issue was for GFA 1.0 where there seems to be a problem, whereas
you seem to
be speaking to what to do for DAS/GFA 2.0. There I would proceed as you
suggest.
-- Gene
from gfa-spec.
I propose requiring each node traversal or mapping in the path to be on its own line.
Something like: P [node name] [path name] [rank in path] [orientation] [mapping description]
This is how vg produces GFA. It makes heavy use of paths and so this was really important to get right.
Here is an example in a mini graph.
H VN:Z:1.0
S 1 CAAA
P 1 x 1 + 4M
L 1 + 2 + 0M
S 2 TAAG
P 2 x 2 + 4M
L 2 + 4 + 0M
L 2 + 3 + 0M
S 3 G
P 3 x 3 + 1M
L 3 + 5 + 0M
L 3 + 6 + 0M
S 4 A
L 4 + 5 + 0M
L 4 + 6 + 0M
S 5 C
P 5 x 4 + 1M
L 5 + 7 + 0M
S 6 T
L 6 + 7 + 0M
S 7 TTG
P 7 x 5 + 3M
In addition to solving this issue, there are other benefits to this organization:
This format makes many command line operations on the GFA file easier, as it allows us to sort the graph by node names (in the second column) and visually inspect approximate subgraphs including the paths through them.
For example, I generated this by doing:
vg construct -r tiny/tiny.fa -v tiny/tiny.vcf.gz -m 5 | vg view - | head -22
If the path records come on one line then there is no available technique to subset the graph (with paths) on the command line.
Putting one path element per line also resolves the need for a custom path format. We just use the same tab delimited parsing machinery to read in the path.
This also allows each path element to have a CIGAR. This is not strictly necessary but it adds flexibility and makes the paths that we can store in GFA graphs nearly equivalent to those which we use for alignments.
It also makes it easier for us to add other annotations to the path elements, as each element is serialized with its own id. This doesn't matter so much for GFA but rather in RDF based serializations of assembly graphs, where the objective is to allow linking to specific components of the sequence graph and walks through it. In other words, if the path elements weren't first class entities, we would be unable to use them in semantic linkages.
NB: In vg, paths are composed of a series of mappings. Mappings are ranks (in the lath) positions and edits. Edits are like CIGARs. Positions describe a node strand and an offset. It is not possible to support this data model with comma-delimited paths in GFA. (The vg schema vg.proto defines this. It might be good to write it out in BNF.)
We are basically stuck on this issue because others prefer the one line per path format. As a result we are using incompatible versions of P. I would shift to another namespace but think it is a waste of a common format to do so.
What GFA implementations are using the current one line per path format in P? What are the advantages of the comma delimited path format? As far as I understand it removes the need for the path rank elements (saving some space) and makes it so we can parse each path in one line.
from gfa-spec.
We are basically stuck on this issue because others prefer the one line per path format.
I forgot if I have said anything on the issue where you and others attempted to define Paths. As I see it now, I don't have a strong opionin either way. One advantage of the one-line format is that, at least to me, it is more natural to see a path as one object. When you put a path on multiple lines, the parser needs to read all lines into memory and then compose a Path object. That said, I can see your multi-line version has benefits in other cases.
This also allows each path element to have a CIGAR.
I am not sure why you'd like to have Edits/CIGARs. Do you allow mismatches/gaps between a path and a segment mapped to the path? If not, could we just specify overlap lengths in case of graphs with multiple edges? In my view, the primary goal of a Path is to spell a sequence. Using CIGARs complicates this primary goal.
The path line specifies a comma separated list of segment id's followed immediately by an orientation symbol (+ or -). Since both comma's and + and - can be in a segment id
We have to disallow a pattern /[+-],/
in segment names. As long as a segment name does not match this pattern, parsing a GFA1 Path should be unambiguous.
from gfa-spec.
The point of my raising this issue was not to come up with a new syntax or extension of GFA. I simply want to know how to write a GFA parser given that the grammar is ambiguous. Any "fix" should be minimal, but presumably y'all have written parsers, so what did you assume?
from gfa-spec.
@thegenemyers Good catch, Gene. To resolve this we'd need to either remove comma ,
from valid segment ID characters. I'm hesitant to do that because comma ,
is a valid FASTA ID character, and I'd like FASTA and GFA to be compatible. I like your suggestion, Richard @richarddurbin, of making the separator space
rather than comma,
. I wonder whether there any implementations of the current pathP
line specification.
from gfa-spec.
@richarddurbin wrote…
I would rather not have orientation symbols in the list. As far as I can see there is no need for them if you allow edge ids, and you ban edges that
connect a segment to itself (these can always be split by an artificial segment if you really want them).
The orientations are necessary as it's common to have a segment that begins and ends with the same inverted repeat with unique sequence in the middle between the two repeats. It yields the graph below. The paths A+ R+ B+
and A+ R- B+
yield two different sequences. Note that the segment R
begins and ends with the same inverted repeat, but is not a palindrome. A palindrome yields the same graph, but in that case, both paths yield the same sequence.
Graphviz
digraph {
graph [rankdir = "LR"]
"A+" -> "R+" -> "B+"
"A+" -> "R-" -> "B+"
}
GFA
S A *
S B *
S R *
L A + R +
L A + R -
L R + B +
L R - B +
P P1 A+,R+,B+ *
P P2 A+,R-,B+ *
from gfa-spec.
@ekg The argument for having a path in single/multiple lines was debated in #22. The argument for a single line as I recall was that a path represents a single conceptual entity. It makes parsing easier if that entity exists on a single line. If it's spread over multiple lines, that parser would likely have to create a new path object and modify it as each new element for that path arrives in the input stream. They were good arguments for both formats, but the popular format was a single line per path.
That being said, you have a use case for multi lines per path, and it's quite similar in format to the containment C
and proposed fragmentF
record. Could a set of containment/fragment records be used to compose a path, Erik, if they were grouped together with a common path identifier?
from gfa-spec.
@ekg Let's use this issue to discuss the ambiguity of the current path P
record and move the discussion of a new path record format to #36 "Paths in GFA v2". @pb-jchin and @rrwick have thoughts on this topic as well.
from gfa-spec.
We can always repurpose a namespace but then we can't read each other's
paths...
How many implementations are using the one line per path format?
I understand it is sometimes easier to read the path in one step, but
accumulating it (like we do for everything else in the graph) seems equally
simple.
One line per path element solves the issue described here without any
additional format complexity.
from gfa-spec.
Can we escape commas inside of the segment name using a backslash?
Then we would have:
S P1+,P2 acgt
S S3 cgtt
L P1+,P2 S3 *
P X P1+\,P2+,S3+ *
I think this is parsable and clear to the human eye too.
from gfa-spec.
It is clear to the human eye until the path reaches more than a handful of elements. If the path has a million elements in it then it will be very uncomfortable to try to read in a terminal when it is written on one line.
If the path elements each have their own line then we can sort the file by ID in the second column and maintain a pseudo-local view of the graph and walks though it in text. This is helpful when there are many overlapping paths.
I beg the group not to add more custom sub formats to GFA. We can do everything we need with paths using only tab as a delimiter. I have no appetite for supporting a complex ad-hoc serialization format. The whole point of GFA is that it is easy to parse and write from C and easy for humans to read. Otherwise we could define the format in BNF and write JSON/XML/RDF/protobuf/capnproto/etc. from our codes.
Barring any interest in making the elements of paths first class objects, I guess we are going to have to fork our representation of paths. Any suggestions for a letter? Is W taken?
from gfa-spec.
I would agree with @lh3 proposal.
from gfa-spec.
I think that's a pragmatic solution for GFA1, Heng. If we keep the one-line-per-path format for GFA 2, we may consider changing the separator to space as @richarddurbin suggested.
from gfa-spec.
I've submitted PR #40 to resolve this issue for GFA1 as per Heng's suggestion. Your vote is appreciated.
from gfa-spec.
I am replying to @sjackman's comment about distinguishing A+,R+,B+ from A+,R-,B+.
There are different edges that link A+ to R+ and A+ to R-. In the GFA2 pull request you distinguish between the two paths by giving different identifiers to the two edges (links for GFA1) and explicitly giving the relevant identifier in the path specification. If there is only one edge from A to R then you can omit the edge identifier. This way a path is just a list of identifiers.
from gfa-spec.
Yes, that's right. -- Gene
from gfa-spec.
Edge identifiers are needed when specifying paths through a multigraph, and GFA1 is not a multigraph. See my comment at #49 (comment)
from gfa-spec.
I do not see any reason for requiring GFA1 not to represent a multigraph. Until now, it is possible to represent a multigraph, and I think it is useful.
There are different cases of multiple links connecting two segments.
One are complement links, such as L A + B + 100M
and L B - A - 100M
, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.
Also, connecting segments in different orientations (e.g. L A + B+
and L A + B -
) does not make it a multigraph, as the vertex definition is (segment, orientation).
However, multiple dovetail between two sequences , e.g. L A + B + 100M
and L A + B + 50M
, are not a rarity in assembly and make it a multigraph. In my own assembler, Readjoiner, I am working on an output of each stage of the assembly as GFA, and of course the overlap graph of the reads can contain such multiple overlaps. Also, the multigraph still satisfy the requirements of the SSG definition, as both of the L records define two edges in the graph and the simmetry is still satisfied.
from gfa-spec.
I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.
from gfa-spec.
One are complement links, such as
L A + B + 100M
andL A - B – 100M
, they are simply different representations of the same information (in RGFA if I encounter both during parsing, I ignore them). Having both in the graph does not make it a multigraph.
Do you mean L A + B + 100M
and L B - A – 100M
? Yes, those are the two representations of the same pair of complementary edges, (u, v) and (σ(u), σ(v)), where u = A+ and v = B+.
Also, connecting segments in different orientations (e.g.
L A + B +
andL A + B -
) does not make it a multigraph, as the vertex definition is (segment, orientation).
100% agree.
However, multiple dovetail between two sequences , e.g. L A + B + 100M and L A + B + 50M, are not a rarity in assembly and make it a multigraph.
Yes, using such edges does make the graph a multigraph. @richarddurbin made this point as well. "This assumes that you don’t allow multiple different link lines between the same pair of nodes with the same +/- but different overlap numbers".
I agree with @ggonnella that GFA1 can be a multigraph. This leads to complications in implementations, but it is necessary. Individual tools may choose to squash or remove multiple edges, but the spec should not forbid them.
I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.
from gfa-spec.
Do you mean L A + B + 100M and L B - A – 100M?
Yes, sorry, typo, I fixed that.
from gfa-spec.
I'm okay with supporting such multigraphs in GFA1. It would be helpful to the implementation though to say in the header line whether the graph is a strict graph or a multigraph.
I have nothing against this, if it is useful for some applications.
from gfa-spec.
Related Issues (20)
- Need to specify "reference" in terms of cigar operations in overlap HOT 4
- Do two genes link together in GFA file indicate these two genes associate with each other? HOT 2
- Should a PG line (like in SAM) be codified in the spec? HOT 3
- GFA2: does not mention the encoding expected of file content (ASCII-7bit, UTF-8, etc.) HOT 1
- v1.1 is not semantically distinct from v1 HOT 2
- W lines: no description of '>' and '<' use HOT 2
- Use of GFA2 as a pangenome reference
- Representation of annotations in a GFA2/GFA3 file
- Segment names conflicts in spec
- Translocation and Inversion HOT 2
- Allow lowercase characters in hex strings
- looking for a CLI tool to produce circular candidates from GFA HOT 2
- What do P lines with zero, one or two Segment ids mean in GFA v1? HOT 11
- Namespace for S and P lines in GFA1 HOT 1
- Indicating that a path is circular HOT 2
- manipulating .gfa file HOT 5
- Implied adjacent objects in GFA2 groups are problematic HOT 3
- GFA2 specification does not mention optional field reserved tags HOT 4
- making path overlap cigar list optional HOT 3
- GFA has been submitted to the EDAM ontology HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gfa-spec.