Comments (23)
I agree we should add paths. For now we can work with the latest version
P segmentName pathName orientation [cigar]
but we should also keep in mind ease of parsing.
from gfa-spec.
> but we should also keep in mind ease of parsing.
What do you mean by this?
Mostly I mean that when you see a P tag you don't know how many segments there are going to be, so a parser would have to keep a list of path objects "open", whereas most of the other things are single line constructs. When you see a segment you know that's it.
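To illustrate the parsing concern, here is a minimal sketch of a reader for the one-segment-per-line P format quoted above. The function name `collect_paths`, the whitespace splitting (GFA itself is tab-separated), and the `*` placeholder for a missing CIGAR are assumptions made for this example, not part of any spec. Note that every path has to stay open until the end of the input, because any later line may still extend it.

```python
from collections import defaultdict

def collect_paths(lines):
    """Group per-segment P lines by path name, in file order (hypothetical helper)."""
    open_paths = defaultdict(list)  # no path can be closed before end of input
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "P":
            continue  # S, L and other records are complete on a single line
        segment, path_name, orientation = fields[1:4]
        cigar = fields[4] if len(fields) > 4 else "*"  # assumed placeholder
        open_paths[path_name].append((segment, orientation, cigar))
    return dict(open_paths)

paths = collect_paths(["P 3 217 -", "S 3 ACGT", "P 140 217 + 10M"])
```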
Yes, that's absolutely right. Also, ordering may not be clear in this format if we allow cycles.
ABySS path file format looks like so: https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats#path
217 3- 140+ 178-
The format consists of two columns separated by a TAB character:
- ID of joined sequence
- list of sequence IDs to be joined (separated by spaces)
For each contig, the '+' orientation is the exact sequence that appears in the FASTA file, while the '-' orientation is the reverse complement of that sequence. If the line is composed of a single identifier, the specified contig is removed from the assembly.
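A minimal sketch of reading one such line, assuming the two columns are separated by a TAB as described above. The function name `parse_abyss_path` is hypothetical, and returning an empty list for a single-identifier line is a choice made only for this example.

```python
def parse_abyss_path(line):
    """Return (joined_id, [(contig_id, orientation), ...]) for one path line."""
    joined_id, _, contig_list = line.rstrip("\n").partition("\t")
    steps = []
    for token in contig_list.split():
        contig_id, orientation = token[:-1], token[-1]
        if orientation not in ("+", "-"):
            raise ValueError(f"missing orientation in {token!r}")
        steps.append((contig_id, orientation))
    # A line with a single identifier has no second column, so steps is empty.
    return joined_id, steps

record = parse_abyss_path("217\t3- 140+ 178-")
```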
For GFA, I'd suggest a four-column format, where the third column is a comma-separated list of n vertices (segment IDs and orientations), and the fourth column is a comma-separated list of n-1 edges (CIGAR strings).
P 217 3-,140+,178- 10M,13M
Thoughts?
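The proposed single-line record can be parsed without keeping any state between lines. The following is a rough sketch only: `parse_path_line` is a hypothetical name, fields are split on whitespace for readability (GFA uses tabs), and a path is assumed to have at least two vertices.

```python
def parse_path_line(line):
    """Parse 'P id v1,v2,...,vn c1,...,c(n-1)' into (id, vertices, cigars)."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "P":
        raise ValueError(f"not a four-column P line: {line!r}")
    _, path_id, vertex_field, cigar_field = fields
    # Each vertex is a segment ID followed by its orientation character.
    vertices = [(v[:-1], v[-1]) for v in vertex_field.split(",")]
    cigars = cigar_field.split(",")
    if len(cigars) != len(vertices) - 1:
        raise ValueError("expected one CIGAR per pair of consecutive vertices")
    return path_id, vertices, cigars

parsed = parse_path_line("P 217 3-,140+,178- 10M,13M")
```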
I like this better. If there is only one link between consecutive segments (the majority of cases), each CIGAR string could be *, with a single * for all links as the default. Is there any specific reason to keep them in there?
Also, in @ekg's format there is a position field indicating the starting position within the first segment, correct?
After reading lh3/gfatools#2, I can see there's also a use for having individual records that show which segments are contained in which paths. I think these records could be very similar (nearly identical) to containment C records, to show that segment x is contained in path y.
> Also in @ekg's format there is a position field indicating starting position within the first segment, correct?
Could that be handled by the soft clip operator of the cigar string?
There is not a position field in the vg output. There are no starting positions in segments. I'm actually really confused because I feel like we are talking about radically different graphs!
I'll try to explain what I'm working with. It's basically the kind of thing that you could encode directly in graphviz's dot format. Nodes have sequences as labels, and edges connect to the start or end of nodes. No edges or reverse complements are implied, so A->B does not also mean B->A. That said, you could take the reverse complement of the graph by RC-ing the nodes and reversing the edge directions and node traversals (which build paths). Although it is not supported yet in vg, I would like to be able to connect the "start" or "end" of a node to the start or end of another. This means there are four types of edges: -+, --, ++, +-. This allows the representation of inversions without duplicating the sequence nodes that are used, as well as certain kinds of cyclic constructions that also minimize the amount of redundant sequence that is represented in the nodes.
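As a rough sketch of the reversal described above: assuming each edge is stored as (from, from_orientation, to, to_orientation) in the style of a GFA link, and each path as a list of (node, orientation) steps, the reverse-complement view can be computed as follows. This tuple layout and the helper names are assumptions for illustration, not vg's data model.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def flip(orientation):
    return "-" if orientation == "+" else "+"

def reverse_edge(edge):
    """The same adjacency, traversed in the opposite direction."""
    a, a_orient, b, b_orient = edge
    return (b, flip(b_orient), a, flip(a_orient))

def reverse_path(path):
    """Reverse the step order and flip every orientation."""
    return [(node, flip(orient)) for node, orient in reversed(path)]

# All four orientation combinations are valid edge types.
edge_types = {(x, y) for x in "+-" for y in "+-"}
```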
There is also something basic that I don't understand. Why do we need to encode the overlaps as cigars? Isn't it more consistent to represent the alignment between the ends of nodes as part of the graph itself? An alignment is a graph.
Hi, Erik. When two sequences are exactly adjacent with no overlap, the CIGAR string is 0M. See the Link line. The CIGAR string is used to specify the overlap alignment of two reads, e.g. the output of DALIGNER or MHAP.
If you have the vertices AAA, CCC and ATA, and you want to construct the sequence path "AAAGGGATA", the path would be:
S 0 AAA
S 1 CCC
S 2 ATA
L 0 + 0 + 2M
L 0 + 1 - 0M
L 1 - 2 + 0M
L 1 + 1 + 2M
P 999 0+,1-,2+ 0M,0M
(added links)
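To make the example concrete, here is a minimal sketch that spells out the sequence of a path from its segments. It assumes each overlap is a single match operation such as 0M or 13M and trims that many bases from the start of the next segment; the names `spell_path` and `overlap_length` are invented for this illustration.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def overlap_length(cigar):
    """Length of an overlap written as one match op, e.g. '0M' or '13M'."""
    match = re.fullmatch(r"(\d+)M", cigar)
    if match is None:
        raise ValueError(f"unsupported CIGAR for this sketch: {cigar!r}")
    return int(match.group(1))

def spell_path(segments, vertices, cigars):
    """Concatenate oriented segments, skipping the overlapped prefix of each."""
    pieces = []
    for i, (seg_id, orient) in enumerate(vertices):
        seq = segments[seg_id]
        if orient == "-":
            seq = reverse_complement(seq)
        if i > 0:
            seq = seq[overlap_length(cigars[i - 1]):]
        pieces.append(seq)
    return "".join(pieces)

segments = {"0": "AAA", "1": "CCC", "2": "ATA"}
path_seq = spell_path(segments, [("0", "+"), ("1", "-"), ("2", "+")], ["0M", "0M"])
```

With the segments above, vertex 1 is traversed in reverse complement (CCC becomes GGG), which is how the path yields AAAGGGATA.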
The spec is coming together nicely :)
Here I'm confused:
> A link from A to B means that the end of A overlaps with the end of B,
Should this say "A link from A to B means that the end of A overlaps with the start of B" ?
Yes, you're right. I've corrected the text in f7c4207
How do we feel about the current path P definition and my proposed definition? If mine is preferred, I'll submit a pull request to change the spec.
@sjackman I think it might be good to make sure we get the semantics of a graph right. I feel there is some confusion here. When you describe the vertices "AAA", "CCC" and "ATA", it makes me think the sequences are the vertices. Is that what you mean?
Yes, that's correct. Heng Li @lh3 wrote…
https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#theory
DNA sequence assembly is often (though not always) represented as a graph. There are multiple types of graphs including de Bruijn graph, overlap graph, unitig graph and string graph. They are all bidirected graphs. Briefly, in this graph, each vertex is a sequence and each arc is an overlap.
@sjackman if you take a look at the SVG I generated, https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg, you can see that a vertex (or node) in the string graph is actually the "end" or the "begin" of a read or sequence. The nodes are not associated with sequences at all, except that a read identifier is used as the label. Yes, such a string graph and the overlap graph (which I consider a graph whose nodes are associated with sequences) can be converted to each other, as shown in the SVG plot. However, I do hope GFA can support the semantics in which the nodes are just labels and the sequences are associated with edges.
There has been no proposal to support sequence on the edges, but it's possible that you could do such a thing with the current spec by creating segment vertices with no sequence (S … LN:i:0), joining them with links, and adding an optional sequence attribute to the link (e.g. L … sq:Z:ACGT).
I like the pathspec of the form
P id nodelist cigarlist
Regarding sequences in edges, this was something FASTG did exclusively, and there are cases where you can't represent graphs without empty edges. This is why we have stuck to segments (sequences) as vertices. Given that segments are vertices, I'm opposed to storing sequences in links (edges), since then you would have to look for a sequence in two places. In your graph, @pb-jchin, the bidirected string graph is essentially what would be represented in the GFA format.
@pmelsted We don't have to store the sequence in two places. The edge sequences are substrings of one of the reads. We only need the coordinates to get the substring. I think it might be more efficient to construct path sequences by simply concatenating those substrings rather than calculating the overlaps, clipping the sequences, and concatenating them. As for the "zero"-length edges, I don't know how such edges can be generated. Do you have an example so I can understand it? Even if they are necessary, one can have an edge with an empty string associated with it. Am I missing something here?
0M edges arise naturally in contig-generated graphs or variation graphs; here vertices are assembled contigs or segments, and the links are just an indicator that they are contiguous somewhere in the underlying genome. For raw long reads they don't make sense.
@pmelsted so the case "0M" is like this:
1:B 1:E
-------------->
---------------->
2:B 2:E
right? In this case, I will represent the graph edges as 1:E -> 2:E, and the edge sequence happens to be everything between 2:B and 2:E.
> We don't have to store the sequence in two places. The edge sequences are substrings of one of the reads. We only need the coordinates to get the substring. I think it might be more efficient to construct path sequences by simply concatenating those substrings rather than calculating the overlaps, clipping the sequences, and concatenating them.
I think I understand now. On a link line, you want to store the position on sequence B where the overlap between A and B ends, so that you can easily extract the sequence from the end of the overlap to the end of B. We could add four optional attributes to the L link line to specify the coordinates of the start and end of the overlap on sequences A and B. Alternatively, these four columns could be mandatory integer fields of some new record type.
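A minimal sketch of that idea, assuming 0-based half-open coordinates and forward-strand reads only. The `Overlap` record and its field names are hypothetical; no such attributes exist in the spec at this point in the discussion.

```python
from typing import NamedTuple

class Overlap(NamedTuple):
    """Hypothetical overlap coordinates on sequences A and B (0-based, half-open)."""
    a_start: int
    a_end: int
    b_start: int
    b_end: int

def edge_sequence(seq_b, overlap):
    """The part of B after the overlap, i.e. what the edge A -> B contributes."""
    return seq_b[overlap.b_end:]

def spell(first_seq, steps):
    """First read followed by the non-overlapping tail of each later read."""
    return first_seq + "".join(edge_sequence(seq, ov) for seq, ov in steps)

read_a, read_b = "ACGTAC", "TACGG"
ov = Overlap(a_start=3, a_end=6, b_start=0, b_end=3)  # suffix TAC of A matches prefix TAC of B
joined = spell(read_a, [(read_b, ov)])
```

With the coordinates stored, the path sequence is built by plain concatenation, and no alignment has to be recomputed or clipped.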
@sjackman yes.