Path support about gfa-spec HOT 23 CLOSED

gfa-spec commented on September 24, 2024

Path support

from gfa-spec.

Comments (23)

pmelsted commented on September 24, 2024

I agree we should add paths. For now we can work with the latest version

P segmentName pathName orientation [cigar]

but we should also keep in mind ease of parsing.

ekg commented on September 24, 2024

but we should also keep in mind ease of parsing.

What do you mean by this?

pmelsted commented on September 24, 2024

Mostly I mean that when you see a P tag you don't know how many segments there are going to be, so a parser would have to keep a list of path objects "open", whereas most of the other things are single line constructs. When you see a segment you know that's it.
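To make that bookkeeping concrete, here is a minimal Python sketch of reading the one-step-per-line form quoted earlier (`P segmentName pathName orientation [cigar]`). The whitespace splitting, the `parse_step_lines` name, and the sample identifiers are assumptions for illustration, not part of any spec.

```python
from collections import defaultdict

def parse_step_lines(lines):
    """Group one-step-per-line P records by path name.

    Every path stays "open" until the whole input has been read, because
    nothing on a P line says whether more steps for that path will follow.
    """
    open_paths = defaultdict(list)  # path name -> [(segment, orientation, cigar)]
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "P":
            continue  # S, L and similar records are complete on a single line
        segment, path_name, orientation = fields[1:4]
        cigar = fields[4] if len(fields) > 4 else "*"  # cigar is optional
        open_paths[path_name].append((segment, orientation, cigar))
    return dict(open_paths)

# Hypothetical input: three steps of one path, interleaved with a segment line.
example = [
    "S 3 ACGT",
    "P 3 217 -",
    "P 140 217 + 10M",
    "P 178 217 - 13M",
]
paths = parse_step_lines(example)
```

The dictionary of partially built paths is exactly the "open" state mentioned above; a single-line path record would not need it.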

ekg commented on September 24, 2024

Yes, that's absolutely right. Also, ordering may not be clear in this format if we allow cycles.

sjackman commented on September 24, 2024

ABySS path file format looks like so: https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats#path
217 3- 140+ 178-

The format consists of two columns separated by a TAB character:
ID of joined sequence
list of sequence IDs to be joined (separated by spaces)
For each contig, the '+' orientation is the exact sequence that appears in the FASTA file, while the '-' orientation is the reverse complement of that sequence. If the line is composed of a single identifier, the specified contig is removed from the assembly.

For GFA, I'd suggest a four-column format, where the third column is a comma-separated list of n vertices (segment IDs and orientations), and the fourth column is a comma-separated list of n-1 edges (CIGAR strings).

P 217 3-,140+,178- 10M,13M

Thoughts?
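
As a rough illustration of how little state the proposed four-column line needs, the following Python sketch parses one such line. The function name `parse_path_line` and the whitespace splitting are assumptions; the check that there are n-1 CIGAR strings for n vertices follows the description above.

```python
def parse_path_line(line):
    """Parse 'P <id> <v1,v2,...> <c1,c2,...>' into (id, steps, cigars)."""
    record_type, path_id, vertex_field, cigar_field = line.split()
    if record_type != "P":
        raise ValueError("not a path line: " + line)
    # Each vertex is a segment ID followed by a one-character orientation.
    steps = [(vertex[:-1], vertex[-1]) for vertex in vertex_field.split(",")]
    for segment_id, orientation in steps:
        if orientation not in ("+", "-"):
            raise ValueError("bad orientation on segment " + segment_id)
    cigars = cigar_field.split(",")
    if len(cigars) != len(steps) - 1:
        raise ValueError("expected one CIGAR per edge (n-1 for n vertices)")
    return path_id, steps, cigars

path_id, steps, cigars = parse_path_line("P 217 3-,140+,178- 10M,13M")
```

Because the whole path sits on one line, the parser can return a finished path immediately instead of keeping it open.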

pmelsted commented on September 24, 2024

I like this better. If there is only one link (the majority of cases), each CIGAR string could be *, and a single * could be the default for all links. Is there any specific reason to keep them in there?

Also, in @ekg's format there is a position field indicating the starting position within the first segment, correct?

sjackman commented on September 24, 2024

After reading lh3/gfatools#2, I can see there's also a use for having individual records that show which segments are contained in which paths. I think these records could be very similar (nearly identical) to containment C records to show that segment x is contained in path y.

sjackman commented on September 24, 2024

Also, in @ekg's format there is a position field indicating the starting position within the first segment, correct?

Could that be handled by the soft clip operator of the cigar string?

ekg commented on September 24, 2024

There is not a position field in the vg output. There are no starting positions in segments. I'm actually really confused because I feel like we are talking about radically different graphs!

I'll try to explain what I'm working with. It's basically the kind of thing that you could encode directly in graphviz's dot format. Nodes have sequences as labels, and edges connect to the start or end of nodes. No edges or reverse complements are implied, so A->B does not also mean B->A. That said, you could take the reverse complement of the graph by RC-ing the nodes and reversing the edge directions and node traversals (which build paths). Although it is not supported yet in vg, I would like to be able to connect the "start" or "end" of a node to the start or end of another. This means there are four types of edges: -+, --, ++, +-. This allows the representation of inversions without duplicating the sequence nodes that are used, as well as certain kinds of cyclic constructions that also minimize the amount of redundant sequence that is represented in the nodes.
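
One possible reading of this graph model, sketched in Python: each edge names the side ("start" or "end") of the node it leaves and the side of the node it enters, and reverse-complementing the graph reverse-complements every node, swaps each edge's direction, and flips the sides. The tuple layout, the `reverse_complement_graph` name, and the sample nodes are assumptions made for illustration; this is not how vg stores graphs.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
FLIP = {"start": "end", "end": "start"}

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_complement_graph(nodes, edges):
    """nodes: name -> sequence; edges: set of (from, from_side, to, to_side)."""
    rc_nodes = {name: reverse_complement(seq) for name, seq in nodes.items()}
    # Reversing an edge swaps its endpoints; RC-ing a node swaps its two sides.
    rc_edges = {
        (b, FLIP[side_b], a, FLIP[side_a]) for a, side_a, b, side_b in edges
    }
    return rc_nodes, rc_edges

nodes = {"A": "AAC", "B": "GGT"}
edges = {
    ("A", "end", "B", "start"),  # ordinary adjacency: A then B
    ("A", "end", "B", "end"),    # inversion: A then B traversed backwards
}
rc_nodes, rc_edges = reverse_complement_graph(nodes, edges)
```

Applying the function twice returns the original graph, which is a quick sanity check that the side flipping is consistent.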

There is also something basic that I don't understand. Why do we need to encode the overlaps as cigars? Isn't it more consistent to represent the alignment between the ends of nodes as part of the graph itself? An alignment is a graph.

sjackman commented on September 24, 2024

Hi, Erik. When two sequences are exactly adjacent with no overlap, the CIGAR string is 0M. See Link line. The CIGAR string is used to specify the overlap alignment of two reads, e.g. the output of DALIGNER or MHAP.

If you have the vertices AAA, CCC and ATA, and you want to construct the sequence path "AAAGGGATA", the path would be:

S 0 AAA
S 1 CCC
S 2 ATA
L 0 + 0 + 2M
L 0 + 1 - 0M
L 1 - 2 + 0M
L 1 + 1 + 2M
P 999 0+,1-,2+ 0M,0M

(added links)
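
For readers who want to check the example, here is a small Python sketch that spells the sequence of such a path, using one CIGAR per junction between consecutive segments. It assumes overlaps are written only as a single `<n>M` operation (as in this thread) and that n bases are trimmed from the start of each following segment; `spell_path` and `overlap_length` are illustrative names, not part of the spec.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap_length(cigar):
    """Length of an overlap given as a single '<n>M' CIGAR (assumption)."""
    match = re.fullmatch(r"(\d+)M", cigar)
    if match is None:
        raise ValueError("only single-M CIGARs are handled in this sketch")
    return int(match.group(1))

def spell_path(segments, steps, cigars):
    """Concatenate oriented segments, skipping the overlapped prefix of each."""
    oriented = [
        segments[name] if orientation == "+" else reverse_complement(segments[name])
        for name, orientation in steps
    ]
    sequence = oriented[0]
    for seq, cigar in zip(oriented[1:], cigars):
        sequence += seq[overlap_length(cigar):]
    return sequence

segments = {"0": "AAA", "1": "CCC", "2": "ATA"}
steps = [("0", "+"), ("1", "-"), ("2", "+")]
spelled = spell_path(segments, steps, ["0M", "0M"])
print(spelled)  # AAAGGGATA
```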

ekg commented on September 24, 2024

The spec is coming together nicely :)

Here I'm confused:

A link from A to B means that the end of A overlaps with the end of B,

Should this say "A link from A to B means that the end of A overlaps with the start of B" ?

sjackman commented on September 24, 2024

Yes, you're right. I've corrected the text in f7c4207

sjackman commented on September 24, 2024

How do we feel about the current path P definition and my proposed definition? If mine is preferred, I'll submit a pull request to change the spec.

pb-jchin commented on September 24, 2024

@sjackman I think it might be good to make sure we get the semantics of the graph right. I feel there is some confusion here. When you describe the vertices "AAA", "CCC" and "ATA", it makes me think the sequences are the vertices. Is that what you mean?

sjackman commented on September 24, 2024

Yes, that's correct. Heng Li @lh3 wrote…
https://github.com/pmelsted/GFA-spec/blob/master/GFA-spec.md#theory

DNA sequence assembly is often (though not always) represented as a graph. There are multiple types of graphs including de Bruijn graph, overlap graph, unitig graph and string graph. They are all bidirected graphs. Briefly, in this graph, each vertex is a sequence and each arc is an overlap.

pb-jchin commented on September 24, 2024

@sjackman if you take a look at the SVG I generated, https://github.com/pb-jchin/GFA-spec/blob/master/examples/ThreeReads_Summary.svg, you can see that a vertex (or node) in the string graph is actually the "end" or the "begin" of a read or sequence. The nodes are not associated with sequences at all, except that they use some read identifier as a label. Yes, such a string graph and the overlap graph (which I consider a graph whose nodes are associated with sequences) can be converted to each other, as shown in the SVG plot. However, I do hope GFA can support the semantics in which the nodes are just labels and the sequences are associated with edges.

sjackman commented on September 24, 2024

There has been no proposal to support sequence on the edges, but it's possible that you could do such a thing with the current spec by creating segment vertices with no sequence (S … LN:i:0), join them with links, and add an optional sequence attribute to the link (e.g. L … sq:Z:ACGT).

pmelsted commented on September 24, 2024

I like the pathspec of the form
P id nodelist cigarlist

Regarding sequences in edges, this was something FASTG did exclusively, and there are cases where you can't represent graphs without empty edges. This is why we have stuck to segments (sequences) as vertices. Given that segments are vertices, I'm opposed to storing sequences in links (edges), since then you would have to look for a sequence in two places. In your graph, @pb-jchin, the bidirected string graph is essentially what would be represented in the GFA format.

pb-jchin commented on September 24, 2024

@pmelsted We don't have to store the sequence in two places. The edge sequences are sub-strings of one of the reads. We only need the coordinates to get the sub-string. I think it might be more efficient to construct path sequences by simply concatenating those sub-strings rather than calculating the overlaps, clipping the sequences and concatenating them. As for the "zero" length edges, I don't know how such edges can be generated. Do you have an example so I can understand it? Even if it is necessary, one can have an edge with an empty string associated with it. Am I missing something here?

pmelsted commented on September 24, 2024

0M edges arise naturally in contig-generated graphs or variation graphs. Here the vertices are assembled contigs or segments, and the links are just an indicator that they are contiguous somewhere in the underlying genome. For raw long reads they don't make sense.

pb-jchin commented on September 24, 2024

@pmelsted so the case "0M" is like this:

1:B           1:E
-------------->
               ---------------->
              2:B               2:E

right? In this case, I will represent the graph edges as 1:E -> 2:E, and the edge sequence happens to be everything between 2:B and 2:E.

sjackman commented on September 24, 2024

@pb-jchin

We don't have to store the sequence in two places. The edge sequences are sub-strings of one of the reads. We only need the coordinates to get the sub-string. I think it might be more efficient to construct path sequences by simply concatenating those sub-strings rather than calculating the overlaps, clipping the sequences and concatenating them.

I think I understand now. On a link line, you want to store the position on sequence B where the overlap between A and B ends, so that you can easily extract the sequence from the end of the overlap to the end of B. We could add four optional attributes to the L link line to specify the coordinates of the start and end of the overlap on sequences A and B. Alternatively, these four columns could be mandatory integer fields of some new record type.
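
A hypothetical sketch of that idea in Python: each link carries the start and end of the overlap on A and on B, and a path sequence is built by appending the part of B that lies after the overlap. The `OverlapLink` class, its field names, the 0-based half-open coordinates, and the forward-only orientation are all assumptions for illustration; nothing like this is defined in the spec.

```python
from dataclasses import dataclass

@dataclass
class OverlapLink:
    a: str        # name of the first read
    b: str        # name of the second read
    a_start: int  # overlap start on A (0-based)
    a_end: int    # overlap end on A (exclusive)
    b_start: int  # overlap start on B (0-based)
    b_end: int    # overlap end on B (exclusive)

def extend_through(reads, first, links):
    """Spell a path by appending, for each link, B's sequence after the overlap."""
    sequence = reads[first]
    for link in links:
        # No CIGAR parsing or clipping: just a substring lookup by coordinate.
        sequence += reads[link.b][link.b_end:]
    return sequence

# Hypothetical reads: the last 3 bases of read 1 match the first 3 of read 2.
reads = {"1": "ACGTAC", "2": "TACGGA"}
link = OverlapLink(a="1", b="2", a_start=3, a_end=6, b_start=0, b_end=3)
assembled = extend_through(reads, "1", [link])
```

With a 0M link the overlap is empty (`b_start == b_end == 0`), so the whole of B is appended, which matches the "everything between 2:B and 2:E" case above.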

pb-jchin commented on September 24, 2024

@sjackman yes.
