For clarity, I think it is necessary to specify paths in a single record. If we use no


Paths must be specified in one record about gfa-spec HOT 9 CLOSED

gfa-spec commented on June 17, 2024

Paths must be specified in one record

from gfa-spec.

Comments (9)

sjackman commented on June 17, 2024

I agree that one record is preferred. It must also specify sequence orientations.
n1+,n2-,…
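As a rough illustration of the one-record form with orientations, here is a minimal Python sketch that splits such a field into (segment, orientation) pairs. The function name `parse_oriented_path` is hypothetical and not part of any GFA library.

```python
def parse_oriented_path(field):
    """Split a field such as 'n1+,n2-' into [('n1', '+'), ('n2', '-')]."""
    steps = []
    for token in field.split(","):
        # Each step needs a non-empty segment name plus a trailing sign.
        if len(token) < 2 or token[-1] not in ("+", "-"):
            raise ValueError(f"bad path step: {token!r}")
        steps.append((token[:-1], token[-1]))
    return steps


print(parse_oriented_path("n1+,n2-"))
```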


ekg commented on June 17, 2024

I'm going to recant this. I think we can represent paths in a dispersed way by extending the representation to include ranks. Then each P record could have the form:

P <node id> <path name> <rank> <orientation>

This allows portions of the path to be represented without introducing ambiguity. That matters: without the ability to represent a valid portion of a path, it is not possible to represent a sub-graph without including the entire path description for every path that overlaps the region.
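To make the dispersed form concrete, here is a minimal Python sketch that gathers such records by path name and orders them by rank. It assumes whitespace-separated fields in the order shown above and ignores any trailing fields; `collect_paths` is a hypothetical helper, not an existing API.

```python
from collections import defaultdict


def collect_paths(lines):
    """Group dispersed P records by path name, ordered by rank."""
    paths = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] != "P":
            continue  # not a per-node path record
        node_id, path_name, rank, orientation = fields[1:5]
        paths[path_name].append((int(rank), node_id + orientation))
    return {name: sorted(steps) for name, steps in paths.items()}


records = ["P 7 s1 7 + 10M", "P 6 s1 6 + 7M"]
print(collect_paths(records))
```

Because every record carries its own rank, the input order of the lines does not matter.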

If we extend the orientation field to represent path starts and ends we also can resolve #24.


rrwick commented on June 17, 2024

Looking at your example in issue 23, this path notation seems a bit unwieldy. It adds a lot of lines to the graph file, increases the file size and decreases human-readability. I prefer the one-line-per-path format for its neatness.


rrwick commented on June 17, 2024

I also see a potential problem with this notion of pulling out a sub-graph but keeping relevant parts of paths: a path can be broken into discontiguous pieces.

Again using your example from issue 23, let's say I pull out the subgraph including only segments 6 and 7 to get this file:

H   HVN:Z:1.0
S   6   GGACTAA
P   6   s1  6   +   7M
P   6   s2  3   +   7M
P   6   s3  3   +   7M
P   6   s3  10  +   7M
P   6   s3  14  +   7M
L   6   +   7   +   0M
S   7   GGACAAAGGT
P   7   s1  7   +   10M
P   7   s2  4   +   10M
P   7   s3  4   +   10M
P   7   s3  11  +   10M

Path s1 is now 6+,7+ and path s2 is the same. But path s3 is now broken into three pieces: 6+; 6+,7+ and 6+,7+. That strikes me as very odd: a single path, s3, which consists of three separate parts. I feel like that's deviating from the notion of what a graph path is.
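The fragmentation can be checked mechanically. The sketch below is only an illustration: it treats steps with consecutive ranks as one contiguous piece (an assumption, since it does not consult the L records), and `path_pieces` is a hypothetical helper.

```python
from collections import defaultdict

SUBGRAPH = """\
H   HVN:Z:1.0
S   6   GGACTAA
P   6   s1  6   +   7M
P   6   s2  3   +   7M
P   6   s3  3   +   7M
P   6   s3  10  +   7M
P   6   s3  14  +   7M
L   6   +   7   +   0M
S   7   GGACAAAGGT
P   7   s1  7   +   10M
P   7   s2  4   +   10M
P   7   s3  4   +   10M
P   7   s3  11  +   10M
"""


def path_pieces(gfa_text):
    """Return, per path, the runs of steps whose ranks are consecutive."""
    steps = defaultdict(list)
    for line in gfa_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "P":
            node_id, name, rank, orientation = fields[1:5]
            steps[name].append((int(rank), node_id + orientation))
    pieces = {}
    for name, ranked in steps.items():
        runs, previous = [], None
        for rank, step in sorted(ranked):
            if previous is not None and rank == previous + 1:
                runs[-1].append(step)  # extends the current piece
            else:
                runs.append([step])    # a rank gap starts a new piece
            previous = rank
        pieces[name] = runs
    return pieces


for name, runs in path_pieces(SUBGRAPH).items():
    print(name, runs)
```

On this file, s1 and s2 each come out as a single piece, while s3 comes out as three.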

Am I correctly interpreting what you're saying about sub-graphs from a GFA file?


ekg commented on June 17, 2024

You perfectly understand what I'm suggesting. I'll try to motivate this further with some discussion.

The ranks show that the paths are not completely represented. What this lets us do is see what is going on in a local context. We don't need to read through all of path s3 to know that it crosses node 6 three times. We can also see that parts of the path are missing. We might say this is an invalid graph on that basis. Or, we could combine this piece with others to reconstruct the whole graph and its paths. If we have a whole valid graph, it will be clear because of the ranks listed in each path component record.
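A sketch of that idea in Python, under stated assumptions: records are already parsed into `(node_id, path_name, rank, orientation)` tuples, and the segment id 8 in the usage lines is made up for illustration. Both function names are hypothetical.

```python
def merge_path_records(*record_sets):
    """Union path records from several sub-graph views, keyed by (path, rank)."""
    merged = {}
    for records in record_sets:
        for node_id, path_name, rank, orientation in records:
            key = (path_name, rank)
            step = node_id + orientation
            # The same rank of the same path must always name the same step.
            if merged.setdefault(key, step) != step:
                raise ValueError(f"conflicting step at {key}")
    return merged


def missing_ranks(merged, path_name):
    """List ranks absent between the lowest and highest rank seen for a path."""
    present = {rank for (name, rank) in merged if name == path_name}
    if not present:
        return []
    return [r for r in range(min(present), max(present) + 1) if r not in present]


view_a = [("6", "s3", 3, "+"), ("7", "s3", 4, "+")]
view_b = [("8", "s3", 5, "+")]  # hypothetical segment from another view
merged = merge_path_records(view_a, view_b)
print(missing_ranks(merged, "s3"))
```

This only detects gaps between the ranks that are present. Without explicit start and end markers it cannot tell whether a path is truncated at either end.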

Breaking the paths into pieces is more verbose, but it allows us to manipulate the GFA representation in various ways to do text processing operations that yield different views into the graph. As in this example, we can order things by node, seeing what paths each node is in. Or, we can organize by path to see the subset of the graph associated with each path. More complex sorts would seem achievable with a bit of code, but in the end they would just be sorts of the lines. We wouldn't need to do anything within each line to be able to manipulate the path representation.
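For instance, both views are plain line sorts. This Python sketch assumes the field order used in the example above (node id, path name, rank); the function names are hypothetical.

```python
def p_records(lines):
    """Keep only the per-node path records."""
    return [line for line in lines if line.split()[:1] == ["P"]]


def sort_by_path(lines):
    """Order records by path name, then by rank within each path."""
    return sorted(p_records(lines),
                  key=lambda l: (l.split()[2], int(l.split()[3])))


def sort_by_node(lines):
    """Order records by node id, then by path name and rank."""
    return sorted(p_records(lines),
                  key=lambda l: (l.split()[1], l.split()[2], int(l.split()[3])))


lines = [
    "P   7   s3  4   +   10M",
    "P   6   s3  10  +   7M",
    "S   6   GGACTAA",
    "P   6   s1  6   +   7M",
    "P   6   s3  3   +   7M",
]
print("\n".join(sort_by_path(lines)))
print("\n".join(sort_by_node(lines)))
```

Node ids are compared as strings here, so numeric ids of different lengths would need an integer key instead.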

We can work around the subsetting issue by not allowing it for paths, but this also seems like a problem to me because subsetting variation data has been very important for practical use. I know from work with larger graphs that it is here too. An alternative to allow subsets without passing along the entire path would be to make new semantics to describe how the paths are cut up. The logical extension of this is the general case in which each path component has a rank saying where it sits in the path it is part of.

@JervenBolleman helped me work through this in vg and has made an RDF serialization of sequence graphs+paths. He may have some other points.

Sorry to not be stable in my opinion here. As I have worked on problems with sequence graphs my perspective has evolved. I would rather do this in a coherent way than stick to my guns, and so when I saw problems with what I had written I ended up rewriting a fair bit of my path handling code.


sjackman commented on June 17, 2024

I think that there are valid arguments for, and useful features of, both one line per path and one line per path segment. My preference is one line per path, and we've already discussed and nailed down that record format. I think that you could propose a separate record type for one line per path segment to complement the existing format.


JervenBolleman commented on June 17, 2024

@ekg I was hoping to stay away from discussing formats like GFA which are about as human readable as Linear B ;) i.e. only readable to those trained in the specific art.

However, in the end you need the rank to accurately describe paths, either implicit or explicit. If you want subpaths the rank must be explicit or you lose information.
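To illustrate the implicit versus explicit distinction: in a one-line path the rank is simply the position in the list, and it can be written out per step. This Python sketch assumes ranks count up from a caller-supplied `first_rank` and emits tab-separated records in the dispersed form proposed earlier in the thread; `explode_path` is a hypothetical name.

```python
def explode_path(path_name, oriented_steps, first_rank=1):
    """Turn '6+,7+' into one explicit-rank record per step."""
    records = []
    for rank, step in enumerate(oriented_steps.split(","), start=first_rank):
        node_id, orientation = step[:-1], step[-1]
        records.append(f"P\t{node_id}\t{path_name}\t{rank}\t{orientation}")
    return records


for record in explode_path("s1", "6+,7+", first_rank=6):
    print(record)
```

Once a sub-path is cut out, `first_rank` can no longer be inferred from position alone, which is the information loss described above.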

@rrwick in your example you only have non-joined pieces of the s3 path in the subpath s3, which is why the path looks odd. Unlike subpaths s1 and s2, which are only terminally fragmented, subpath s3 is internally fragmented as well. If you see each GFA file as a limited view on the whole variant graph, then it makes more sense (at least to me).


sjackman commented on June 17, 2024

Paths are single-line records. Containment C records could be used to specify the precise location and alignment of segments S within the path P.


stale commented on June 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
