Comments (9)
I agree that one record is preferred. It must also specify sequence orientations, for example n1+,n2-,…
from gfa-spec.
I'm going to recant this. I think we can represent paths in a dispersed way by extending the representation to include ranks. Then each path (P) record could have the form:
P <node id> <path name> <rank> <orientation>
This allows portions of a path to be represented without introducing ambiguity. That matters: without the ability to represent a valid portion of a path, a sub-graph cannot be represented without including the entire description of every path that overlaps the region.
If we extend the orientation field to represent path starts and ends we also can resolve #24.
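To make the proposal concrete, here is a minimal sketch of reading such records and rebuilding each path by sorting on rank. It assumes whitespace-separated fields in the order shown above (node id, path name, rank, orientation) and ignores any trailing fields; the names `PathStep`, `parse_step` and `rebuild_paths` are hypothetical and not part of any GFA tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class PathStep(NamedTuple):
    node_id: str
    path_name: str
    rank: int
    orientation: str


def parse_step(line: str) -> PathStep:
    """Parse one proposed record: P <node id> <path name> <rank> <orientation>."""
    fields = line.split()
    if len(fields) < 5 or fields[0] != "P":
        raise ValueError(f"not a per-step P record: {line!r}")
    return PathStep(fields[1], fields[2], int(fields[3]), fields[4])


def rebuild_paths(lines: Iterable[str]) -> Dict[str, str]:
    """Group P records by path name and join the steps in rank order."""
    steps = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "P":
            step = parse_step(line)
            steps[step.path_name].append(step)
    return {
        name: ",".join(
            f"{s.node_id}{s.orientation}"
            for s in sorted(items, key=lambda s: s.rank)
        )
        for name, items in steps.items()
    }
```

Because every record carries its own rank, the records can appear in any order in the file and the path is still recoverable.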
from gfa-spec.
Looking at your example in issue 23, this path notation seems a bit unwieldy. It adds a lot of lines to the graph file, which increases the file size and decreases human readability. I prefer the one-line-per-path format for its neatness.
from gfa-spec.
I also see a potential problem with this notion of pulling out a sub-graph but keeping relevant parts of paths: a path can be broken into discontiguous pieces.
Again using your example from issue 23, let's say I pull out the subgraph including only segments 6 and 7 to get this file:
H HVN:Z:1.0
S 6 GGACTAA
P 6 s1 6 + 7M
P 6 s2 3 + 7M
P 6 s3 3 + 7M
P 6 s3 10 + 7M
P 6 s3 14 + 7M
L 6 + 7 + 0M
S 7 GGACAAAGGT
P 7 s1 7 + 10M
P 7 s2 4 + 10M
P 7 s3 4 + 10M
P 7 s3 11 + 10M
Path s1 is now 6+,7+ and path s2 is the same. But path s3 is now broken into three pieces: 6+; 6+,7+ and 6+,7+. That strikes me as very odd: a single path, s3, which consists of three separate parts. I feel like that's deviating from the notion of what a graph path is.
Am I correctly interpreting what you're saying about sub-graphs from a GFA file?
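The fragmentation can be checked mechanically. The sketch below is only an illustration: it takes the subgraph file above, groups the P records by path name, sorts them by rank, and starts a new piece wherever two ranks are not consecutive. The function name `path_pieces` is hypothetical, and it assumes the fourth field is an integer rank.

```python
from collections import defaultdict

SUBGRAPH = """\
H HVN:Z:1.0
S 6 GGACTAA
P 6 s1 6 + 7M
P 6 s2 3 + 7M
P 6 s3 3 + 7M
P 6 s3 10 + 7M
P 6 s3 14 + 7M
L 6 + 7 + 0M
S 7 GGACAAAGGT
P 7 s1 7 + 10M
P 7 s2 4 + 10M
P 7 s3 4 + 10M
P 7 s3 11 + 10M
"""


def path_pieces(gfa_text):
    """Return {path name: [contiguous pieces]} based on gaps in rank."""
    steps = defaultdict(list)
    for line in gfa_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "P":
            node, name, rank, orient = fields[1], fields[2], int(fields[3]), fields[4]
            steps[name].append((rank, node + orient))

    pieces = {}
    for name, items in steps.items():
        items.sort()  # order the steps by rank
        runs = []
        prev_rank = None
        for rank, step in items:
            if prev_rank is not None and rank == prev_rank + 1:
                runs[-1].append(step)  # consecutive rank: same piece
            else:
                runs.append([step])    # gap in rank: start a new piece
            prev_rank = rank
        pieces[name] = [",".join(run) for run in runs]
    return pieces


for name, parts in sorted(path_pieces(SUBGRAPH).items()):
    print(name, parts)
```

In rank order, s1 and s2 each come out as the single piece 6+,7+, while s3 comes out as 6+,7+ (ranks 3-4), 6+,7+ (ranks 10-11) and 6+ (rank 14), which are the three pieces described above.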
from gfa-spec.
You perfectly understand what I'm suggesting. I'll try to motivate this further with some discussion.
The ranks show that the paths are not completely represented. What this lets us do is see what is going on in a local context. We don't need to read through all of path s3 to know that it crosses node 6 three times. We can also see that there are parts of the path that are missing. We might say this is an invalid graph on that basis. Or, we could combine this piece with others to reconstruct the whole graph and its paths. If we have a whole valid graph, it will be clear because of the ranks listed in each path component record.
Breaking the paths into pieces is more verbose, but it allows us to manipulate the GFA representation in various ways to do text processing operations that yield different views into the graph. As in this example, we can order things by node, seeing what paths each node is in. Or, we can organize by path to see the subset of the graph associated with each path. More complex sorts would seem achievable with a bit of code, but in the end they would just be sorts of the lines. We wouldn't need to do anything within each line to be able to manipulate the path representation.
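As a rough sketch of the "just sorts of the lines" idea, the two views can be produced with nothing more than different sort keys over the P lines. This assumes whitespace-separated fields in the order used in the example (node id, path name, rank, orientation); the helper names are hypothetical, and node ids are compared as strings here, so a numeric ordering of nodes would need an extra conversion.

```python
def p_fields(line):
    """Split a per-step P line into (node id, path name, rank, orientation)."""
    _, node, path, rank, orient, *_rest = line.split()
    return node, path, int(rank), orient


def by_node(p_lines):
    """Node-centric view: which paths touch each node, and at which rank."""
    def key(line):
        node, path, rank, _ = p_fields(line)
        return (node, path, rank)
    return sorted(p_lines, key=key)


def by_path(p_lines):
    """Path-centric view: each path's steps in rank order."""
    def key(line):
        _, path, rank, _ = p_fields(line)
        return (path, rank)
    return sorted(p_lines, key=key)


lines = ["P 7 s1 7 +", "P 6 s2 3 +", "P 6 s1 6 +"]
print(by_node(lines))
print(by_path(lines))
```

Neither view rewrites a line; both only reorder whole lines, which is the property being argued for.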
We can work around the subsetting issue by not allowing it for paths, but this also seems like a problem to me because subsetting variation data has been very important for practical use. I know from work with larger graphs that it is here too. An alternative to allow subsets without passing along the entire path would be to make new semantics to describe how the paths are cut up. The logical extension of this is the general case in which each path component has a rank saying where it sits in the path it is part of.
@JervenBolleman helped me work through this in vg and has made an RDF serialization of sequence graphs+paths. He may have some other points.
Sorry to not be stable in my opinion here. As I have worked on problems with sequence graphs my perspective has evolved. I would rather do this in a coherent way than stick to my guns, so when I saw problems with what I had written I ended up rewriting a fair bit of my path handling code.
from gfa-spec.
I think that there are valid arguments for and useful features of both one line per path and one line per path segment. My preference is one line per path, and we've already discussed and nailed down that record format. I think that you could propose an alternate and different record for one line per path segment to complement the existing format.
from gfa-spec.
@ekg I was hoping to stay away from discussing formats like GFA which are about as human readable as Linear B ;) i.e. only readable to those trained in the specific art.
However, in the end you need the rank to accurately describe paths, either implicit or explicit. If you want subpaths the rank must be explicit or you lose information.
@rrwick in your example you only have non-joined pieces of the s3 path in the subpath s3, which is why the path looks odd. Unlike subpaths s1 and s2, which are only terminally fragmented, subpath s3 is internally fragmented as well. If you see each GFA file as a limited view on the whole variant graph then it makes more sense (at least to me).
from gfa-spec.
Paths are single-line records. Containment (C) records could be used to specify the precise location and alignment of segments (S) within the path (P).
from gfa-spec.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
from gfa-spec.