Comments (5)
PS: At this point, we might even consider unifying an assembly/string graph and GA4GH's side graph, but let's reach a consensus on this issue first and then think about the side graph extension.
from gfa-spec.
I have to say that I like the symmetry of E/B; however, it does mean that when fetching the sequence, the position of the E/B field determines whether you look up the reverse complement or not (correct me if I'm wrong).
One issue with readability is that the current path spec appends + or - for the segments, e.g.
S seqE *
S seqB *
S seqEB *
L seqB E seqE B 0M
L seqE E seqEB B 0M
P pathBEEB seqBE,seqEB,seqEBB 0M,0M
So using E and B in the path is pretty awkward. This can be amended by moving the orientation (wrong word, but works for now) to a new column:
S seqE *
S seqB *
S seqEB *
L seqB E seqE B 0M
L seqE E seqEB B 0M
P pathBEEB seqB,seqE,seqEB E,B,B 0M,0M
The semantics would have to be explained carefully. In this case we start with seqB, leave from its E, traverse to B of seqE, and necessarily use E of seqE to reach B of seqEB.
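One way to read the extra column can be sketched in Python. This is only a sketch under an assumed interpretation: the first entry of the new column is the end through which the path leaves the first segment, every later entry is the end through which the path enters that segment, and a segment is always left through the end opposite to the one it was entered by. The helper names `parse_links` and `path_steps` are hypothetical and not part of any GFA tool.

```python
# Assumed semantics (not settled in this thread): ends[0] is the exit end of
# the first segment; ends[i] for i > 0 is the entry end of segment i.
OPPOSITE = {"B": "E", "E": "B"}

def parse_links(lines):
    """Collect L lines of the form 'L segA endA segB endB overlap'."""
    links = set()
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "L":
            _, seg_a, end_a, seg_b, end_b = fields[:5]
            links.add((seg_a, end_a, seg_b, end_b))
    return links

def path_steps(p_line):
    """Expand 'P name segs ends overlaps' into (segA, exit, segB, entry) steps."""
    fields = p_line.split()
    segments = fields[2].split(",")
    ends = fields[3].split(",")
    if len(segments) != len(ends):
        raise ValueError("segment and end columns differ in length")
    steps = []
    exit_end = ends[0]
    for i in range(1, len(segments)):
        entry_end = ends[i]
        steps.append((segments[i - 1], exit_end, segments[i], entry_end))
        # A segment entered at one end is left through the other end.
        exit_end = OPPOSITE[entry_end]
    return steps

gfa = [
    "S seqE *",
    "S seqB *",
    "S seqEB *",
    "L seqB E seqE B 0M",
    "L seqE E seqEB B 0M",
    "P pathBEEB seqB,seqE,seqEB E,B,B 0M,0M",
]
links = parse_links(gfa)
steps = path_steps(gfa[-1])
path_is_supported = all(step in links for step in steps)
```

Under this reading, every step of the example path corresponds to one of the declared L lines, which is the property a validator would want to check.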
Hmm... I haven't thought about paths. One option is to use:
J contig1'E contig2'B 0M
P path1 contig1'B,contig2'B 0M
This means that path1 enters the start of contig1 and then the start of contig2. I am using ' as the separator instead of : because the latter is more frequently seen in sequence names.
At first sight, this may seem less consistent than the current GFA:
L contig1 + contig2 + 0M
P path1 contig1+,contig2+ 0M
in that the orientation on L is the same as the orientation on P. However, for a more complex example:
L contig1 - contig2 + 0M
L contig3 + contig2 - 0M
P path1 contig3+,contig2-,contig1+ 0M,0M
The orientation on P does not always follow the orientation on L. It doesn't seem better than the B/E notation:
J contig1'B contig2'B 0M
J contig3'E contig2'E 0M
P path1 contig3'B,contig2'E,contig1'B 0M,0M
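The correspondence between the two notations in these examples can be sketched as a small converter. The mapping below is inferred only from the examples in this comment (on an L line, `+` on the first segment means the link leaves its E and `-` means it leaves its B; `+` on the second segment means the link arrives at its B and `-` at its E; on a P line, `+` means the path enters at B and `-` at E). The function names are hypothetical, and the J line is a proposal from this thread, not part of the GFA spec.

```python
def link_to_join(l_line):
    """Rewrite 'L a o1 b o2 overlap' as the proposed 'J a'X b'Y overlap'."""
    _, seg_a, ori_a, seg_b, ori_b, overlap = l_line.split()[:6]
    end_a = "E" if ori_a == "+" else "B"  # end the link leaves from
    end_b = "B" if ori_b == "+" else "E"  # end the link arrives at
    return "J {}'{} {}'{} {}".format(seg_a, end_a, seg_b, end_b, overlap)

def path_to_be(p_line):
    """Rewrite 'P name seg+,seg-,... overlaps' using entry ends B/E."""
    _, name, steps, overlaps = p_line.split()[:4]
    converted = []
    for step in steps.split(","):
        segment, orientation = step[:-1], step[-1]
        entry_end = "B" if orientation == "+" else "E"
        converted.append("{}'{}".format(segment, entry_end))
    return "P {} {} {}".format(name, ",".join(converted), overlaps)

for line in ("L contig1 - contig2 + 0M", "L contig3 + contig2 - 0M"):
    print(link_to_join(line))
print(path_to_be("P path1 contig3+,contig2-,contig1+ 0M,0M"))
```

Applied to the three +/- lines of the complex example, this reproduces the three B/E lines shown above.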
Hi, @lh3 @pmelsted, the "B" and "E" notation that I am using is from Gene Myers' 2005 paper (http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract). On the second page, it describes how to construct a directed (not bi-directed) string graph. Here are some excerpts from the text:
For each such read f there will be two vertices, f.B and f.E, one for each end of the read, in the string graph.
and
one adds a directed edge labeled with the non-matched or overhanging sequence at each end of the proper overlap between two reads
Later, the author describes how the "directed graph" is related to the "bidirected graph" and its equivalence to the earlier overlap graph:
At this juncture observe that because no read contains the other, every composite edge e = v1 → v2 → v3 → ··· → vn has a complementary edge comp(e) = comp(vn) → ··· → comp(v3) → comp(v2) → comp(v1) where comp(v) is the other end of the read for f. That is, if v = f.B then comp(v) = f.E and if v = f.E then comp(v) = f.B. This property implies that we may instead think of the endpoint pairs as a single vertex with bidirected edges corresponding to an edge and its complement, where an arrowhead is directed into a vertex if the edge to its .E vertex is the head of the relevant one of the two complementary edges, and directed out of the vertex otherwise. This gives us a framework identical to the one introduced by this author and Kececioglu (KeJ95) where a tour through a vertex must involve one inward arrowhead and one outward arrowhead.
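The complement rule in the excerpt can be illustrated with a few lines of Python. This is a minimal sketch that assumes vertices are written as strings of the form `read.B` or `read.E`; the function names and the example reads `f`, `g`, `h` are made up for illustration.

```python
def comp_vertex(vertex):
    """Return the other end of the same read: f.B <-> f.E."""
    read, end = vertex.rsplit(".", 1)
    if end not in ("B", "E"):
        raise ValueError("vertex must end in .B or .E: " + vertex)
    return read + "." + ("E" if end == "B" else "B")

def comp_edge(path):
    """Complement of a composite edge v1 -> ... -> vn.

    The vertex order is reversed and every vertex is swapped for the other
    end of its read, giving comp(vn) -> ... -> comp(v1).
    """
    return [comp_vertex(v) for v in reversed(path)]

edge = ["f.E", "g.E", "h.E"]       # hypothetical composite edge
complement = comp_edge(edge)       # comp(h.E) -> comp(g.E) -> comp(f.E)
```

Applying `comp_edge` twice returns the original edge, which is the involution that makes it possible to merge each endpoint pair into one bidirected vertex.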
Notice that the definition of the string graph differs from joining the ends of two arbitrary segments. There is always a substring from a single read associated with each edge, while the edges always link vertices from two reads. The order of the vertex labels is important, as it represents the direction of the edges. This is different from the "J" proposal, if I understand it right.
While the bidirected graph representation is indeed more compact than the directed graph + duality tracking representation for an assembly string graph, the benefit of using a directed graph is that it is easier to apply most graph algorithms using existing graph libraries without constructing customized classes or objects. It is also easier to visualize and to construct sequences by simply concatenating the edges, without explicitly specifying which orientation to use. The downside is that one needs to be sure to preserve the duality symmetry in any operation that modifies the graph.
My preference is that if we want to use the "B" and "E" notation and call GFA a representation of string graphs, it would be better to stick with the definition used in Gene's paper to avoid future confusion. Otherwise, it might be better to just use "+" and "-" and make it clear that GFA is for overlap graphs rather than Gene Myers' version of the string graph. In theory the two graphs are isomorphic apart from some notational differences, so one can convert between them if necessary. However, it would be useful to keep some extra data on each "L" line to make conversions and some operations easier.
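To make the point about duality symmetry concrete, here is a small sketch using a plain set of directed edges rather than any particular graph library. It assumes vertices are strings of the form `read.B` / `read.E`; the edges and helper names are hypothetical examples, not data from a real assembly.

```python
def comp_vertex(vertex):
    """Return the other end of the same read: f.B <-> f.E."""
    read, end = vertex.rsplit(".", 1)
    return read + "." + ("E" if end == "B" else "B")

def dual(edge):
    """Dual of u -> v is comp(v) -> comp(u)."""
    u, v = edge
    return (comp_vertex(v), comp_vertex(u))

def is_dual_symmetric(edges):
    """True if every edge has its dual in the edge set."""
    return all(dual(edge) in edges for edge in edges)

def remove_edge(edges, edge):
    """Remove an edge together with its dual, preserving the symmetry."""
    return edges - {edge, dual(edge)}

# Hypothetical directed string graph: each overlap contributes two edges.
edges = {
    ("f.E", "g.E"), ("g.B", "f.B"),
    ("g.E", "h.E"), ("h.B", "g.B"),
}

naive = edges - {("f.E", "g.E")}            # forgets the dual edge
careful = remove_edge(edges, ("f.E", "g.E"))

print(is_dual_symmetric(edges), is_dual_symmetric(naive), is_dual_symmetric(careful))
```

Deleting a single edge without its dual breaks the symmetry, while routing every modification through a helper such as `remove_edge` keeps it intact; that bookkeeping is the cost of the directed representation.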
Yes, I am aware that Myers' string graph is different from my proposal. On second thought, I think I should back off. The various graph representations are essentially the same. Perhaps the current GFA is okay after all. I am closing this issue. Thank you.