
Comments (5)

lh3 commented on September 24, 2024

PS: at this point, we might even consider unifying an assembly/string graph with GA4GH's side graph, but let's reach a consensus on this issue first and then think about the side graph extension.

from gfa-spec.

pmelsted commented on September 24, 2024

I have to say that I like the symmetry of E/B. However, it does mean that when fetching the sequence, the position of the E/B field determines whether you look for the reverse complement or not (correct me if I'm wrong).

One issue with readability is that the current path spec appends + or - for the segments, e.g.

S seqE *
S seqB *
S seqEB *
L seqB E seqE B 0M
L seqE E seqEB B 0M
P pathBEEB seqBE,seqEB,seqEBB 0M,0M

So using E and B in the path is pretty awkward. This can be amended by pushing the orientation (wrong word, but it works for now) into a new column:

S seqE *
S seqB *
S seqEB *
L seqB E seqE B 0M
L seqE E seqEB B 0M
P pathBEEB seqB,seqE,seqEB E,B,B 0M,0M

The semantics would have to be explained carefully. In this case we start with seqB, leave it from E, traverse to B of seqE, and necessarily use E of seqE to reach B of seqEB.
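One possible reading of that extra column can be sketched in Python. This is only an illustration under stated assumptions, not part of any GFA specification: the first entry of the column is taken to be the end the path leaves the first segment from, every later entry is the end through which the path enters that segment, and the path then leaves through the opposite end. The function name `path_junctions` is hypothetical.

```python
# Sketch only: one possible interpretation of the proposed orientation column.
# Assumption: ends[0] is the end the path LEAVES the first segment from;
# each later entry is the end the path ENTERS that segment through,
# and the path leaves that segment through the opposite end.

FLIP = {"B": "E", "E": "B"}


def path_junctions(segments, ends):
    """Expand a path into ((segment, end), (segment, end)) junctions."""
    segs = segments.split(",")
    marks = ends.split(",")
    if len(segs) != len(marks):
        raise ValueError("need exactly one end marker per segment")
    if any(m not in FLIP for m in marks):
        raise ValueError("end markers must be B or E")

    junctions = []
    leave = marks[0]
    for i in range(1, len(segs)):
        junctions.append(((segs[i - 1], leave), (segs[i], marks[i])))
        leave = FLIP[marks[i]]  # exit through the end opposite to the entry
    return junctions


if __name__ == "__main__":
    for src, dst in path_junctions("seqB,seqE,seqEB", "E,B,B"):
        print(src, "->", dst)
```

Under this reading, each junction produced for the path above can be looked up directly among the two L lines of the example.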


lh3 commented on September 24, 2024

Hmm... I haven't thought about paths. One option is to use:

J  contig1'E  contig2'B  0M
P  path1  contig1'B,contig2'B  0M

This means that path1 enters the start of contig1 and then the start of contig2. I am using ' as the separator instead of : because the latter is more frequently seen in sequence names.

At first sight, this may seem less consistent than the current GFA:

L  contig1  +  contig2  +  0M
P  path1  contig1+,contig2+  0M

in that the orientation on L is the same as orientation on P. However, for a more complex example:

L  contig1  -  contig2  +  0M
L  contig3  +  contig2  -  0M
P  path1  contig3+,contig2-,contig1+  0M,0M

The orientation on P does not always follow the orientation on L. It doesn't seem better than the B/E notation:

J  contig1'B  contig2'B  0M
J  contig3'E  contig2'E  0M
P  path1  contig3'B,contig2'E,contig1'B  0M,0M
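The correspondence between the two notations in the examples above can be written down mechanically. The following Python sketch is inferred only from the examples in this thread: on an L line the first segment is left through E when its orientation is + and through B when it is -, while a segment is entered through B when its orientation is + and through E when it is -. The function names are hypothetical, and the J record is the proposal under discussion, not an existing GFA record type.

```python
# Sketch: translate +/- orientations into the proposed B/E end notation.
# The mapping is inferred from the examples in this thread; the J record
# and the ' separator are the proposal being discussed, not current GFA.


def leaving_end(orient):
    """End through which a link leaves its first segment."""
    return "E" if orient == "+" else "B"


def entering_end(orient):
    """End through which a link or a path step enters a segment."""
    return "B" if orient == "+" else "E"


def link_to_join(line):
    """Convert 'L seg1 o1 seg2 o2 overlap' into a tab-separated J record."""
    tag, seg1, o1, seg2, o2, overlap = line.split()
    if tag != "L":
        raise ValueError("expected an L record")
    return f"J\t{seg1}'{leaving_end(o1)}\t{seg2}'{entering_end(o2)}\t{overlap}"


def path_steps_to_ends(steps):
    """Convert 'contig3+,contig2-' into "contig3'B,contig2'E"."""
    return ",".join(f"{s[:-1]}'{entering_end(s[-1])}" for s in steps.split(","))


if __name__ == "__main__":
    print(link_to_join("L  contig1  -  contig2  +  0M"))
    print(link_to_join("L  contig3  +  contig2  -  0M"))
    print(path_steps_to_ends("contig3+,contig2-,contig1+"))
```

Applied to the three-contig example, this reproduces the J and P lines shown just above.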


pb-jchin commented on September 24, 2024

Hi, @lh3 @pmelsted, the "B" and "E" notation that I am using is from Gene Myers' 2005 paper (http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract). On the second page, it describes how to construct a directed (not bidirected) string graph. Here are some excerpts from the text:

For each such read f there will be two vertices, f.B and f.E, one for each end of the read, in the string graph.

and

one adds a directed edge labeled with the non-matched or overhanging sequence at each end of the proper overlap between two reads

Later, the author describes how the "directed graph" is related to the "bidirected graph" and its equivalence to the earlier overlap graph:

At this juncture observe that because no read contains the other, every composite edge e = v1 → v2 → v3··· → vn has a complementary edge comp(e) = comp(vn) → ...comp(v3) → comp(v2) → comp(v1) where comp(v) is the other end of the read for f. That is, if v = f.B then comp(v) = f.E and if v = f.E then comp(v) = f.B. This property implies that we may instead think of the endpoint pairs as a single vertex with bidirected edges corresponding to an edge and its complement, where an arrowhead is directed into a vertex if the edge to its .E vertex is the head of the relevant one of the two complementary edges, and directed out of the vertex otherwise. This gives us a framework identical to the one introduced by this author and Kececioglu (KeJ95) where a tour through a vertex must involve one inward arrowhead and one outward arrowhead.
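The comp operation in the excerpt is small enough to write out. Here is a minimal Python sketch, assuming vertices are stored as strings of the form `read.B` or `read.E`; that naming scheme and the function names are illustrative choices, not something the paper or GFA prescribes.

```python
# Sketch of comp(v) and comp(e) from the excerpt above.
# Assumption: a vertex is the string "<read>.B" or "<read>.E".

OTHER_END = {"B": "E", "E": "B"}


def comp(vertex):
    """Return the other end of the same read: f.B <-> f.E."""
    read, end = vertex.rsplit(".", 1)
    if end not in OTHER_END:
        raise ValueError(f"not a read end: {vertex!r}")
    return f"{read}.{OTHER_END[end]}"


def comp_path(path):
    """Complementary edge: reverse the vertex order and complement each vertex."""
    return [comp(v) for v in reversed(path)]


if __name__ == "__main__":
    print(comp_path(["a.B", "b.E", "c.E"]))
```

Applying `comp_path` twice returns the original path, which is the duality the excerpt relies on.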

Notice that the definition of the string graph is different from joining the ends of two arbitrary segments. There is always a substring associated with each edge, taken from a single read, while the edges always link vertices from two reads. The order of the vertex labels is important as it represents the direction of the edge. This is different from the "J" proposal, if I understand it right.

The bidirected graph representation is indeed more compact than the directed graph + duality tracking representation for an assembly string graph. The benefit of using a directed graph is that it is easier to apply most graph algorithms using existing graph libraries without constructing customized classes or objects. It is also easier for visualization and for constructing sequences by simply concatenating the edges without explicitly specifying which orientation to use. The downside is that one needs to be sure to preserve the duality symmetry for any operation that modifies the graph.
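The duality symmetry mentioned above can be checked after each modification. The Python sketch below is one way to do it, assuming the directed graph is held as a plain collection of (u, v) pairs over vertices named `read.B` and `read.E`; the representation and the name `missing_duals` are hypothetical.

```python
# Sketch: verify that every directed edge u -> v has its dual comp(v) -> comp(u).
# Assumption: vertices are strings "<read>.B" / "<read>.E" and the graph is
# given as an iterable of (u, v) pairs.

OTHER_END = {"B": "E", "E": "B"}


def other_end(vertex):
    """Return the opposite end of the same read."""
    read, end = vertex.rsplit(".", 1)
    return f"{read}.{OTHER_END[end]}"


def missing_duals(edges):
    """Return the edges whose complementary edge is absent from the graph."""
    edge_set = set(edges)
    return sorted(
        (u, v)
        for (u, v) in edge_set
        if (other_end(v), other_end(u)) not in edge_set
    )


if __name__ == "__main__":
    graph = [("a.E", "b.E"), ("b.B", "a.B")]
    print(missing_duals(graph))       # symmetric graph: nothing is missing
    print(missing_duals(graph[:1]))   # dual edge removed: reports the orphan
```

An empty result means the symmetry holds, so a check like this can run after any edge removal or contraction.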

My preference is that if we want to use the "B" and "E" notation and say that GFA represents a string graph, it will be better to stick with the definition used in Gene's paper to avoid future confusion. Otherwise, it might be better to just use "+" and "-" and make it clear that GFA is for the overlap graph rather than Gene Myers' version of the string graph. In theory, the two graphs are isomorphic apart from some notation differences, so one can convert between them if necessary. However, it will be useful to keep some extra data on each "L" line to make conversions and some operations easier.


lh3 commented on September 24, 2024

Yes, I am aware that Myers' string graph is different from my proposal. On second thought, I think I should back off. The various graph representations are essentially the same. Perhaps the current GFA is okay after all. I am closing this issue. Thank you.

