Giter Club home page Giter Club logo

libbdsg's People

Contributors

adamnovak avatar ctcisar avatar edawson avatar ekg avatar emikoifish avatar glennhickey avatar jeizenga avatar jervenbolleman avatar jltsiren avatar jonassibbesen avatar mlin avatar mr-c avatar ocxtal avatar rlorigro avatar xchang1 avatar yhoogstrate avatar yoheirosen avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libbdsg's Issues

Needs to be able to read GFA by itself

We need a GFA-to-HandleGraph parser available in the Python library, so people can read GFA and not need vg itself as a pre-importer.

We can still reject overlaps.

divide_handle broken on rev. strand for PackedGraph and HashGraph

Here's an example:

handle_t handle = g->create_handle("TTATATTCCAACTCTCTG");                                                   
// Should get (C,AGAG,AGTTG,GAATATAA)                                                                       
auto parts = g->divide_handle(g->flip(handle), {1, 5, 10});

vg and odgi give the expected division

Part 0 5:1 seq=C
Part 1 4:1 seq=AGAG
Part 2 3:1 seq=AGTTG
Part 3 2:1 seq=GAATATAA

HashGraph gives something else:

Part 0 4:1 seq=CAGAGAGTTG
Part 1 3:1 seq=CAGAG
Part 2 2:1 seq=C
Part 3 1:1 seq=AGAGAGTTGGAATATAA

And PackedGraph enters an infinite loop before producing any output

Speed up path position overlay

The major bottleneck to replacing xg+gbwt with gbz for some workflows is the path position interface. XG provides a fast native version, but gbz would fall back on the overlay. And the path position overlay, which works fine once built, takes too long to create for whole-genome graphs.

So what's needed is:

  • speed up path position overlay construction by parallelizing on paths (as @jeizenga suggests)

@adamnovak seems interested in implementing this ..

Resizing factor bug in PackedGraph

I made a mistake setting the resizing factor of the hash tables in PackedGraph, which it handled too gracefully for me to notice. I want to test out the fix's effects before I decide that it's better than just not fixing it.

Docs/Python bindings broke because of SnarlDistanceIndex

The ReadTheDocs builds have been failing for a while with:

ImportError: overloading a method with both static and instance methods is not supported; error while attempting to bind instance method SnarlDistanceIndex.get_net_handle(self: bdsg.bdsg.SnarlDistanceIndex, pointer: int, connectivity: bdsg.bdsg.SnarlDistanceIndex.connectivity_t) -> bdsg.handlegraph.net_handle_t

I think the code that makes the Python bindings can't handle having both static and non-static methods with the same name, which we do here:

const static handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity, net_handle_record_t type, size_t node_offset=0) {
if (pointer > ((size_t)1 << (64-BITS_FOR_TRIVIAL_NODE_OFFSET-3-4))-1) {
throw runtime_error("error: don't have space in net handle for record offset");
}
net_handle_t handle = as_net_handle((((((pointer << BITS_FOR_TRIVIAL_NODE_OFFSET) | node_offset) << 4) | connectivity)<<3) | type);
return handle;
}
handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity) const {
net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type();
size_t node_record_offset = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL ||
SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL ? 1 : 0;
return get_net_handle(pointer, connectivity, type, node_record_offset);
}
handlegraph::net_handle_t get_net_handle(size_t pointer) const {
net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type();
return get_net_handle(pointer, START_END, type);
}

@xchang1 Can you rename the static version of the method and see if the ReadTheDocs builds start working again? If you make a ReadTheDocs account I can give you access to the project to get the logs.

Name Change

Given the many other projects using the "sglib" name, we're considering renaming this one.

Also, we have the same problem as HTSlib where we end up with libsglib.a and it looks silly.

Names under consideration so far include:

  • odgi: Optimized Dynamic Graph Implementations. @ekg already has the actual odgi; we would just glom the graphs in here together with it and have one repo. The main obstacle here is that we'd have to glue the projects together, and I'm not sure if Erik wants to keep odgi separate, especially since (AFAIK) it includes that whole visualizer that wouldn't make as much sense in a graph implementation library.

Let's get Travis running the unit tests

I was never able to run the unit tests of libbdsg as I don't have a setup to build them. For #20, I just pasted everything inside vg, worked and tested from there, then moved it back out for the PR. Anyway, it'd be nice if Travis somehow managed to build and test PRs.

Deprecate libbdsg odgi

The odgi project's "odgi" graph is now substantially different than the libbdsg version, and they've never quite been serialized-data compatible, and the odgi tool wants to use GFA for interchange anyway. Also, libbdsg's odgi has outstanding issues (#19) and isn't really maintained, because all the real odgi graph data structure development happens in the odgi project.

We should deprecate libbdsg's odgi graph implementation.

Incrementing ids leaves HashGraph is sorry state

If I do

hashgraph->increment_node_ids(k);
copy_graph(hashgraph, target);

I get a segfault (same code works with packed graph).

But if I do

hashgraph->increment_node_ids(k);
hasgraph->reassign_node_ids([](nid_t node_id) { return node_id; });
copy_graph(hashgraph, target);

it's okay. To reproduce (in vg), just comment out the reassign_node_ids() call in vg concat here: https://github.com/vgteam/vg/pull/2692/files#diff-4178eec0eda65c2fd8385e1bd8f03d57R97-R98
and run the test

cd test ;  prove -v t/09_vg_concat.t 

Speed up finding reference paths on a handle: Make PathPositionOverlay also implement its own for_each_step_on_handle_impl?

For surjection, we work with path position overlays for a restricted subset of paths in the backing graph. But searching for all the steps on a handle can involve thinking about all the paths actually in the backing graph, and checking each of them. We would like to be able to just get the reference path steps on the handle efficiently.

One way to do this might be to re-implement for_each_step_on_handle_impl based on the information in the overlay that knows about the subsetting of the paths.

bool PositionOverlay::for_each_step_on_handle_impl(const handle_t& handle,
const function<bool(const step_handle_t&)>& iteratee) const {
return get_graph()->for_each_step_on_handle(handle, iteratee);
}

Deconstruction is still oddly slow with PackedReferencePathOverlay

@glennhickey says that, with the files in /public/home/hickey/dev/work/chicken-pangenome/debug on Courtyard, you can do a deconstruct run in about 32 minutes on 64 threads, including snarl build, when using the xg. But when using just the GBZ, when I od the equivalent of

deconstruct ./chicken-pg-sep23-new.gbz -t 64 -r ./chicken-pg-sep23-new.snarls

Then it crashes complaining that OpenMP can't make enough threads. When I cut the thread count to 16, it takes probably more than 2 hours.

The overlay build seems relatively fast; something important is probably still slow here and needs to be sped up. But first we need to collect actually comparable runtimes and probably profiles from two matched conditions with a consistent number of threads and a consistent choice of generating vs. loading snarls.

When running on (M1?) Mac, tests fail due to trying to refer across chains

After updating the submodules to versions that support M1 Mac and its Clang, when you build and run the tests, test_paged_vector<PagedVector<5, MappedBackend>>(); fails with:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Attempted to refer across chains!

It looks like the problem is in the assignment operator for the yomo::Pointer<>, and it's trying to point from normal memory into a yomo memory chain. I tried hacking it to allow yomo::Pointer<> to work from non-yomo memory, but that led to a different failure further along in the test, so I think that really we have a mistake somewhere which GCC/Linux doesn't notice.

BUILD_STATIC flag not working

Specifying BUILD_STATIC=ON doesn't actually create static libs when using CMake:

ExternalProject_Add(project_bdsg
        GIT_REPOSITORY https://github.com/vgteam/libbdsg.git
        PREFIX ${CMAKE_SOURCE_DIR}/external/bdsg/
        CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_SOURCE_DIR}/external/bdsg/ -DBUILD_STATIC=ON -DRUN_DOXYGEN=OFF -DBUILD_PYTHON_BINDINGS=OFF
        BUILD_IN_SOURCE True
        INSTALL_COMMAND make install
        )

results in:

[100%] Linking CXX executable bin/test_libbdsg
/usr/bin/ld: attempted static link of dynamic object `lib/libbdsg.so'
collect2: error: ld returned 1 exit status

and a look at the lib dir shows that nothing has changed:

libbdsg.so
libdivsufsort64.so -> libdivsufsort64.so.3
libdivsufsort64.so.3 -> libdivsufsort64.so.3.0.1
libdivsufsort64.so.3.0.1
libdivsufsort.so -> libdivsufsort.so.3
libdivsufsort.so.3 -> libdivsufsort.so.3.0.1
libdivsufsort.so.3.0.1
libgtest_main.so
libgtest.so
libhandlegraph.a
libhandlegraph.so
libsdsl.so -> libsdsl.so.3
libsdsl.so.2.1.0
libsdsl.so.2.3.0
libsdsl.so.3 -> libsdsl.so.2.3.0
pkgconfig

Python bindings don't bind any methods that use std::vector

It looks like @ctcisar turned off binding for std::vector. That makes the bindings build, but Handle Graph API methods that use std::vector (like MutablePathHandleGraph::rewrite_segment) don't get any bindings generated for them.

De-deactivating std::vector in config.cfg triggers what I think is a Binder bug, or at least an incompatibility with something in our project: RosettaCommons/binder#100

We need to fix this, or write other API methods that let you do the same things without std::vector somehow.

We should be able to install via `pip` without checking the `python3` on the PATH

Jonas managed to produce this:

    running build_ext
    cmake /private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg -DRUN_DOXYGEN=OFF -DPYTHON_EXECUTABLE=/usr/local/opt/python/bin/python3.7 -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg/build/lib.macosx-10.14-x86_64-3.7 -DCMAKE_BUILD_TYPE=Release
...
    CMake Error at CMakeLists.txt:182 (message):
      Python version mismatch: CMake wants to build for Python 3.7.6 at
      /usr/local/opt/python/bin/python3.7 but `python3` is Python 3.5.3 at
      /usr/local/bin/python3.  You will not be able to import the module in the
      current Python! To use the version CMake selected, run the build in a
      virtualenv with that Python version activated.  To use the version on your
      PATH, restart the build with -DPYTHON_EXECUTABLE=/usr/local/bin/python3 on
      the command line.

If -DPYTHON_EXECUTABLE is explicitly set, or maybe if some new CMake option is set, we should let setup.py bypass that check, since under setup.py we are pretty sure we have the right python.

missing Python.h on system where it is present

I'm unable to build this using cmake.

/home/erik/libbdsg/bin/_deps/pybind11-src/include/pybind11/detail/common.h:112:10: fatal error: Python.h: No such file or directory

However, I am able to build odgi, which includes pybind11 bindings and depends on Python.h.

What are the dependencies for libbdsg?

Add magic numbers to the serialize/deserialize methods

It would be handy to be able to identify the files' types. Serialize would prefix each file with some constant bytes for each graph type, and deserialize would skip over them.

We could also maybe add a method to SerializableHandleGraph that would return the magic number that the class uses, to let larger programs sniff types auto-magically.

VG's IO system now has support for ID-ing non-VPKG files by magic number, so this would help compatibility with vg.

Overlays don't expose path metadata methods

I added a bunch of path metadata methods in #153 but I never added the stubs to call through to the backing graph in the overlays. So I think right now when you call these methods on e.g. PackedPathPositionOverlay you end up with the default implementations from PathMetadata and you won't use any actually-efficient implementations from the backing graph.

Python bindings install with pip raises legacy-install-failure

Hello, and firstly thanks for this nice software.

Installation of the lib was flawless, however I'm currently having a hard time installing python bindings with pip. It throws legacy-install-failure each time.

I checked dependencies and installed them, tried the pip/setuptools/wheel update (as it's frequently mentioned when encountering this kind of issues) ; nonetheless, it didn't helped a lot. I also tried different versions of python (3.7, 3.10 and 3.11).

      -- Found PythonInterp: /scratch/.env/bin/python3.10 (found version "3.10.8")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Looking for C++ include cstdio
      -- Looking for C++ include cstdio - found
      -- Found Doxygen: /scratch/.env/bin/doxygen (found version "1.9.5") found components: doxygen missing components: dot
      libhandlegraph is add_subdirectory project
      -- Found PythonLibs: /scratch/.env/lib/libpython3.10.so
      -- pybind11 v2.4.dev4
      -- Performing Test HAS_FLTO
      -- Performing Test HAS_FLTO - Success
      -- LTO enabled
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/build/temp.linux-x86_64-cpython-310
      cmake --build . --config Release -- -j2
      [  0%] Building CXX object bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o
      [  0%] Creating directories for 'binder'
      [  0%] Performing download step (git clone) for 'binder'
      Cloning into 'binder'...
      In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:8:0,
                       from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/types.hpp:25:15: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
       typedef nid_t id_t;
                     ^
      In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:0:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp: In member function ‘bool handlegraph::HandleGraph::for_each_edge(const Iteratee&, bool) const’:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:65: error: expected primary-expression before ‘)’ token
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                       ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:68: error: expected primary-expression before ‘]’ token
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                          ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:70: error: expected primary-expression before ‘const’
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                            ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:97: error: expected unqualified-id before ‘bool’
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                                                       ^
      gmake[2]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o] Error 1
      gmake[1]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/all] Error 2
      gmake[1]: *** Waiting for unfinished jobs....
      HEAD is now at 788ab42... Merge pull request #99 from adamnovak/patch-1
      [  0%] Performing update step for 'binder'
      [  0%] No patch step for 'binder'
      [  0%] No configure step for 'binder'
      [  0%] No build step for 'binder'
      [  5%] Performing install step for 'binder'
      [  5%] Completed 'binder'
      [  5%] Built target binder
      gmake: *** [all] Error 2
      error: command '/scratch/.env/bin/cmake' failed with exit code 2
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> bdsg

Thanks by advance!
Best regards.

Overlays don't work well with haplotype paths

Right now, overlays like PackedPathPositionOverlay will index all the paths that for_each_path_handle() returns.

For backward compatibility, I have for_each_path_handle() omitting haplotype paths, at least for graphs like GBWTGraph where there are thousands of them, and you need to use the more advanced PathMetadata search methods to enumerate haplotype paths.

But this means that you can put a PackedPathPositionOverlay over a GBWTGraph, get a handle to a haplotype path by name, but then ask about positions on it when it hasn't actually been indexed during construction of the overlay. This in turn is going to break e.g. vg inject into a haplotype path.

The overlays need to be modified to handle haplotype paths. Either they need to be enumerated and processed (which is probably too slow), or they need to be detected and excluded (which is kind of useless because people probably want to be able to use them) or they need to be lazily indexed by the overlays (which might be inefficient but at least avoids indexing thousands of samples to inject into one).

How to extract most likely position of series of nodes

Hello,
I've imported a pg graph in python using the bdsg module. I'm processing a series of alignments to the graph itself.
For practical reasons I'm processing them as gaf alignments generated with vg convert -G.
For each read, I've got a path represented in the >47102051>47102052>47102053 format. For these I can extract all the possible positions for each node on every path. However, this is quite impractical when it comes to defining the most likely contiguous set of intervals. Is there a way to extract this type of information based on this information?
For example, if node 47102051 can come from "chr1:0-10" and "chr1:50-70", and node 47102052 comes from chr1:11-24, then the interval succession is likely to be: chr1:0-10 > chr1:11-24. Not sure if I'm explaining my problems clearly, but I hope it makes sense.

Thank you in advance,
Andrea

Add low level methods for handle graphs

I've been using paths quite a lot in bdsg and found that a couple methods would simplify things:

pop_path_front/back
rename_path

Based on the existing methods prepend_step and append_step, pop should be equivalently complex for each implementation. As it is now I am using this to "pop" elements:

        auto begin = graph.path_begin(p);
        auto next = graph.get_next_step(begin);
        graph.rewrite_segment(begin, next, vector<handle_t>());

And for the path renaming, I copy the entire path and delete the old one, so it would be cool if there was a copy-free method.

Looks for DYNAMIC in the wrong(?) place

The build system looks for DYANMIC as "dynamic.hpp" and not <dynamic/dynamic.hpp> where the vg build system installs it. The Makefile also ignores any CXXFLAGS trying to tell it to use -I options (or to use a different standard library, or anything else that might be needed).

The ODGI graph does not pass the handle graph tests ported over from VG

I haven't looked into this deeply yet, but we should figure out whether this is the fault of the tests or the implementation, or perhaps a disagreement over the generality of our formalism.

Currently, I've commented out ODGI in these tests so that they would still run for the other graphs.

Example of using bdsg.bdsg.PositionOverlay

Hello,
sorry for opening many issues. I'm trying to work out the problem explained here. In that topic, I've been suggested to use bdsg.bdsg.PositionOverlay() to perform efficient lookup. I'm struggling to understand how to use it though, would it be possible to provide some detailing/example of those?
Thank you in advance
Andrea

reinterpret_cast should maybe be static_cast

Mac Clang is complaining we don't know how to cast right:


src/path_position_overlays.cpp:431:16: warning: 'reinterpret_cast' to class 'handlegraph::MutablePathDeletableHandleGraph *' from its virtual base 'handlegraph::PathHandleGraph *' behaves differently from 'static_cast' [-Wreinterpret-base-class]
        return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/path_position_overlays.cpp:431:16: note: use 'static_cast' to adjust the pointer correctly while downcasting
        return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
               ^~~~~~~~~~~~~~~~
               static_cast

We could get into trouble with pointer offsets depending on how any given compiler is doing virtual base classes.

PackedGraph::get_edge_count() doesn't work

In particular

template<typename Backend>
size_t BasePackedGraph<Backend>::get_edge_count() const {
    // each edge (except reversing self edges) are stored twice in the edge vector
    return (edge_lists_iv.size() / EDGE_RECORD_SIZE + reversing_self_edge_records - deleted_edge_records) / 2;
}

doesn't necessarily return the same answer as the default (and correct) implementation

size_t HandleGraph::get_edge_count() const {
    size_t total = 0;
    for_each_edge([&](const edge_t& ignored) {
        total++;
    });
    return total;
};

Here's an example:

wget -q http://public.gi.ucsc.edu/~hickey/debug/chr12.clip.vg
vg stats -E chr12.clip.vg
1116343
vg convert -a chr12.clip.vg | vg stats -E -
1116345

I noticed this because vg clip uses get_edge_count() to make an array, which leads to an exception when the array is too small on this graph

vg clip -d 1 -P galGal6 chr12.clip.vg > chr12.clean.vg


#6    Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e3281, in vg::clip_low_depth_nodes_and_edges(handlegraph::MutablePathMutableHandleGraph*, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
      Source "src/clip.cpp", line 641, in clip_low_depth_nodes_and_edges [0x5570c70e3281]
#5    Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e2532, in vg::clip_low_depth_nodes_and_edges_generic(handlegraph::MutablePathMutableHandleGraph*, std::function<void (std::function<void (handlegraph::handle_t, vg::Region const*)>)>, std::function<void (std::function<void (std::pair<handlegraph::handle_t, handlegraph::handle_t>, vg::Region const*)>)>, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
      Source "src/clip.cpp", line 538, in clip_low_depth_nodes_and_edges_generic [0x5570c70e2532]

This is has likely been causing a few people crashes with minigraph-cactus, ex ComparativeGenomicsToolkit/cactus#1001, but this is the first time I've been able to reproduce myself.

attn @jeizenga

Too hard to use as an ExternalProject

@rlorigro tried to use libbdsg as a CMake ExternalProject.

It didn't really work; the ExternalProject dependencies that libbdsg pulls in weren't being installed alongside libbdsg itself when Ryan's CMake project installed libbdsg, and headers and libraries were missing.

The way he actually made libbdsg work as an ExternalProject dependency was to depend on libdsg_easy and make IMPORTED libraries for all the transitive dependency libraries:

https://github.com/rlorigro/GetBlunted/blob/aae419d3c2c338d7722cd4de585e80bfad479f9b/CMakeLists.txt#L145-L198

We need to devise a working tutorial for depending on libbdsg via CMake ExternalProject, ideally not using libbdsg_easy, and add it to the docs.

Question on extracted path with bdsg

Hello,
I'm trying to write a script that extract the genome-specific sequences using bdsg in python as suggested here https://github.com/vgteam/vg/issues/3153. I've managed to load a pg graph, extract the paths in there and separate the paths from the reference genome from the others:

pg = bdsg.bdsg.PackedGraph()
pg.deserialize( 'ingenome.pg' )
node_count = pg.get_node_count()
edge_count = pg.get_edge_count()
path_count = pg.get_path_count()

print('Found:')
print( ' - {} nodes\n - {} edges\n - {} paths'.format( node_count, edge_count, path_count ) )

# Get paths
path_handle = []
pg.for_each_path_handle(lambda y: path_handle.append(y) or True)
path_names = [pg.get_path_name(i) for i in path_handle ]

# Get target paths
target_paths = [ i for i in path_names if args.target_genome in i ]
target_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]

# Get query paths
query_paths = [ i for i in path_names if args.target_genome in i ]
query_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]

I've also seen that I can extract the nodes in a path of interest using the approach:

handles = []
pg.for_each_step_in_path(query_path_handle[0], lambda y: handles.append(pg.get_handle_of_step(y)) or True)

And their ids with:

nodes_ids = [pg.get_id(i) for i in handles]

My question is pretty straightforward: am I right that the list of nodes generated by for_each_step_in_path() is ordered by position in the path? So that the first element in the list is the first node and therefore the initial position is 0, whereas the second node will be 0 plus the sum of the preceeding nodes' length.

Thanks in advance
Andrea

Vectoizable Overlays assume that graph iteration order is consistent and thus break vg pack

The vectorizing overlays loop through all the nodes and edges in the graph, and treat that as the vectorized order.

The problem is that:

  1. The overlay calls for_each_handle on the underlying graph twice, and assumes that the iteration order will be the same both times. If the graph doesn't have an internal concept of ordering, I'm not sure we can rely on this.

  2. If you save and load the graph, the overlay will run again and create a possibly different vectorization. This causes trouble when using e.g. vg pack on a HashGraph. The pack file it produces uses the vectorization of the graph as coordinates, but next time you load the HashGraph you might not necessarily populate the hash table in the same way, and you might get a different iteration order, leading to a different vectorization.

The overlay needs to be made to always produce the same vectorization for the same graph contents (i.e. sort everything by node ID), or we need to bring the vectorization along with things that depend on it like pack files. Or, we need to enforce that graphs at least have a stable iteration order across serialization, even if they aren't themselves vectorizable.

@jeizenga Does this make sense to you?

HashGraph allows duplicate edges when it shouldn't.

create_edge is supposed to "ignore" existing edges, which to me means that if you add the same edge twice it exists only once.

https://github.com/vgteam/libhandlegraph/blob/e2c45aae9a19a14176e3dc412b6ebc038385a31b/src/include/handlegraph/mutable_handle_graph.hpp#L30-L31

But HashGraph seems to accept any number of duplicates of an edge. There's no pre-existence check:

void HashGraph::create_edge(const handle_t& left, const handle_t& right) {
if (get_is_reverse(left)) {
graph[get_internal_id(left)].left_edges.push_back(right);
}

HashGraph should deduplicate edges.

Deleting steps while looping through the steps on a node

I want to loop over the steps on a node with HashGraph::for_each_step_on_handle_impl and delete their corresponding paths with HashGraph::destroy_path. However, if I do this, the reordering of the steps on the node that the delete does will not be picked up on by the iteration, and I won't necessarily visit all the steps on the node.

Either we need to make this work, or we need to specify that delete operations are not permitted during iterations over things they might change.

Question on extracting non-reference nodes

Hello,
some time ago I've opened a issue on the vg page asking suggestions on how to extract non-reference nodes (see here).
I've been working on a script, and I'd just like to confirm that I'm not doing anything wrong.
Given a genome genome1, I extract the node ids into the paths for that genome:

    print("Getting reference paths")
    reference_path_handle = [ i for i in path_handle if args.reference_genome in pg.get_path_name(i) ]
    ## Get nodes in reference
    print("Getting reference node ids")
    ref_nodes_ids = id_of_nodes_in_path(pg, reference_path_handle)

Where id_of_nodes_in_path is:

# Get node ids 
def id_of_nodes_in_path(pg, my_path_handle):
    nodes_ids = []
    for path in my_path_handle:
        pg.for_each_step_in_path( path, 
            lambda y: nodes_ids.append(pg.get_id(pg.get_handle_of_step(y))) or True)
    return set(nodes_ids)

I then process the nodes in the remaining nodes and check whether they are into the ref_nodes_ids list:

    print("Getting query paths")
    query_path_handle = [ i for i in path_handle if args.reference_genome not in pg.get_path_name(i) ]

    # Get specific nodes:
    print("Getting query-specific nodes")
    intervals = non_ref_nodes_and_pos_path(pg, query_path_handle, ref_nodes_ids)

Where non_ref_nodes_and_pos_path is:

def non_ref_nodes_and_pos_path(pg, my_path_handles, my_ref_nodes):
    intervals = []
    for path_handle in my_path_handles:
        path_id = pg.get_path_name(path_handle)
        bpi = 0 
        bpe = 0
        strand = ''
        handles = []
        # extract handles for each step into a path
        pg.for_each_step_in_path( path_handle, 
            lambda y: handles.append(pg.get_handle_of_step(y)) or True)
        # get sequence, node id and strand for each handle in path 
        for handle in handles:
            seq = pg.get_sequence(handle)        
            node_id = pg.get_id(handle)
            strand = "-" if pg.get_is_reverse(handle) else "+"
            bpe = bpi + len(seq)
            # Check if a node ID is into the reference node list, if not add to intervals
            if node_id not in my_ref_nodes:
                intervals.append([path_id, bpi, bpe, node_id, strand, seq])
            bpi = bpe # Update initial position
    return intervals

Does this make sense?

Thanks for your help

Andrea

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.