vgteam / libbdsg Goto Github PK

View Code? Open in Web Editor NEW

30.0 30.0 6.0 14.75 MB

Optimized sequence graph implementations for graph genomics

License: MIT License

C++ 97.19% Makefile 0.25% CMake 0.73% Python 1.00% TeX 0.84%

libbdsg's People

Contributors

Stargazers

Watchers

Forkers

jeizenga ctcisar chfi meredith705 rlorigro

libbdsg's Issues

Needs to be able to read GFA by itself

We need a GFA-to-HandleGraph parser available in the Python library, so people can read GFA and not need vg itself as a pre-importer.

We can still reject overlaps.

divide_handle broken on rev. strand for PackedGraph and HashGraph

Here's an example:

handle_t handle = g->create_handle("TTATATTCCAACTCTCTG");                                                   
// Should get (C,AGAG,AGTTG,GAATATAA)                                                                       
auto parts = g->divide_handle(g->flip(handle), {1, 5, 10});

vg and odgi give the expected division

Part 0 5:1 seq=C
Part 1 4:1 seq=AGAG
Part 2 3:1 seq=AGTTG
Part 3 2:1 seq=GAATATAA

HashGraph gives something else:

Part 0 4:1 seq=CAGAGAGTTG
Part 1 3:1 seq=CAGAG
Part 2 2:1 seq=C
Part 3 1:1 seq=AGAGAGTTGGAATATAA

And PackedGraph enters an infinite loop before producing any output

Who are the copyright holders of this code?

A mention in the license, README, and source files would be great!

Bonus points: https://reuse.software/spec/

Thanks!

Installing the Python module on Mac claims to succeed at building the wheel, but ships a module file that can't find its libraries

We should do something in setup.py to selectively fail the wheel build on Mac, so it falls back on non-wheel cmake-install-directly-to-the-final-path installation, which might work.

Or, we should fix the Mac build so the collection of dylibs we are left with after a CMake build can find each other and are relocatable.

PackedGraph fails to destroy handle after dividing

attn @jeizenga

See the breaking unit test added in vgteam/vg#3172

Speed up path position overlay

The major bottleneck to replacing xg+gbwt with gbz for some workflows is the path position interface. XG provides a fast native version, but gbz would fall back on the overlay. And the path position overlay, which works fine once built, takes too long to create for whole-genome graphs.

So what's needed is:

speed up path position overlay construction by parallelizing on paths (as @jeizenga suggests)

@adamnovak seems interested in implementing this ..

Resizing factor bug in PackedGraph

I made a mistake setting the resizing factor of the hash tables in PackedGraph, which it handled too gracefully for me to notice. I want to test out the fix's effects before I decide that it's better than just not fixing it.

Docs/Python bindings broke because of SnarlDistanceIndex

The ReadTheDocs builds have been failing for a while with:

ImportError: overloading a method with both static and instance methods is not supported; error while attempting to bind instance method SnarlDistanceIndex.get_net_handle(self: bdsg.bdsg.SnarlDistanceIndex, pointer: int, connectivity: bdsg.bdsg.SnarlDistanceIndex.connectivity_t) -> bdsg.handlegraph.net_handle_t

I think the code that makes the Python bindings can't handle having both static and non-static methods with the same name, which we do here:

libbdsg/bdsg/include/bdsg/snarl_distance_index.hpp

Lines 689 to 708 in 5391889

 const static handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity, net_handle_record_t type, size_t node_offset=0) { 

 if (pointer > ((size_t)1 << (64-BITS_FOR_TRIVIAL_NODE_OFFSET-3-4))-1) { 

 throw runtime_error("error: don't have space in net handle for record offset"); 

 } 

 net_handle_t handle = as_net_handle((((((pointer << BITS_FOR_TRIVIAL_NODE_OFFSET) | node_offset) << 4) | connectivity)<<3) | type); 

 return handle; 

 } 

 handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity) const { 

 net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type(); 

 size_t node_record_offset = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL || 

 SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL ? 1 : 0; 

 return get_net_handle(pointer, connectivity, type, node_record_offset); 

 } 

 handlegraph::net_handle_t get_net_handle(size_t pointer) const { 

 net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type(); 

 return get_net_handle(pointer, START_END, type); 

 }

@xchang1 Can you rename the static version of the method and see if the ReadTheDocs builds start working again? If you make a ReadTheDocs account I can give you access to the project to get the logs.

Name Change

Given the many other projects using the "sglib" name, we're considering renaming this one.

Also, we have the same problem as HTSlib where we end up with libsglib.a and it looks silly.

Names under consideration so far include:

odgi: Optimized Dynamic Graph Implementations. @ekg already has the actual odgi; we would just glom the graphs in here together with it and have one repo. The main obstacle here is that we'd have to glue the projects together, and I'm not sure if Erik wants to keep odgi separate, especially since (AFAIK) it includes that whole visualizer that wouldn't make as much sense in a graph implementation library.

Let's get Travis running the unit tests

I was never able to run the unit tests of libbdsg as I don't have a setup to build them. For #20, I just pasted everything inside vg, worked and tested from there, then moved it back out for the PR. Anyway, it'd be nice if Travis somehow managed to build and test PRs.

Deprecate libbdsg odgi

The odgi project's "odgi" graph is now substantially different than the libbdsg version, and they've never quite been serialized-data compatible, and the odgi tool wants to use GFA for interchange anyway. Also, libbdsg's odgi has outstanding issues (#19) and isn't really maintained, because all the real odgi graph data structure development happens in the odgi project.

We should deprecate libbdsg's odgi graph implementation.

Incrementing ids leaves HashGraph is sorry state

If I do

hashgraph->increment_node_ids(k);
copy_graph(hashgraph, target);

I get a segfault (same code works with packed graph).

But if I do

hashgraph->increment_node_ids(k);
hasgraph->reassign_node_ids([](nid_t node_id) { return node_id; });
copy_graph(hashgraph, target);

it's okay. To reproduce (in vg), just comment out the reassign_node_ids() call in vg concat here: https://github.com/vgteam/vg/pull/2692/files#diff-4178eec0eda65c2fd8385e1bd8f03d57R97-R98
and run the test

cd test ;  prove -v t/09_vg_concat.t

Speed up finding reference paths on a handle: Make PathPositionOverlay also implement its own for_each_step_on_handle_impl?

For surjection, we work with path position overlays for a restricted subset of paths in the backing graph. But searching for all the steps on a handle can involve thinking about all the paths actually in the backing graph, and checking each of them. We would like to be able to just get the reference path steps on the handle efficiently.

One way to do this might be to re-implement for_each_step_on_handle_impl based on the information in the overlay that knows about the subsetting of the paths.

libbdsg/bdsg/src/path_position_overlays.cpp

Lines 149 to 152 in 06ab14b

 bool PositionOverlay::for_each_step_on_handle_impl(const handle_t& handle, 

 const function<bool(const step_handle_t&)>& iteratee) const { 

 return get_graph()->for_each_step_on_handle(handle, iteratee); 

 }

Deconstruction is still oddly slow with PackedReferencePathOverlay

@glennhickey says that, with the files in /public/home/hickey/dev/work/chicken-pangenome/debug on Courtyard, you can do a deconstruct run in about 32 minutes on 64 threads, including snarl build, when using the xg. But when using just the GBZ, when I od the equivalent of

deconstruct ./chicken-pg-sep23-new.gbz -t 64 -r ./chicken-pg-sep23-new.snarls

Then it crashes complaining that OpenMP can't make enough threads. When I cut the thread count to 16, it takes probably more than 2 hours.

The overlay build seems relatively fast; something important is probably still slow here and needs to be sped up. But first we need to collect actually comparable runtimes and probably profiles from two matched conditions with a consistent number of threads and a consistent choice of generating vs. loading snarls.

When running on (M1?) Mac, tests fail due to trying to refer across chains

After updating the submodules to versions that support M1 Mac and its Clang, when you build and run the tests, test_paged_vector<PagedVector<5, MappedBackend>>(); fails with:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Attempted to refer across chains!

It looks like the problem is in the assignment operator for the yomo::Pointer<>, and it's trying to point from normal memory into a yomo memory chain. I tried hacking it to allow yomo::Pointer<> to work from non-yomo memory, but that led to a different failure further along in the test, so I think that really we have a mistake somewhere which GCC/Linux doesn't notice.

BUILD_STATIC flag not working

Specifying BUILD_STATIC=ON doesn't actually create static libs when using CMake:

ExternalProject_Add(project_bdsg
        GIT_REPOSITORY https://github.com/vgteam/libbdsg.git
        PREFIX ${CMAKE_SOURCE_DIR}/external/bdsg/
        CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_SOURCE_DIR}/external/bdsg/ -DBUILD_STATIC=ON -DRUN_DOXYGEN=OFF -DBUILD_PYTHON_BINDINGS=OFF
        BUILD_IN_SOURCE True
        INSTALL_COMMAND make install
        )

results in:

[100%] Linking CXX executable bin/test_libbdsg
/usr/bin/ld: attempted static link of dynamic object `lib/libbdsg.so'
collect2: error: ld returned 1 exit status

and a look at the lib dir shows that nothing has changed:

libbdsg.so
libdivsufsort64.so -> libdivsufsort64.so.3
libdivsufsort64.so.3 -> libdivsufsort64.so.3.0.1
libdivsufsort64.so.3.0.1
libdivsufsort.so -> libdivsufsort.so.3
libdivsufsort.so.3 -> libdivsufsort.so.3.0.1
libdivsufsort.so.3.0.1
libgtest_main.so
libgtest.so
libhandlegraph.a
libhandlegraph.so
libsdsl.so -> libsdsl.so.3
libsdsl.so.2.1.0
libsdsl.so.2.3.0
libsdsl.so.3 -> libsdsl.so.2.3.0
pkgconfig

Python bindings don't bind any methods that use std::vector

It looks like @ctcisar turned off binding for std::vector. That makes the bindings build, but Handle Graph API methods that use std::vector (like MutablePathHandleGraph::rewrite_segment) don't get any bindings generated for them.

De-deactivating std::vector in config.cfg triggers what I think is a Binder bug, or at least an incompatibility with something in our project: RosettaCommons/binder#100

We need to fix this, or write other API methods that let you do the same things without std::vector somehow.

We should be able to install via `pip` without checking the `python3` on the PATH

Jonas managed to produce this:

    running build_ext
    cmake /private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg -DRUN_DOXYGEN=OFF -DPYTHON_EXECUTABLE=/usr/local/opt/python/bin/python3.7 -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg/build/lib.macosx-10.14-x86_64-3.7 -DCMAKE_BUILD_TYPE=Release
...
    CMake Error at CMakeLists.txt:182 (message):
      Python version mismatch: CMake wants to build for Python 3.7.6 at
      /usr/local/opt/python/bin/python3.7 but `python3` is Python 3.5.3 at
      /usr/local/bin/python3.  You will not be able to import the module in the
      current Python! To use the version CMake selected, run the build in a
      virtualenv with that Python version activated.  To use the version on your
      PATH, restart the build with -DPYTHON_EXECUTABLE=/usr/local/bin/python3 on
      the command line.

If -DPYTHON_EXECUTABLE is explicitly set, or maybe if some new CMake option is set, we should let setup.py bypass that check, since under setup.py we are pretty sure we have the right python.

missing Python.h on system where it is present

I'm unable to build this using cmake.

/home/erik/libbdsg/bin/_deps/pybind11-src/include/pybind11/detail/common.h:112:10: fatal error: Python.h: No such file or directory

However, I am able to build odgi, which includes pybind11 bindings and depends on Python.h.

What are the dependencies for libbdsg?

Add magic numbers to the serialize/deserialize methods

It would be handy to be able to identify the files' types. Serialize would prefix each file with some constant bytes for each graph type, and deserialize would skip over them.

We could also maybe add a method to SerializableHandleGraph that would return the magic number that the class uses, to let larger programs sniff types auto-magically.

VG's IO system now has support for ID-ing non-VPKG files by magic number, so this would help compatibility with vg.

Overlays don't expose path metadata methods

I added a bunch of path metadata methods in #153 but I never added the stubs to call through to the backing graph in the overlays. So I think right now when you call these methods on e.g. PackedPathPositionOverlay you end up with the default implementations from PathMetadata and you won't use any actually-efficient implementations from the backing graph.

Python bindings install with pip raises legacy-install-failure

Hello, and firstly thanks for this nice software.

Installation of the lib was flawless, however I'm currently having a hard time installing python bindings with pip. It throws legacy-install-failure each time.

I checked dependencies and installed them, tried the pip/setuptools/wheel update (as it's frequently mentioned when encountering this kind of issues) ; nonetheless, it didn't helped a lot. I also tried different versions of python (3.7, 3.10 and 3.11).

      -- Found PythonInterp: /scratch/.env/bin/python3.10 (found version "3.10.8")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Looking for C++ include cstdio
      -- Looking for C++ include cstdio - found
      -- Found Doxygen: /scratch/.env/bin/doxygen (found version "1.9.5") found components: doxygen missing components: dot
      libhandlegraph is add_subdirectory project
      -- Found PythonLibs: /scratch/.env/lib/libpython3.10.so
      -- pybind11 v2.4.dev4
      -- Performing Test HAS_FLTO
      -- Performing Test HAS_FLTO - Success
      -- LTO enabled
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/build/temp.linux-x86_64-cpython-310
      cmake --build . --config Release -- -j2
      [  0%] Building CXX object bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o
      [  0%] Creating directories for 'binder'
      [  0%] Performing download step (git clone) for 'binder'
      Cloning into 'binder'...
      In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:8:0,
                       from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/types.hpp:25:15: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
       typedef nid_t id_t;
                     ^
      In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:0:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp: In member function ‘bool handlegraph::HandleGraph::for_each_edge(const Iteratee&, bool) const’:
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:65: error: expected primary-expression before ‘)’ token
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                       ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:68: error: expected primary-expression before ‘]’ token
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                          ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:70: error: expected primary-expression before ‘const’
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                            ^
      /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:97: error: expected unqualified-id before ‘bool’
           return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
                                                                                                       ^
      gmake[2]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o] Error 1
      gmake[1]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/all] Error 2
      gmake[1]: *** Waiting for unfinished jobs....
      HEAD is now at 788ab42... Merge pull request #99 from adamnovak/patch-1
      [  0%] Performing update step for 'binder'
      [  0%] No patch step for 'binder'
      [  0%] No configure step for 'binder'
      [  0%] No build step for 'binder'
      [  5%] Performing install step for 'binder'
      [  5%] Completed 'binder'
      [  5%] Built target binder
      gmake: *** [all] Error 2
      error: command '/scratch/.env/bin/cmake' failed with exit code 2
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> bdsg

Thanks by advance!
Best regards.

Needs to be pip installable

Benedict tried to install libbdsg to do a graph manipulation task and it was too hard.

Overlays don't work well with haplotype paths

Right now, overlays like PackedPathPositionOverlay will index all the paths that for_each_path_handle() returns.

For backward compatibility, I have for_each_path_handle() omitting haplotype paths, at least for graphs like GBWTGraph where there are thousands of them, and you need to use the more advanced PathMetadata search methods to enumerate haplotype paths.

But this means that you can put a PackedPathPositionOverlay over a GBWTGraph, get a handle to a haplotype path by name, but then ask about positions on it when it hasn't actually been indexed during construction of the overlay. This in turn is going to break e.g. vg inject into a haplotype path.

The overlays need to be modified to handle haplotype paths. Either they need to be enumerated and processed (which is probably too slow), or they need to be detected and excluded (which is kind of useless because people probably want to be able to use them) or they need to be lazily indexed by the overlays (which might be inefficient but at least avoids indexing thousands of samples to inject into one).

How to extract most likely position of series of nodes

Hello,
I've imported a pg graph in python using the bdsg module. I'm processing a series of alignments to the graph itself.
For practical reasons I'm processing them as gaf alignments generated with vg convert -G.
For each read, I've got a path represented in the >47102051>47102052>47102053 format. For these I can extract all the possible positions for each node on every path. However, this is quite impractical when it comes to defining the most likely contiguous set of intervals. Is there a way to extract this type of information based on this information?
For example, if node 47102051 can come from "chr1:0-10" and "chr1:50-70", and node 47102052 comes from chr1:11-24, then the interval succession is likely to be: chr1:0-10 > chr1:11-24. Not sure if I'm explaining my problems clearly, but I hope it makes sense.

Thank you in advance,
Andrea

Add low level methods for handle graphs

I've been using paths quite a lot in bdsg and found that a couple methods would simplify things:

pop_path_front/back
rename_path

Based on the existing methods prepend_step and append_step, pop should be equivalently complex for each implementation. As it is now I am using this to "pop" elements:

        auto begin = graph.path_begin(p);
        auto next = graph.get_next_step(begin);
        graph.rewrite_segment(begin, next, vector<handle_t>());

And for the path renaming, I copy the entire path and delete the old one, so it would be cool if there was a copy-free method.

Looks for DYNAMIC in the wrong(?) place

The build system looks for DYANMIC as "dynamic.hpp" and not <dynamic/dynamic.hpp> where the vg build system installs it. The Makefile also ignores any CXXFLAGS trying to tell it to use -I options (or to use a different standard library, or anything else that might be needed).

Overlay helper get() ought to be const

libbdsg/include/bdsg/overlays/overlay_helper.hpp

Line 56 in 866af3e

T* get() {

If I have a class that contains an overlay helper member, I ought to be able to call get() on the overlay helper from a const method. It might be a different, const overload of get() that returns a const pointer.

The ODGI graph does not pass the handle graph tests ported over from VG

I haven't looked into this deeply yet, but we should figure out whether this is the fault of the tests or the implementation, or perhaps a disagreement over the generality of our formalism.

Currently, I've commented out ODGI in these tests so that they would still run for the other graphs.

Graph implementations in libbdsg use copy constructors instead of move during vector reallocations

The problem seems to arise from not using noexcept in the constructors:

https://stackoverflow.com/questions/8001823/how-to-enforce-move-semantics-when-a-vector-grows

We may want to make this a requirement for the handle graph interface as well.

OverlayHelper ought to be apply()-able on const graphs

If I have a const HandleGraph*, I ought to be able to put an overlay over it.

Maybe this means we need separate ConstOverlayHelpers?

divide_handle() doesn't properly update paths in PackedGraph

attn: @jeizenga

See the breaking unit test in vgteam/vg#3172

Example of using bdsg.bdsg.PositionOverlay

Hello,
sorry for opening many issues. I'm trying to work out the problem explained here. In that topic, I've been suggested to use bdsg.bdsg.PositionOverlay() to perform efficient lookup. I'm struggling to understand how to use it though, would it be possible to provide some detailing/example of those?
Thank you in advance
Andrea

reinterpret_cast should maybe be static_cast

Mac Clang is complaining we don't know how to cast right:


src/path_position_overlays.cpp:431:16: warning: 'reinterpret_cast' to class 'handlegraph::MutablePathDeletableHandleGraph *' from its virtual base 'handlegraph::PathHandleGraph *' behaves differently from 'static_cast' [-Wreinterpret-base-class]
        return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/path_position_overlays.cpp:431:16: note: use 'static_cast' to adjust the pointer correctly while downcasting
        return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
               ^~~~~~~~~~~~~~~~
               static_cast

We could get into trouble with pointer offsets depending on how any given compiler is doing virtual base classes.

PackedGraph::get_edge_count() doesn't work

In particular

template<typename Backend>
size_t BasePackedGraph<Backend>::get_edge_count() const {
    // each edge (except reversing self edges) are stored twice in the edge vector
    return (edge_lists_iv.size() / EDGE_RECORD_SIZE + reversing_self_edge_records - deleted_edge_records) / 2;
}

doesn't necessarily return the same answer as the default (and correct) implementation

size_t HandleGraph::get_edge_count() const {
    size_t total = 0;
    for_each_edge([&](const edge_t& ignored) {
        total++;
    });
    return total;
};

Here's an example:

wget -q http://public.gi.ucsc.edu/~hickey/debug/chr12.clip.vg
vg stats -E chr12.clip.vg
1116343
vg convert -a chr12.clip.vg | vg stats -E -
1116345

I noticed this because vg clip uses get_edge_count() to make an array, which leads to an exception when the array is too small on this graph

vg clip -d 1 -P galGal6 chr12.clip.vg > chr12.clean.vg


#6    Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e3281, in vg::clip_low_depth_nodes_and_edges(handlegraph::MutablePathMutableHandleGraph*, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
      Source "src/clip.cpp", line 641, in clip_low_depth_nodes_and_edges [0x5570c70e3281]
#5    Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e2532, in vg::clip_low_depth_nodes_and_edges_generic(handlegraph::MutablePathMutableHandleGraph*, std::function<void (std::function<void (handlegraph::handle_t, vg::Region const*)>)>, std::function<void (std::function<void (std::pair<handlegraph::handle_t, handlegraph::handle_t>, vg::Region const*)>)>, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
      Source "src/clip.cpp", line 538, in clip_low_depth_nodes_and_edges_generic [0x5570c70e2532]

This is has likely been causing a few people crashes with minigraph-cactus, ex ComparativeGenomicsToolkit/cactus#1001, but this is the first time I've been able to reproduce myself.

attn @jeizenga

Too hard to use as an ExternalProject

@rlorigro tried to use libbdsg as a CMake ExternalProject.

It didn't really work; the ExternalProject dependencies that libbdsg pulls in weren't being installed alongside libbdsg itself when Ryan's CMake project installed libbdsg, and headers and libraries were missing.

The way he actually made libbdsg work as an ExternalProject dependency was to depend on libdsg_easy and make IMPORTED libraries for all the transitive dependency libraries:

https://github.com/rlorigro/GetBlunted/blob/aae419d3c2c338d7722cd4de585e80bfad479f9b/CMakeLists.txt#L145-L198

We need to devise a working tutorial for depending on libbdsg via CMake ExternalProject, ideally not using libbdsg_easy, and add it to the docs.

Question on extracted path with bdsg

Hello,
I'm trying to write a script that extract the genome-specific sequences using bdsg in python as suggested here https://github.com/vgteam/vg/issues/3153. I've managed to load a pg graph, extract the paths in there and separate the paths from the reference genome from the others:

pg = bdsg.bdsg.PackedGraph()
pg.deserialize( 'ingenome.pg' )
node_count = pg.get_node_count()
edge_count = pg.get_edge_count()
path_count = pg.get_path_count()

print('Found:')
print( ' - {} nodes\n - {} edges\n - {} paths'.format( node_count, edge_count, path_count ) )

# Get paths
path_handle = []
pg.for_each_path_handle(lambda y: path_handle.append(y) or True)
path_names = [pg.get_path_name(i) for i in path_handle ]

# Get target paths
target_paths = [ i for i in path_names if args.target_genome in i ]
target_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]

# Get query paths
query_paths = [ i for i in path_names if args.target_genome in i ]
query_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]

I've also seen that I can extract the nodes in a path of interest using the approach:

handles = []
pg.for_each_step_in_path(query_path_handle[0], lambda y: handles.append(pg.get_handle_of_step(y)) or True)

And their ids with:

nodes_ids = [pg.get_id(i) for i in handles]

My question is pretty straightforward: am I right that the list of nodes generated by for_each_step_in_path() is ordered by position in the path? So that the first element in the list is the first node and therefore the initial position is 0, whereas the second node will be 0 plus the sum of the preceeding nodes' length.

Thanks in advance
Andrea

Building a big `bdsg::yomo::UniqueMappedPointer<bdsg::MappedIntVector>` and then saving it causes total chain size reported by `get_usage()` to increase

@xchang1 says that saving to a file bu filename/FD after building a big MappedIntVector in memory causes the object to be larger on disk and after load then it was before it was saved.

Docs don't seem to properly list C++ methods anymore

The libbdsg ReadTheDocs docs, separate from the vg Doxygen docs, have a Doxygen-derived listing of C++ methods for the handle graph API, at https://bdsg.readthedocs.io/en/master/rst/cppapi.html

Even in the last successful docs build, this seems to have been broken. The Doxygen output didn't make it to Sphinx properly to build the web page.

We should fix it so the stuff included over from Doxygen shows up on the page again.

Vectoizable Overlays assume that graph iteration order is consistent and thus break vg pack

The vectorizing overlays loop through all the nodes and edges in the graph, and treat that as the vectorized order.

The problem is that:

The overlay calls for_each_handle on the underlying graph twice, and assumes that the iteration order will be the same both times. If the graph doesn't have an internal concept of ordering, I'm not sure we can rely on this.
If you save and load the graph, the overlay will run again and create a possibly different vectorization. This causes trouble when using e.g. vg pack on a HashGraph. The pack file it produces uses the vectorization of the graph as coordinates, but next time you load the HashGraph you might not necessarily populate the hash table in the same way, and you might get a different iteration order, leading to a different vectorization.

The overlay needs to be made to always produce the same vectorization for the same graph contents (i.e. sort everything by node ID), or we need to bring the vectorization along with things that depend on it like pack files. Or, we need to enforce that graphs at least have a stable iteration order across serialization, even if they aren't themselves vectorizable.

@jeizenga Does this make sense to you?

HashGraph allows duplicate edges when it shouldn't.

create_edge is supposed to "ignore" existing edges, which to me means that if you add the same edge twice it exists only once.

https://github.com/vgteam/libhandlegraph/blob/e2c45aae9a19a14176e3dc412b6ebc038385a31b/src/include/handlegraph/mutable_handle_graph.hpp#L30-L31

But HashGraph seems to accept any number of duplicates of an edge. There's no pre-existence check:

libbdsg/src/hash_graph.cpp

Lines 37 to 41 in 7b913da

 void HashGraph::create_edge(const handle_t& left, const handle_t& right) { 

 if (get_is_reverse(left)) { 

 graph[get_internal_id(left)].left_edges.push_back(right); 

 }

HashGraph should deduplicate edges.

python documentation largely incomplete

A python API documentation (on readthedoc.io) is linked to, but it's basically empty.

We might want to remove this until it's completed.

Deleting steps while looping through the steps on a node

I want to loop over the steps on a node with HashGraph::for_each_step_on_handle_impl and delete their corresponding paths with HashGraph::destroy_path. However, if I do this, the reordering of the steps on the node that the delete does will not be picked up on by the iteration, and I won't necessarily visit all the steps on the node.

Either we need to make this work, or we need to specify that delete operations are not permitted during iterations over things they might change.

Question on extracting non-reference nodes

Hello,
some time ago I've opened a issue on the vg page asking suggestions on how to extract non-reference nodes (see here).
I've been working on a script, and I'd just like to confirm that I'm not doing anything wrong.
Given a genome genome1, I extract the node ids into the paths for that genome:

    print("Getting reference paths")
    reference_path_handle = [ i for i in path_handle if args.reference_genome in pg.get_path_name(i) ]
    ## Get nodes in reference
    print("Getting reference node ids")
    ref_nodes_ids = id_of_nodes_in_path(pg, reference_path_handle)

Where id_of_nodes_in_path is:

# Get node ids 
def id_of_nodes_in_path(pg, my_path_handle):
    nodes_ids = []
    for path in my_path_handle:
        pg.for_each_step_in_path( path, 
            lambda y: nodes_ids.append(pg.get_id(pg.get_handle_of_step(y))) or True)
    return set(nodes_ids)

I then process the nodes in the remaining nodes and check whether they are into the ref_nodes_ids list:

    print("Getting query paths")
    query_path_handle = [ i for i in path_handle if args.reference_genome not in pg.get_path_name(i) ]

    # Get specific nodes:
    print("Getting query-specific nodes")
    intervals = non_ref_nodes_and_pos_path(pg, query_path_handle, ref_nodes_ids)

Where non_ref_nodes_and_pos_path is:

def non_ref_nodes_and_pos_path(pg, my_path_handles, my_ref_nodes):
    intervals = []
    for path_handle in my_path_handles:
        path_id = pg.get_path_name(path_handle)
        bpi = 0 
        bpe = 0
        strand = ''
        handles = []
        # extract handles for each step into a path
        pg.for_each_step_in_path( path_handle, 
            lambda y: handles.append(pg.get_handle_of_step(y)) or True)
        # get sequence, node id and strand for each handle in path 
        for handle in handles:
            seq = pg.get_sequence(handle)        
            node_id = pg.get_id(handle)
            strand = "-" if pg.get_is_reverse(handle) else "+"
            bpe = bpi + len(seq)
            # Check if a node ID is into the reference node list, if not add to intervals
            if node_id not in my_ref_nodes:
                intervals.append([path_id, bpi, bpe, node_id, strand, seq])
            bpi = bpe # Update initial position
    return intervals

Does this make sense?

Thanks for your help

Andrea

	const static handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity, net_handle_record_t type, size_t node_offset=0) {
	if (pointer > ((size_t)1 << (64-BITS_FOR_TRIVIAL_NODE_OFFSET-3-4))-1) {
	throw runtime_error("error: don't have space in net handle for record offset");
	}
	net_handle_t handle = as_net_handle((((((pointer << BITS_FOR_TRIVIAL_NODE_OFFSET) \| node_offset) << 4) \| connectivity)<<3) \| type);
	return handle;

	}
	handlegraph::net_handle_t get_net_handle(size_t pointer, connectivity_t connectivity) const {
	net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type();
	size_t node_record_offset = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL \|\|
	SnarlTreeRecord(pointer, &snarl_tree_records).get_record_type() == SIMPLE_SNARL ? 1 : 0;
	return get_net_handle(pointer, connectivity, type, node_record_offset);

	}
	handlegraph::net_handle_t get_net_handle(size_t pointer) const {
	net_handle_record_t type = SnarlTreeRecord(pointer, &snarl_tree_records).get_record_handle_type();
	return get_net_handle(pointer, START_END, type);

	}

	bool PositionOverlay::for_each_step_on_handle_impl(const handle_t& handle,
	const function<bool(const step_handle_t&)>& iteratee) const {
	return get_graph()->for_each_step_on_handle(handle, iteratee);
	}

	void HashGraph::create_edge(const handle_t& left, const handle_t& right) {

	if (get_is_reverse(left)) {
	graph[get_internal_id(left)].left_edges.push_back(right);
	}