vgteam / libbdsg Goto Github PK
View Code? Open in Web Editor NEWOptimized sequence graph implementations for graph genomics
License: MIT License
Optimized sequence graph implementations for graph genomics
License: MIT License
We need a GFA-to-HandleGraph parser available in the Python library, so people can read GFA and not need vg itself as a pre-importer.
We can still reject overlaps.
Here's an example:
handle_t handle = g->create_handle("TTATATTCCAACTCTCTG");
// Should get (C,AGAG,AGTTG,GAATATAA)
auto parts = g->divide_handle(g->flip(handle), {1, 5, 10});
vg and odgi give the expected division
Part 0 5:1 seq=C
Part 1 4:1 seq=AGAG
Part 2 3:1 seq=AGTTG
Part 3 2:1 seq=GAATATAA
HashGraph gives something else:
Part 0 4:1 seq=CAGAGAGTTG
Part 1 3:1 seq=CAGAG
Part 2 2:1 seq=C
Part 3 1:1 seq=AGAGAGTTGGAATATAA
And PackedGraph enters an infinite loop before producing any output
A mention in the license, README, and source files would be great!
Bonus points: https://reuse.software/spec/
Thanks!
We should do something in setup.py to selectively fail the wheel build on Mac, so it falls back on non-wheel cmake-install-directly-to-the-final-path installation, which might work.
Or, we should fix the Mac build so the collection of dylibs we are left with after a CMake build can find each other and are relocatable.
attn @jeizenga
See the breaking unit test added in vgteam/vg#3172
The major bottleneck to replacing xg+gbwt with gbz for some workflows is the path position interface. XG provides a fast native version, but gbz would fall back on the overlay. And the path position overlay, which works fine once built, takes too long to create for whole-genome graphs.
So what's needed is:
@adamnovak seems interested in implementing this ..
I made a mistake setting the resizing factor of the hash tables in PackedGraph, which it handled too gracefully for me to notice. I want to test out the fix's effects before I decide that it's better than just not fixing it.
The ReadTheDocs builds have been failing for a while with:
ImportError: overloading a method with both static and instance methods is not supported; error while attempting to bind instance method SnarlDistanceIndex.get_net_handle(self: bdsg.bdsg.SnarlDistanceIndex, pointer: int, connectivity: bdsg.bdsg.SnarlDistanceIndex.connectivity_t) -> bdsg.handlegraph.net_handle_t
I think the code that makes the Python bindings can't handle having both static and non-static methods with the same name, which we do here:
libbdsg/bdsg/include/bdsg/snarl_distance_index.hpp
Lines 689 to 708 in 5391889
@xchang1 Can you rename the static version of the method and see if the ReadTheDocs builds start working again? If you make a ReadTheDocs account I can give you access to the project to get the logs.
Given the many other projects using the "sglib" name, we're considering renaming this one.
Also, we have the same problem as HTSlib where we end up with libsglib.a
and it looks silly.
Names under consideration so far include:
I was never able to run the unit tests of libbdsg as I don't have a setup to build them. For #20, I just pasted everything inside vg, worked and tested from there, then moved it back out for the PR. Anyway, it'd be nice if Travis somehow managed to build and test PRs.
The odgi project's "odgi" graph is now substantially different than the libbdsg version, and they've never quite been serialized-data compatible, and the odgi
tool wants to use GFA for interchange anyway. Also, libbdsg's odgi has outstanding issues (#19) and isn't really maintained, because all the real odgi graph data structure development happens in the odgi project.
We should deprecate libbdsg's odgi graph implementation.
If I do
hashgraph->increment_node_ids(k);
copy_graph(hashgraph, target);
I get a segfault (same code works with packed graph).
But if I do
hashgraph->increment_node_ids(k);
hasgraph->reassign_node_ids([](nid_t node_id) { return node_id; });
copy_graph(hashgraph, target);
it's okay. To reproduce (in vg), just comment out the reassign_node_ids()
call in vg concat
here: https://github.com/vgteam/vg/pull/2692/files#diff-4178eec0eda65c2fd8385e1bd8f03d57R97-R98
and run the test
cd test ; prove -v t/09_vg_concat.t
For surjection, we work with path position overlays for a restricted subset of paths in the backing graph. But searching for all the steps on a handle can involve thinking about all the paths actually in the backing graph, and checking each of them. We would like to be able to just get the reference path steps on the handle efficiently.
One way to do this might be to re-implement for_each_step_on_handle_impl
based on the information in the overlay that knows about the subsetting of the paths.
libbdsg/bdsg/src/path_position_overlays.cpp
Lines 149 to 152 in 06ab14b
@glennhickey says that, with the files in /public/home/hickey/dev/work/chicken-pangenome/debug
on Courtyard, you can do a deconstruct run in about 32 minutes on 64 threads, including snarl build, when using the xg. But when using just the GBZ, when I od the equivalent of
deconstruct ./chicken-pg-sep23-new.gbz -t 64 -r ./chicken-pg-sep23-new.snarls
Then it crashes complaining that OpenMP can't make enough threads. When I cut the thread count to 16, it takes probably more than 2 hours.
The overlay build seems relatively fast; something important is probably still slow here and needs to be sped up. But first we need to collect actually comparable runtimes and probably profiles from two matched conditions with a consistent number of threads and a consistent choice of generating vs. loading snarls.
After updating the submodules to versions that support M1 Mac and its Clang, when you build and run the tests, test_paged_vector<PagedVector<5, MappedBackend>>();
fails with:
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Attempted to refer across chains!
It looks like the problem is in the assignment operator for the yomo::Pointer<>
, and it's trying to point from normal memory into a yomo
memory chain. I tried hacking it to allow yomo::Pointer<>
to work from non-yomo
memory, but that led to a different failure further along in the test, so I think that really we have a mistake somewhere which GCC/Linux doesn't notice.
Specifying BUILD_STATIC=ON
doesn't actually create static libs when using CMake:
ExternalProject_Add(project_bdsg
GIT_REPOSITORY https://github.com/vgteam/libbdsg.git
PREFIX ${CMAKE_SOURCE_DIR}/external/bdsg/
CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_SOURCE_DIR}/external/bdsg/ -DBUILD_STATIC=ON -DRUN_DOXYGEN=OFF -DBUILD_PYTHON_BINDINGS=OFF
BUILD_IN_SOURCE True
INSTALL_COMMAND make install
)
results in:
[100%] Linking CXX executable bin/test_libbdsg
/usr/bin/ld: attempted static link of dynamic object `lib/libbdsg.so'
collect2: error: ld returned 1 exit status
and a look at the lib
dir shows that nothing has changed:
libbdsg.so
libdivsufsort64.so -> libdivsufsort64.so.3
libdivsufsort64.so.3 -> libdivsufsort64.so.3.0.1
libdivsufsort64.so.3.0.1
libdivsufsort.so -> libdivsufsort.so.3
libdivsufsort.so.3 -> libdivsufsort.so.3.0.1
libdivsufsort.so.3.0.1
libgtest_main.so
libgtest.so
libhandlegraph.a
libhandlegraph.so
libsdsl.so -> libsdsl.so.3
libsdsl.so.2.1.0
libsdsl.so.2.3.0
libsdsl.so.3 -> libsdsl.so.2.3.0
pkgconfig
It looks like @ctcisar turned off binding for std::vector
. That makes the bindings build, but Handle Graph API methods that use std::vector
(like MutablePathHandleGraph::rewrite_segment
) don't get any bindings generated for them.
De-deactivating std::vector
in config.cfg
triggers what I think is a Binder bug, or at least an incompatibility with something in our project: RosettaCommons/binder#100
We need to fix this, or write other API methods that let you do the same things without std::vector somehow.
Jonas managed to produce this:
running build_ext
cmake /private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg -DRUN_DOXYGEN=OFF -DPYTHON_EXECUTABLE=/usr/local/opt/python/bin/python3.7 -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/private/var/folders/zy/0_dym7ln2b9bc0ywr4by65jr0000gn/T/pip-install-e9mb__6t/bdsg/build/lib.macosx-10.14-x86_64-3.7 -DCMAKE_BUILD_TYPE=Release
...
CMake Error at CMakeLists.txt:182 (message):
Python version mismatch: CMake wants to build for Python 3.7.6 at
/usr/local/opt/python/bin/python3.7 but `python3` is Python 3.5.3 at
/usr/local/bin/python3. You will not be able to import the module in the
current Python! To use the version CMake selected, run the build in a
virtualenv with that Python version activated. To use the version on your
PATH, restart the build with -DPYTHON_EXECUTABLE=/usr/local/bin/python3 on
the command line.
If -DPYTHON_EXECUTABLE
is explicitly set, or maybe if some new CMake option is set, we should let setup.py
bypass that check, since under setup.py
we are pretty sure we have the right python.
I'm unable to build this using cmake.
/home/erik/libbdsg/bin/_deps/pybind11-src/include/pybind11/detail/common.h:112:10: fatal error: Python.h: No such file or directory
However, I am able to build odgi, which includes pybind11 bindings and depends on Python.h.
What are the dependencies for libbdsg?
It would be handy to be able to identify the files' types. Serialize would prefix each file with some constant bytes for each graph type, and deserialize would skip over them.
We could also maybe add a method to SerializableHandleGraph that would return the magic number that the class uses, to let larger programs sniff types auto-magically.
VG's IO system now has support for ID-ing non-VPKG files by magic number, so this would help compatibility with vg.
I added a bunch of path metadata methods in #153 but I never added the stubs to call through to the backing graph in the overlays. So I think right now when you call these methods on e.g. PackedPathPositionOverlay
you end up with the default implementations from PathMetadata
and you won't use any actually-efficient implementations from the backing graph.
Hello, and firstly thanks for this nice software.
Installation of the lib was flawless, however I'm currently having a hard time installing python bindings with pip. It throws legacy-install-failure
each time.
I checked dependencies and installed them, tried the pip/setuptools/wheel update (as it's frequently mentioned when encountering this kind of issues) ; nonetheless, it didn't helped a lot. I also tried different versions of python (3.7, 3.10 and 3.11).
-- Found PythonInterp: /scratch/.env/bin/python3.10 (found version "3.10.8")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Looking for C++ include cstdio
-- Looking for C++ include cstdio - found
-- Found Doxygen: /scratch/.env/bin/doxygen (found version "1.9.5") found components: doxygen missing components: dot
libhandlegraph is add_subdirectory project
-- Found PythonLibs: /scratch/.env/lib/libpython3.10.so
-- pybind11 v2.4.dev4
-- Performing Test HAS_FLTO
-- Performing Test HAS_FLTO - Success
-- LTO enabled
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/build/temp.linux-x86_64-cpython-310
cmake --build . --config Release -- -j2
[ 0%] Building CXX object bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o
[ 0%] Creating directories for 'binder'
[ 0%] Performing download step (git clone) for 'binder'
Cloning into 'binder'...
In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:8:0,
from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/types.hpp:25:15: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
typedef nid_t id_t;
^
In file included from /tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/handle.cpp:1:0:
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp: In member function ‘bool handlegraph::HandleGraph::for_each_edge(const Iteratee&, bool) const’:
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:65: error: expected primary-expression before ‘)’ token
return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
^
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:68: error: expected primary-expression before ‘]’ token
return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
^
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:70: error: expected primary-expression before ‘const’
return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
^
/tmp/pip-install-u5q46l2o/bdsg_ea729c3a17ea4a8fa6bdfc176c2e68bf/bdsg/deps/libhandlegraph/src/include/handlegraph/handle_graph.hpp:243:97: error: expected unqualified-id before ‘bool’
return for_each_handle((std::function<bool(const handle_t&)>)[&](const handle_t& handle) -> bool {
^
gmake[2]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/src/handle.cpp.o] Error 1
gmake[1]: *** [bdsg/deps/libhandlegraph/CMakeFiles/handlegraph_objs.dir/all] Error 2
gmake[1]: *** Waiting for unfinished jobs....
HEAD is now at 788ab42... Merge pull request #99 from adamnovak/patch-1
[ 0%] Performing update step for 'binder'
[ 0%] No patch step for 'binder'
[ 0%] No configure step for 'binder'
[ 0%] No build step for 'binder'
[ 5%] Performing install step for 'binder'
[ 5%] Completed 'binder'
[ 5%] Built target binder
gmake: *** [all] Error 2
error: command '/scratch/.env/bin/cmake' failed with exit code 2
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
× Encountered error while trying to install package.
╰─> bdsg
Thanks by advance!
Best regards.
Benedict tried to install libbdsg to do a graph manipulation task and it was too hard.
Right now, overlays like PackedPathPositionOverlay
will index all the paths that for_each_path_handle()
returns.
For backward compatibility, I have for_each_path_handle()
omitting haplotype paths, at least for graphs like GBWTGraph where there are thousands of them, and you need to use the more advanced PathMetadata
search methods to enumerate haplotype paths.
But this means that you can put a PackedPathPositionOverlay
over a GBWTGraph
, get a handle to a haplotype path by name, but then ask about positions on it when it hasn't actually been indexed during construction of the overlay. This in turn is going to break e.g. vg inject
into a haplotype path.
The overlays need to be modified to handle haplotype paths. Either they need to be enumerated and processed (which is probably too slow), or they need to be detected and excluded (which is kind of useless because people probably want to be able to use them) or they need to be lazily indexed by the overlays (which might be inefficient but at least avoids indexing thousands of samples to inject into one).
Hello,
I've imported a pg graph in python using the bdsg module. I'm processing a series of alignments to the graph itself.
For practical reasons I'm processing them as gaf alignments generated with vg convert -G
.
For each read, I've got a path represented in the >47102051>47102052>47102053
format. For these I can extract all the possible positions for each node on every path. However, this is quite impractical when it comes to defining the most likely contiguous set of intervals. Is there a way to extract this type of information based on this information?
For example, if node 47102051
can come from "chr1:0-10" and "chr1:50-70", and node 47102052
comes from chr1:11-24, then the interval succession is likely to be: chr1:0-10 > chr1:11-24. Not sure if I'm explaining my problems clearly, but I hope it makes sense.
Thank you in advance,
Andrea
I've been using paths quite a lot in bdsg and found that a couple methods would simplify things:
pop_path_front/back
rename_path
Based on the existing methods prepend_step
and append_step
, pop should be equivalently complex for each implementation. As it is now I am using this to "pop" elements:
auto begin = graph.path_begin(p);
auto next = graph.get_next_step(begin);
graph.rewrite_segment(begin, next, vector<handle_t>());
And for the path renaming, I copy the entire path and delete the old one, so it would be cool if there was a copy-free method.
The build system looks for DYANMIC as "dynamic.hpp"
and not <dynamic/dynamic.hpp>
where the vg build system installs it. The Makefile also ignores any CXXFLAGS trying to tell it to use -I
options (or to use a different standard library, or anything else that might be needed).
If I have a class that contains an overlay helper member, I ought to be able to call get()
on the overlay helper from a const
method. It might be a different, const
overload of get()
that returns a const
pointer.
I haven't looked into this deeply yet, but we should figure out whether this is the fault of the tests or the implementation, or perhaps a disagreement over the generality of our formalism.
Currently, I've commented out ODGI in these tests so that they would still run for the other graphs.
The problem seems to arise from not using noexcept
in the constructors:
https://stackoverflow.com/questions/8001823/how-to-enforce-move-semantics-when-a-vector-grows
We may want to make this a requirement for the handle graph interface as well.
If I have a const HandleGraph*
, I ought to be able to put an overlay over it.
Maybe this means we need separate ConstOverlayHelpers?
attn: @jeizenga
See the breaking unit test in vgteam/vg#3172
Hello,
sorry for opening many issues. I'm trying to work out the problem explained here. In that topic, I've been suggested to use bdsg.bdsg.PositionOverlay()
to perform efficient lookup. I'm struggling to understand how to use it though, would it be possible to provide some detailing/example of those?
Thank you in advance
Andrea
Mac Clang is complaining we don't know how to cast right:
src/path_position_overlays.cpp:431:16: warning: 'reinterpret_cast' to class 'handlegraph::MutablePathDeletableHandleGraph *' from its virtual base 'handlegraph::PathHandleGraph *' behaves differently from 'static_cast' [-Wreinterpret-base-class]
return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/path_position_overlays.cpp:431:16: note: use 'static_cast' to adjust the pointer correctly while downcasting
return reinterpret_cast<MutablePathDeletableHandleGraph*>(graph);
^~~~~~~~~~~~~~~~
static_cast
We could get into trouble with pointer offsets depending on how any given compiler is doing virtual base classes.
In particular
template<typename Backend>
size_t BasePackedGraph<Backend>::get_edge_count() const {
// each edge (except reversing self edges) are stored twice in the edge vector
return (edge_lists_iv.size() / EDGE_RECORD_SIZE + reversing_self_edge_records - deleted_edge_records) / 2;
}
doesn't necessarily return the same answer as the default (and correct) implementation
size_t HandleGraph::get_edge_count() const {
size_t total = 0;
for_each_edge([&](const edge_t& ignored) {
total++;
});
return total;
};
Here's an example:
wget -q http://public.gi.ucsc.edu/~hickey/debug/chr12.clip.vg
vg stats -E chr12.clip.vg
1116343
vg convert -a chr12.clip.vg | vg stats -E -
1116345
I noticed this because vg clip
uses get_edge_count()
to make an array, which leads to an exception when the array is too small on this graph
vg clip -d 1 -P galGal6 chr12.clip.vg > chr12.clean.vg
#6 Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e3281, in vg::clip_low_depth_nodes_and_edges(handlegraph::MutablePathMutableHandleGraph*, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
Source "src/clip.cpp", line 641, in clip_low_depth_nodes_and_edges [0x5570c70e3281]
#5 Object "/home/hickey/dev/vg/bin/vg", at 0x5570c70e2532, in vg::clip_low_depth_nodes_and_edges_generic(handlegraph::MutablePathMutableHandleGraph*, std::function<void (std::function<void (handlegraph::handle_t, vg::Region const*)>)>, std::function<void (std::function<void (std::pair<handlegraph::handle_t, handlegraph::handle_t>, vg::Region const*)>)>, long, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long, bool)
Source "src/clip.cpp", line 538, in clip_low_depth_nodes_and_edges_generic [0x5570c70e2532]
This is has likely been causing a few people crashes with minigraph-cactus
, ex ComparativeGenomicsToolkit/cactus#1001, but this is the first time I've been able to reproduce myself.
attn @jeizenga
@rlorigro tried to use libbdsg as a CMake ExternalProject.
It didn't really work; the ExternalProject dependencies that libbdsg pulls in weren't being installed alongside libbdsg itself when Ryan's CMake project installed libbdsg, and headers and libraries were missing.
The way he actually made libbdsg work as an ExternalProject dependency was to depend on libdsg_easy and make IMPORTED libraries for all the transitive dependency libraries:
We need to devise a working tutorial for depending on libbdsg via CMake ExternalProject, ideally not using libbdsg_easy, and add it to the docs.
Hello,
I'm trying to write a script that extract the genome-specific sequences using bdsg in python as suggested here https://github.com/vgteam/vg/issues/3153. I've managed to load a pg graph, extract the paths in there and separate the paths from the reference genome from the others:
pg = bdsg.bdsg.PackedGraph()
pg.deserialize( 'ingenome.pg' )
node_count = pg.get_node_count()
edge_count = pg.get_edge_count()
path_count = pg.get_path_count()
print('Found:')
print( ' - {} nodes\n - {} edges\n - {} paths'.format( node_count, edge_count, path_count ) )
# Get paths
path_handle = []
pg.for_each_path_handle(lambda y: path_handle.append(y) or True)
path_names = [pg.get_path_name(i) for i in path_handle ]
# Get target paths
target_paths = [ i for i in path_names if args.target_genome in i ]
target_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]
# Get query paths
query_paths = [ i for i in path_names if args.target_genome in i ]
query_path_handle = [ i for i in path_handle if args.target_genome in pg.get_path_name(i) ]
I've also seen that I can extract the nodes in a path of interest using the approach:
handles = []
pg.for_each_step_in_path(query_path_handle[0], lambda y: handles.append(pg.get_handle_of_step(y)) or True)
And their ids with:
nodes_ids = [pg.get_id(i) for i in handles]
My question is pretty straightforward: am I right that the list of nodes generated by for_each_step_in_path()
is ordered by position in the path? So that the first element in the list is the first node and therefore the initial position is 0, whereas the second node will be 0 plus the sum of the preceeding nodes' length.
Thanks in advance
Andrea
@xchang1 says that saving to a file bu filename/FD after building a big MappedIntVector in memory causes the object to be larger on disk and after load then it was before it was saved.
The libbdsg ReadTheDocs docs, separate from the vg Doxygen docs, have a Doxygen-derived listing of C++ methods for the handle graph API, at https://bdsg.readthedocs.io/en/master/rst/cppapi.html
Even in the last successful docs build, this seems to have been broken. The Doxygen output didn't make it to Sphinx properly to build the web page.
We should fix it so the stuff included over from Doxygen shows up on the page again.
The vectorizing overlays loop through all the nodes and edges in the graph, and treat that as the vectorized order.
The problem is that:
The overlay calls for_each_handle
on the underlying graph twice, and assumes that the iteration order will be the same both times. If the graph doesn't have an internal concept of ordering, I'm not sure we can rely on this.
If you save and load the graph, the overlay will run again and create a possibly different vectorization. This causes trouble when using e.g. vg pack
on a HashGraph
. The pack file it produces uses the vectorization of the graph as coordinates, but next time you load the HashGraph you might not necessarily populate the hash table in the same way, and you might get a different iteration order, leading to a different vectorization.
The overlay needs to be made to always produce the same vectorization for the same graph contents (i.e. sort everything by node ID), or we need to bring the vectorization along with things that depend on it like pack files. Or, we need to enforce that graphs at least have a stable iteration order across serialization, even if they aren't themselves vectorizable.
@jeizenga Does this make sense to you?
create_edge
is supposed to "ignore" existing edges, which to me means that if you add the same edge twice it exists only once.
But HashGraph seems to accept any number of duplicates of an edge. There's no pre-existence check:
Lines 37 to 41 in 7b913da
HashGraph should deduplicate edges.
A python API documentation (on readthedoc.io) is linked to, but it's basically empty.
We might want to remove this until it's completed.
I want to loop over the steps on a node with HashGraph::for_each_step_on_handle_impl
and delete their corresponding paths with HashGraph::destroy_path
. However, if I do this, the reordering of the steps on the node that the delete does will not be picked up on by the iteration, and I won't necessarily visit all the steps on the node.
Either we need to make this work, or we need to specify that delete operations are not permitted during iterations over things they might change.
Hello,
some time ago I've opened a issue on the vg page asking suggestions on how to extract non-reference nodes (see here).
I've been working on a script, and I'd just like to confirm that I'm not doing anything wrong.
Given a genome genome1
, I extract the node ids into the paths for that genome:
print("Getting reference paths")
reference_path_handle = [ i for i in path_handle if args.reference_genome in pg.get_path_name(i) ]
## Get nodes in reference
print("Getting reference node ids")
ref_nodes_ids = id_of_nodes_in_path(pg, reference_path_handle)
Where id_of_nodes_in_path
is:
# Get node ids
def id_of_nodes_in_path(pg, my_path_handle):
nodes_ids = []
for path in my_path_handle:
pg.for_each_step_in_path( path,
lambda y: nodes_ids.append(pg.get_id(pg.get_handle_of_step(y))) or True)
return set(nodes_ids)
I then process the nodes in the remaining nodes and check whether they are into the ref_nodes_ids
list:
print("Getting query paths")
query_path_handle = [ i for i in path_handle if args.reference_genome not in pg.get_path_name(i) ]
# Get specific nodes:
print("Getting query-specific nodes")
intervals = non_ref_nodes_and_pos_path(pg, query_path_handle, ref_nodes_ids)
Where non_ref_nodes_and_pos_path
is:
def non_ref_nodes_and_pos_path(pg, my_path_handles, my_ref_nodes):
intervals = []
for path_handle in my_path_handles:
path_id = pg.get_path_name(path_handle)
bpi = 0
bpe = 0
strand = ''
handles = []
# extract handles for each step into a path
pg.for_each_step_in_path( path_handle,
lambda y: handles.append(pg.get_handle_of_step(y)) or True)
# get sequence, node id and strand for each handle in path
for handle in handles:
seq = pg.get_sequence(handle)
node_id = pg.get_id(handle)
strand = "-" if pg.get_is_reverse(handle) else "+"
bpe = bpi + len(seq)
# Check if a node ID is into the reference node list, if not add to intervals
if node_id not in my_ref_nodes:
intervals.append([path_id, bpi, bpe, node_id, strand, seq])
bpi = bpe # Update initial position
return intervals
Does this make sense?
Thanks for your help
Andrea
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.