odenas / indexed_ms
Fast matching statistics in small space
License: GNU General Public License v3.0
One easy optimization would be to skip the repmaximality test when performing the parent operation from an already repmaximal node, since the new node is certainly repmaximal.
indexed_ms/fast_ms/src/range_queries.cpp
Line 127 in 8c9af47
As of March 31, fd_ms runs much slower due to the removal of operations on the BWT (such as the backward step). This was motivated by the equivalence of (v.i, v.j) with the interval of the word of locus v.
Revert the code to use the intervals and backward steps.
Is there a way to make the lazy_wl completely transparent, in the sense that you don't need to call the followup explicitly?
Why is the lazy mode not supported, while it is advertised as a parameter?
Running matching_stats_parallel.x -lazy_wl 1
results in the error message
ERROR from buiold_runs: lazy mode not supported
indexed_ms/fast_ms/src/range_queries.cpp
Line 120 in b862939
Try with mut_xxx strings where the mutation period is potentially larger than the block.
I also think we could report the time taken by the normal parent operation vs. the lazy parent operation when it comes after a Weiner link (the latter should be slower).
fixed-size chunks defined over the bit-vector
https://github.com/bingmann/malloc_count
used in papers which use parallelism:
https://users.dcc.uchile.cl/~jfuentess/thesis/files/sea2015.pdf
https://arxiv.org/pdf/1702.07578.pdf
Split the text into d equal-length pieces and compute the matching statistics on each piece independently. Then compute the missing matching statistics, i.e., the ones that cross the boundaries between pieces.
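As a sanity check for this scheme, here is a minimal Python sketch (my own function names, not the repo's; it uses a naive quadratic MS computation, not the CST-based one): compute MS per piece, then recompute only the entries whose match runs into a piece boundary, since those are the only ones a per-piece run can get wrong.

```python
def ms_at(s, t, i):
    # naive MS entry: length of the longest prefix of t[i:]
    # that occurs as a substring of s
    k = 0
    while i + k < len(t) and t[i:i + k + 1] in s:
        k += 1
    return k

def matching_statistics(s, t):
    return [ms_at(s, t, i) for i in range(len(t))]

def split_ms(s, t, d):
    # split t into d equal-length pieces, solve each independently,
    # then patch the entries whose match could cross a piece boundary
    n = len(t)
    size = -(-n // d)  # ceil(n / d)
    ms = []
    for p in range(0, n, size):
        ms.extend(matching_statistics(s, t[p:p + size]))
    for i in range(n):
        piece_end = min((i // size + 1) * size, n)
        if i + ms[i] >= piece_end:   # match ran into the piece boundary
            ms[i] = ms_at(s, t, i)   # recompute against the full t
    return ms
```

An entry strictly inside a piece is already correct, because its match ended on a genuine mismatch; only boundary-touching entries need the second pass.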
Restructure the snakefile so that it runs max operations.
Benchmark max operations
rle_partial_max_vector constructor signatures: rle_partial_max_vector(const string&) and rle_partial_max_vector(const string&, counter_t&).
rle_partial_max_vector.noindex(int from, int to, algo), which calls either trivial or djamal.
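A hypothetical Python analogue of that interface, to fix the intended shape (the class and method names mirror the C++ ones but are assumptions; only the trivial linear-scan algorithm is sketched, the djamal variant is omitted):

```python
class PartialMaxVector:
    # toy analogue of rle_partial_max_vector over a plain ms vector
    def __init__(self, ms):
        self.ms = list(ms)

    def noindex(self, frm, to, algo="trivial"):
        # answer max(ms[frm:to)) with no precomputed index
        if algo == "trivial":
            return self._trivial(frm, to)
        raise ValueError("unknown algo: %s" % algo)

    def _trivial(self, frm, to):
        # baseline: one linear scan of the range
        best = self.ms[frm]
        for i in range(frm + 1, to):
            if self.ms[i] > best:
                best = self.ms[i]
        return best
```

The trivial scan is the baseline the faster algorithms are benchmarked against.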
Current rank tests are very superficial and not unified.
Create two types of tests:
The double_rank comes in a vanilla flavor and in a flavor with an extra trick that fails early on the WT search for the symbol.
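To make the two flavors concrete, here is a flat-sequence sketch (a simplification of what the wavelet tree does, with names of my choosing): double_rank answers rank at two positions in one pass, and the "fail" trick exploits that equal ranks mean the symbol is absent from the interval, so the Weiner link is doomed and the WT descent can abort early.

```python
def double_rank(seq, c, i, j):
    # rank(c, i) and rank(c, j) computed together instead of via
    # two independent rank() calls (i <= j)
    ri = sum(1 for x in seq[:i] if x == c)
    rj = ri + sum(1 for x in seq[i:j] if x == c)
    return ri, rj

def has_wl(seq, c, i, j):
    # ri == rj means c does not occur in seq[i:j); a wavelet tree can
    # detect this condition at an internal node and fail early
    ri, rj = double_rank(seq, c, i, j)
    return ri < rj
```

In the actual WT, the early-fail check happens level by level during the descent, which is what makes it cheaper than a full vanilla double rank on failing symbols.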
Perform sandbox tests on nodes that
Perform the full test on all inputs available (you can run this in parallel).
csa.double_rank() was not calling wt.double_rank(). I made the change, but the code is failing. Inspect!
add tests with large blocks
"-ms_path", "../experiments/range_tests/vanilla_compression_techniques/mm.t.ms",
"-compression", "none", "-op", "max", "-algo", "t",
"-block_size", "4",
"-ridx_path", "../experiments/range_tests/vanilla_compression_techniques/max/HG03061-2.t.ms.none.4.ridx",
"-from_idx", "265581", "-to_idx", "265583",
For an n-threaded run, n ms vectors are allocated. See if you can reduce this space requirement.
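One possible direction (a minimal sketch with made-up names, not the repo's threading code): since each thread fills a disjoint slice of the output, the workers can write into a single shared ms vector instead of each allocating a private one.

```python
from threading import Thread

def fill_slice(ms, frm, to):
    # each worker writes only to its own slice of the single shared
    # list, so no per-thread ms vector is needed (the i * i assignment
    # is a stand-in for the real per-position computation)
    for i in range(frm, to):
        ms[i] = i * i

n, nthreads = 20, 5
ms = [0] * n
step = n // nthreads
workers = [Thread(target=fill_slice, args=(ms, p, p + step))
           for p in range(0, n, step)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Disjoint index ranges mean no synchronization is needed on the vector itself; only the merge bookkeeping would have to change.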
What kind of restriction do you have for the input text and the pattern? For arbitrary files, I get the following error:
bin/matching_stats_parallel.x -s_path makefile -t_path readme.md [master] ?? [ 134 ]
building RUNS ...
* loadding index string from makefile
* building the CST of length 10 DONE (0 seconds, 11 leaves)
** filling 1 slices with : 1 threads ...
*** [0 .. 1149)
matching_stats_parallel.x: src/fd_ms/p_runs_vector.hpp:354: fdms::p_runs_vector<cst_t>::p_runs_state fdms::p_runs_vector<cst_t>::fill_slice(fdms::InputSpec, const cst_t&, fdms::p_runs_vector<cst_t>::size_type) [with cst_t = fdms::StreeOhleb<>; fdms::p_runs_vector<cst_t>::size_type = long unsigned int]: Assertion `!st.is_root(u)' failed.
python script_tests.py --v --vv --slow_prg slow_ms.py --nthreads 5 datasets/testing rnd_20_10
INFO:root:testing on input msinput_pair(s_path='datasets/testing/rnd_20_10.s', t_path='datasets/testing/rnd_20_10.t')
DEBUG:utils:running: /Users/denas/Library/Developer/Xcode/DerivedData/fast_ms-dtwaybjykudaehehgvtglnvhcjbp/Build/Products/Debug/fd_ms -lca_parents 0 -nthreads 5 -rank_fail 0 -s_path datasets/testing/rnd_20_10.s -t_path datasets/testing/rnd_20_10.t -use_maxrep 0 -answer 1 -lazy_wl 0
loading input . . |s| = 113, |t| = 20. DONE (0 seconds)
building RUNS ...
* building the CST of length 113 DONE (0 seconds, 114 nodes)
* computing with a, double_rank_no_fail / consecutive_parents strategy, over 5 threads ...
** launching runs computation over : [0 .. 4)
** launching runs computation over : [4 .. 8)
** launching runs computation over : [8 .. 12)
** launching runs computation over : [12 .. 16)
** launching runs computation over : [16 .. 20)
** merging over 4 threads ...
*** launching runs merge of slices 4 and 3 ...
*** launching runs merge of slices 3 and 2 ...
*** launching runs merge of slices 2 and 1 ...
*** launching runs merge of slices 1 and 0 ...
DONE (0 seconds)
* reversing string s of length 113 DONE (0 seconds)
building MS ...
* building the CST of length 113 DONE (0 seconds, 114 nodes)
* computing with a non-lazy, double_rank_no_fail strategy, over 5 threads ...
** launching ms computation over : [0 .. 4)
** launching ms computation over : [4 .. 8)
** launching ms computation over : [8 .. 12)
** launching ms computation over : [12 .. 16)
** launching ms computation over : [16 .. 20)
* total ms length : 77 (with |t| = 20)
DONE (0 seconds)
DEBUG:utils:got: 19 18 17 16 15 14 13 12 11 10 9 8 8 7 6 5 4 3 2 1
DEBUG:utils:running: python slow_ms.py datasets/testing/rnd_20_10.s datasets/testing/rnd_20_10.t
DEBUG:utils:got: 19 18 17 16 15 14 13 12 11 11 10 9 8 7 6 5 4 3 2 1
+ 11
- 8
Having users set up a virtual env for the software is onerous. Remove it.
Scan as you do in the trivial method. Same for the other scans in this method.
Fixes
It takes too long to compute the maxrep vector.
-load_cst flag)
Why not just if ((ms_idx + 1) % block_size == 0)?
Avoid the division and write it as a for loop.
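The two variants the comment contrasts can be sketched as follows (Python for brevity, though the code under review is C++; the function names are mine):

```python
def block_ends_mod(n, block_size):
    # the modulo test from the review comment: one division per iteration
    return [i for i in range(n) if (i + 1) % block_size == 0]

def block_ends_counter(n, block_size):
    # same result with a running counter, no division/modulo in the loop
    ends, filled = [], 0
    for i in range(n):
        filled += 1
        if filled == block_size:
            ends.append(i)
            filled = 0
    return ends
```

On tight inner loops over the ms vector, trading the modulo for an increment-and-compare is a classic micro-optimization.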
ridx is not needed and should be de-allocated after rmq construction
int_from should be a const
The code can now leverage a maxrep bitvector indicating the nodes that correspond to maximal repeats. This information can be used to speed up the WL operations.
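A sketch of one way such a bitvector is typically exploited (my flattened interface, not the repo's; intervals stand in for nodes): a node is a maximal repeat iff its BWT interval contains at least two distinct symbols, so for a non-maximal node a single character access decides whether the Weiner link exists, and the rank-based test can be skipped.

```python
def is_maxrep(bwt, i, j):
    # a node is a maximal repeat iff its BWT interval [i, j) holds at
    # least two distinct symbols; in the real code this is a precomputed
    # bitvector lookup, not an on-the-fly scan
    return len(set(bwt[i:j])) >= 2

def wl_exists(bwt, c, i, j):
    if not is_maxrep(bwt, i, j):
        # non-maximal node: every suffix in the interval is preceded by
        # the same symbol, so one character access decides the WL
        return bwt[i] == c
    # maximal node: fall back to the (double-)rank-based existence test,
    # approximated here by a membership check
    return c in bwt[i:j]
```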
Experiment in the sandbox to test the effect of the fail mechanism, double rank, and the maxrep vector.
Update range_query_profile.cpp
The current optimization is trivial; can we do it at the rank data structure level?
Implement a correct method (e.g., using python/numpy) and integrate it into its own folder with smake. Rename the existing test folder for sum.
this key is not consistent
matching_stats_parallel.cpp should be used instead (with nthreads=1)
construct_im() will write the input to a file and call construct() to build the data structure. However, you have already loaded the input (you just passed it to construct_im()), which means it is being loaded twice.
indexed_ms/fast_ms/src/range_queries.cpp
Line 38 in 8c9af47
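A toy model of the reported behavior (this is an illustration of the pattern, not sdsl's actual code; the READS counter is mine and just makes the double load observable):

```python
import os
import tempfile

READS = {"count": 0}

def construct(index, path):
    # builds the structure by streaming the input from disk
    READS["count"] += 1
    with open(path) as f:
        index["data"] = f.read()

def construct_im(index, text):
    # mimics the issue's description: serialize `text` to a temporary
    # file, then delegate to construct() -- so if `text` was itself read
    # from a file, the input crosses the disk twice
    with tempfile.NamedTemporaryFile("w", delete=False) as f:
        f.write(text)
        path = f.name
    try:
        construct(index, path)
    finally:
        os.unlink(path)
```

The fix the issue points at is to hand the input path to construct() directly instead of loading the file only to pass its contents to construct_im().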