
odenas / indexed_ms


Fast matching statistics in small space

License: GNU General Public License v3.0

Makefile 2.08% C++ 57.35% Python 2.09% C 1.28% Jupyter Notebook 1.80% HTML 33.02% Shell 0.13% R 0.67% CMake 0.50% TeX 1.03% GDB 0.02% Batchfile 0.01% Gnuplot 0.01% Assembly 0.01% Perl 0.01%
algorithms bioinformatics bwt comparative-genomics data-structures matching-statistics maximal-repeat suffix-tree

indexed_ms's People

Contributors

fcunial, odenas


Forkers

dbelaz

indexed_ms's Issues

avoid testing for repmaximality

One easy optimization we could make is to skip the repeat-maximality test when we reach a node via a parent operation from a node that is already repeat-maximal, since the new node is then certainly repeat-maximal.
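The shortcut can be sketched as follows. All names here are hypothetical stand-ins, not the repo's API; `test_maximality` represents the expensive check that the shortcut avoids.

```python
def is_repeat_maximal(node, parent_was_maximal, test_maximality):
    """Decide repeat-maximality, skipping the expensive test when the node
    was reached via a parent operation from a repeat-maximal node."""
    if parent_was_maximal:
        # Per the observation above, the node is certainly repeat-maximal.
        return True
    return test_maximality(node)
```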

Use backward steps to update intervals.

As of March 31, fd_ms runs much slower due to the removal of operations on the BWT (such as the backward step). The removal was motivated by the equivalence of (v.i, v.j) with the BWT interval of the word whose locus is v.

Revert the code to use the intervals and backward steps.
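For reference, a backward step updates a BWT interval with two rank queries. A minimal stand-alone sketch, with a linear-scan rank as an O(n) stand-in for the wavelet-tree rank the real code would use:

```python
def backward_step(bwt, C, i, j, c):
    """Map the BWT interval [i, j) of a word w to the interval of c*w.
    C[c] is the number of text characters lexicographically smaller than c."""
    rank = lambda pos, ch: sum(1 for x in bwt[:pos] if x == ch)
    return C[c] + rank(i, c), C[c] + rank(j, c)

# Example: the BWT of "banana$" is "annb$aa"; the interval of "a" is [1, 4),
# and stepping backward with 'n' yields the interval of "na".
```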

transparent lazy wl

Is there a way to make lazy_wl completely transparent, in the sense that you don't need to call the follow-up explicitly?
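One way to get that transparency, sketched here with invented names rather than the repo's API, is a proxy that stores the partial result and runs the follow-up only when the node is first actually used:

```python
class LazyWLResult:
    """Proxy around a lazy Weiner-link result: the follow-up runs at most
    once, on first access, so callers never invoke it explicitly."""

    def __init__(self, partial, followup):
        self._partial = partial
        self._followup = followup
        self._node = None
        self._done = False

    def get(self):
        # Resolve the pending follow-up lazily, then cache the result.
        if not self._done:
            self._node = self._followup(self._partial)
            self._done = True
        return self._node
```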

Todo for next report

I also think we could report the time taken by the normal parent operation vs. the lazy parent operation when it comes after a Weiner link (the latter should be slower).

build RUNS in parallel

Split the text into d equal-length pieces and compute the matching statistics on each piece independently. Then compute the missing matching statistics, namely those that cross the boundaries between pieces.
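The slicing itself is straightforward; a sketch (hypothetical helper, not the repo's code) of splitting a text of length n into d pieces, whose boundary-crossing entries would then need a merge pass:

```python
def slice_bounds(n, d):
    """Split [0, n) into d equal-length slices; the last slice absorbs any
    remainder. Matching statistics spanning a boundary between two slices
    must be recomputed in a follow-up merge pass."""
    step = n // d
    return [(k * step, (k + 1) * step if k < d - 1 else n) for k in range(d)]
```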

rank tests

Current rank tests are very superficial and not unified.

Create two types of tests:

  • sandbox: run each function a number of times and measure the exec time
  • full: run the full algorithm on different input types and measure time

double_rank comes in a vanilla flavor and in a variant with an extra trick that fails early during the wavelet-tree search for the symbol.
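The idea behind the fail-early variant can be illustrated with a toy stand-in (linear scans in place of the wavelet-tree traversal; the names are not the repo's):

```python
def double_rank_fail(seq, i, j, c):
    """Compute rank(i, c) and rank(j, c) together, and report failure when
    they coincide: c then does not occur in [i, j), so the Weiner link for
    c does not exist and the search can stop early."""
    r_i = sum(1 for x in seq[:i] if x == c)
    r_j = r_i + sum(1 for x in seq[i:j] if x == c)
    return r_i, r_j, r_i == r_j
```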

Perform sandbox tests on nodes that

  • span small or large intervals
  • have a Weiner link for only some characters of the alphabet; call the rank procedure on all characters

Perform the full test on all inputs available (you can run this in parallel).
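A sandbox run of the kind described above can be as simple as a micro-benchmark loop (an illustrative harness, not the repo's test code; `op` and `nodes` are placeholders for the operation under test and the prepared node set):

```python
import time

def sandbox_time(op, nodes, repeats=1000):
    """Run `op` on every prepared node `repeats` times and return the
    elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for v in nodes:
            op(v)
    return time.perf_counter() - start
```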

double rank

csa.double_rank() was not calling wt.double_rank. I made the change, but the code is now failing. Inspect!

deal with all 0 blocks

"-ms_path", "../experiments/range_tests/vanilla_compression_techniques/mm.t.ms",
"-compression", "none", "-op", "max", "-algo", "t",
"-block_size", "4",
"-ridx_path", "../experiments/range_tests/vanilla_compression_techniques/max/HG03061-2.t.ms.none.4.ridx",
"-from_idx", "265581", "-to_idx", "265583",

matching_stats_parallel.x broken?

What kind of restrictions do you have on the input text and the pattern? For arbitrary files, I get the following error:

bin/matching_stats_parallel.x -s_path makefile -t_path readme.md                                                                                                            [master] ?? [ 134 ]
building RUNS ... 
 * loadding index string  from makefile 
 * building the CST of length 10 DONE (0 seconds, 11 leaves)
 ** filling 1 slices with : 1 threads ...
 *** [0 .. 1149)
matching_stats_parallel.x: src/fd_ms/p_runs_vector.hpp:354: fdms::p_runs_vector<cst_t>::p_runs_state fdms::p_runs_vector<cst_t>::fill_slice(fdms::InputSpec, const cst_t&, fdms::p_runs_vector<cst_t>::size_type) [with cst_t = fdms::StreeOhleb<>; fdms::p_runs_vector<cst_t>::size_type = long unsigned int]: Assertion `!st.is_root(u)' failed.

Parallel construction is buggy

python script_tests.py --v --vv --slow_prg slow_ms.py --nthreads 5 datasets/testing rnd_20_10
INFO:root:testing on input msinput_pair(s_path='datasets/testing/rnd_20_10.s', t_path='datasets/testing/rnd_20_10.t')
DEBUG:utils:running: /Users/denas/Library/Developer/Xcode/DerivedData/fast_ms-dtwaybjykudaehehgvtglnvhcjbp/Build/Products/Debug/fd_ms -lca_parents 0 -nthreads 5 -rank_fail 0 -s_path datasets/testing/rnd_20_10.s -t_path datasets/testing/rnd_20_10.t -use_maxrep 0 -answer 1 -lazy_wl 0
loading input . . |s| = 113, |t| = 20. DONE (0 seconds)
building RUNS ... 
 * building the CST of length 113 DONE (0 seconds, 114 nodes)
 * computing with a, double_rank_no_fail / consecutive_parents strategy, over 5 threads ... 
 ** launching runs computation over : [0 .. 4)
 ** launching runs computation over : [4 .. 8)
 ** launching runs computation over : [8 .. 12)
 ** launching runs computation over : [12 .. 16)
 ** launching runs computation over : [16 .. 20)
 ** merging over 4 threads ... 
 *** launching runs merge of slices 4 and 3 ... 
 *** launching runs merge of slices 3 and 2 ... 
 *** launching runs merge of slices 2 and 1 ... 
 *** launching runs merge of slices 1 and 0 ... 
DONE (0 seconds)
 * reversing string s of length 113 DONE (0 seconds)
building MS ... 
 * building the CST of length 113 DONE (0 seconds, 114 nodes)
 * computing with a non-lazy, double_rank_no_fail strategy, over 5 threads ... 
 ** launching ms computation over : [0 .. 4)
 ** launching ms computation over : [4 .. 8)
 ** launching ms computation over : [8 .. 12)
 ** launching ms computation over : [12 .. 16)
 ** launching ms computation over : [16 .. 20)
 * total ms length : 77 (with |t| = 20)
DONE (0 seconds)
DEBUG:utils:got: 19 18 17 16 15 14 13 12 11 10 9 8 8 7 6 5 4 3 2 1 

DEBUG:utils:running: python slow_ms.py datasets/testing/rnd_20_10.s datasets/testing/rnd_20_10.t
DEBUG:utils:got: 19 18 17 16 15 14 13 12 11 11 10 9 8 7 6 5 4 3 2 1 

	+ 11
	- 8

report

Fixes

  • try the lazy vs. non-lazy experiments with an even bigger tree
  • the context often calls for the speedup of the whole algorithm rather than of just the operation; it might be better to report both

maxrep can be saved for later use

It takes too long to compute the maxrep vector.

  • Allow the script to load a previously computed one (just as with the -load_cst flag)
  • Create a script that dumps a maxrep vector.
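A load-or-build helper in the spirit of -load_cst could look like the sketch below; the file format and names are invented for illustration, and `build` stands in for the expensive maxrep computation.

```python
import os
import pickle

def load_or_build_maxrep(path, build):
    """Return the maxrep bitvector stored at `path`, building it and
    dumping it to disk first if no dump exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    maxrep = build()
    with open(path, "wb") as f:
        pickle.dump(maxrep, f)
    return maxrep
```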

experiments with maxrep, double rank, rank and fail

The code can now leverage a maxrep bitvector indicating the nodes that correspond to maximal repeats.

This information can be used to speed up the WL operations.

Experiment in the sandbox to test the effect of the fail mechanism, double rank, and the maxrep vector.

build tests for range max

implement a correct reference method (e.g., using python/numpy) and integrate it into its own folder with cmake

rename the existing test folder for sum
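The correct reference method suggested above can be a numpy one-liner of this kind (a sketch for generating ground truth; the indexed range-max structure under test should agree with it):

```python
import numpy as np

def range_max_reference(ms, from_idx, to_idx):
    """Ground-truth answer for a range-max query over ms[from_idx:to_idx)."""
    return int(np.max(np.asarray(ms)[from_idx:to_idx]))
```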

call to construct_im

construct_im() writes its input to a file and then calls construct() to build the data structure. However, the input has already been loaded (it was just passed to construct_im), which means it gets loaded twice.
