Giter Club home page Giter Club logo

lpg's Introduction

A grammar self-index based on LMS substrings

A grammar self-index of a text $T$ (Claude et al. 2012) consists of a grammar $\mathcal{G}$ that only produces $T$ and a geometric data structure that indexes the string cuts of the right-hand sides of $\mathcal{G}$'s rules. This representation uses space proportional to $G$, the size of the grammar, which is small when the text is repetitive. However, the index is slow for pattern matching; it finds the $occ$ occurrences of a pattern $P[1..m]$ in $O((m^{2}+occ)\log G)$ time. The most expensive part is a set of binary searches for the different cuts $P[1..j]P[j+1..m]$ in the geometric data structure. Christiansen et al. (2019) solved this problem by building a locally consistent grammar that only searches for $O(\log m)$ cuts of $P$. Their representation, however, requires significant extra space (tough still in $O(G)$) to store a set of permutations for the nonterminal symbols. In this work, we propose another locally consistent grammar that builds on the idea of LMS substrings (Nong et al. 2009). Our grammar also requires to try $O(\log m)$ cuts when searching for $P$, but it does not need to store permutations. As a result, we obtain a self-index that searches in time $O((m\log m+occ) \log G)$ and is of practical size. Our experiments showed that our index is faster than previous grammar-indexes at the price of increasing the space by a 1.8x factor on average. Other experimental results showed that our data structure becomes convenient when the patterns to search for are long.

Third-party libraries

  1. SDSL-lite
  2. xxHash

Prerequisites

  1. C++ >= 17
  2. CMake >= 3.7
  3. SDSL-lite

The xxHash library is already included in the source files. We include a CMake module that will search for the local installation of the SDSL-library. No need to indicate the path during the compilation.

Installation

Clone repository, enter the project folder and execute the following commands:

mkdir build
cd build
cmake ..
make

Creating the index

./lpg index tests/sample_file.txt

The command above will produce the file sample_file.txt.lpg_idx

Input for the index

The current implementation expects a string ending with the null '\0' character. If you have a collection rather than a single string, you need to concatenate the input into one sequence and then append '\0'. Notice that the program will crash if '\0' appears in other places of the file different from the end.

Search for a pattern

Assuming you are in the folder build inside the repository. You can run a search example as:

./lpg search sample_file.txt.lpg_idx -F ../tests/sample_file.rand_pat_100_10

The -F flag expects a file with the pattern list (one element per line). Alternatively, you can use --p to pass a pattern in place. The in-place option can take multiple inputs. For instance, --p pat1 pat2 pat3 .. or --p pat1 --p pat2 --p pat3. The options -F and --p are complementary, meaning that the program will search for the combined pattern collection.

Result of the search

The -r flag in the command line will report the number of occurrences and the elapsed time individually per input pattern. If you do not use this flag, the program will print the sum of all the pattern occurrences and the total elapsed time to get them.

Disclaimer

This repository is a legacy implementation that has yet to be tested in massive inputs. If you find bugs, please report them here.

Citation

If you use this code, please cite the following paper:

Díaz-Domínguez, D., Navarro, G., & Pacheco, A..
An LMS-based grammar self-index with local consistency properties.
In Proc. 28th Symposium on String Processing and Information (SPIRE 2021). 

lpg's People

Contributors

apachecom avatar ddiazdom avatar hobossphdt avatar

Stargazers

Diego Ortego Prieto avatar

Watchers

Diego Ortego Prieto avatar  avatar

lpg's Issues

Search functionality does report wrong number of occurrences

It seems like the ./lpg search -F patterns.txt index_file command does report a total number of occurrences smaller than the correct one. I am reporting an example using the index constructed using the sample_file.txt file.
The ./lpg search sample_file.lpg_idx -p AAACT command gives the output:

Searching for patterns in the self-index
Index stats
  Index name:                                              sample_file.lpg_idx
  Index size:                                              965407 bytes 
  Orig. text len:                                          605265
  Number of bits in the index per input text symbol (bps): 12.7601
Searching for the patterns 
Locating 1 patterns 
Stats for the pattern collection
  Total elap. time (microsec): 3106
  Total occ: 585
  Time/occ (microsec): 5.3094

However, the correct number of occurrences of the AAACT pattern in the file is 949.

Index construction fails with sample text

The ./lpg index -o ../tests/test_index ../tests/sample_file.txt command to test the index construction with the sample text fails with the output:

measuring peak memory
Input file: ../tests/sample_file.txt
TEXT:605264
Reading input file
  Number of characters: 605264
  Alphabet:             5
  Smallest symbol:      10
  Greatest symbol:      84
Temporal folder: /tmp/lpg_index.NYEWKy
Computing the grammar for the self-index
  Generating the LMS-based locally consistent grammar:    
    Parsing round 1
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          168570
        New nonterminals:    2922
    Parsing round 2
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          54301
        New nonterminals:    24548
    Parsing round 3
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          18523
        New nonterminals:    13338
    Parsing round 4
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          6654
        New nonterminals:    5831
    Parsing round 5
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          4041
        New nonterminals:    2172
    Parsing round 6
      Computing the LMS phrases in the text
      Assigning identifiers to the phrases
      Creating the parse of the text
      Updating symbols status
      Stats:
        Parse size:          3982
        New nonterminals:    55
    Parsing round 7
      Computing the LMS phrases in the text
      Stats:
        Parse size:          3982
        New nonterminals:    0
  Resulting locally consistent grammar:    
    Number of terimnals:    5
    Number of nonterminals: 48947
    Grammar size:           170957
    Compressed string:      3982
  Run-length compressing the grammar
    Stats:
      Grammar size before:        170957
      Grammar size after:         157399
      Number of new nonterminals: 253
      Compression ratio:          0.920694
  Simplifying the grammar
    Stats:
      Grammar size before:  157399
      Grammar size after:   122780
      Deleted nonterminals: 34738 (70.5985%)
      Compression ratio:    0.780056
  Reordering nonterimnals in Colex
  Final grammar: 
    Number of terminals:    5
    Number of nonterminals: 14462
    Grammar size:           122775
    Compressed string:      3982
  Elap. time (microsec): 217327
Building the self-index
ERROR[RULES FILE] 0 APPEARS MORE THAN 1 TIME IN THE GRAMMAR:716

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.