Giter Club home page Giter Club logo

smart's Introduction

SMART

SMART (String Matching Algorithm Research Tool) is a tool which provides a standard framework for researchers in string matching. It helps users to test, design, evaluate and understand existing solutions for the exact string matching problem. Moreover it provides the implementation of (almost) all string matching algorithms and a wide corpus of text buffers.

In the last 40 years of research in computer science string matching was one of the most extensively studied problem, mainly due to its direct applications to such diverse areas as text, image and signal processing, speech analysis and recognition, data compression, information retrieval, computational biology and chemistry. Moreover String matching algorithms are also basic components used in implementations of practical softwares existing under most operating systems.

Since 1970 more than 80 string matching algorithms have been proposed, and more than 50% of them in the last ten years. The smart tool provides a comprehensive collection of most string matching algorithms (222 as of now), implemented in C programming language, and helps researcher to perform experimental results and compare them from a practical point of view. Smart provides a practical and standard platform for testing string matching algorithms and sharing results with the community.

The smart tool provides also a corpus of 12 texts on which the string matching algorithms can be tested. Texts in the corpus are of different types, including natural language texts, genome sequences, protein sequences, and random texts with a uniform distribution of characters.

The release of smart will be available here

How to use?

The documentation about smart is available here

How to compile it from source

To compile the source just download (or clone) this repository and run the file build.sh from terminal (with ./build.sh), it will compile the smart binaries and all the algorithms (the algorithms binaries will be created into bin/).

Constraints

Several algorithms are constraint in a minimal or maximal pattern length m. Then fallbacks are used, which are skipped in the timing benchmarks.

The NUL character is allowed in the pattern and the text (i.e. the memmem() API). Only the new strstr() reference algos libc and musl are skipped then.

The original tests ensured a terminating NUL (i.e. strstr(), not memmem()), but this was fixed, as it was unrealistic. Only this colussi implementation violated this, but the original paper had the missing check included.

Formal verification

Most algos could be formally verified by Reini with the model checker CBMC. But since some algos use double-nested for loops, the depth is quadratic to the input length. Thus formal verification is tuned to low m and n; 32 * 32 = --depth 1024, which would time out within 10m. We use max m 10 and max n (the text length) 36, i.e depth 360. A helper algocfg was added to list all test properties and constraints, and use the best CBMC arguments.

Tests

We test now with added assertions, using the sanitizers (address and UB), and are fuzzing with AFL+. Lots of buffer overflow, memory leaks and undefined behavior errors were fixed. Most allocating algos got optimizations with static stack buffers. The cppcheck and clang-tidy linters were added and all violations fixed, resp. false positives noted. github actions run all the tests, linters and some verifications on all changes.

Summary

According to our experimental results in 2010 (until KBNDM), we conclude that the following algorithms are the most efficient in the following situations. MUSL1 and EPSM added later as the current best.

  • MUSL1 memmem(): short patterns.
  • EPSM: The best SSE2 algo, but unsafe.
  • SA: very short patterns and very small alphabets.
  • TVSBS: very short patterns and small alphabets, and long patterns and large alphabets.
  • FJS: very short patterns and large and very large alphabets.
  • EBOM: short patterns and large and very large alphabets.
  • SBNDM-BMH and BMH-SBNDM: short patterns and very large alphabets.
  • HASHq: short and large patterns and small alphabets.
  • FSBNDM: long patterns and large and very larghe alphabets.
  • SBNDMq: long pattern and small alphabets.
  • LBNDM: very long patterns and very large alphabets.

However the old tests were done with temp. buffers on the heap, not static stack buffers. And some implementations didn't free their temp. buffers.

Benchmarks

Reference

Stringology Conference article url

Lecroq's EXACT STRING MATCHING ALGORITHMS

If you work with SMART, please cite:

@INPROCEEDINGS( PSC2016-9, 
 author = "Simone Faro and Thierry Lecroq and Stefano Borz\`i and Simone Di Mauro and Alessandro Maggio",
 title = "The String Matching Algorithms Research Tool",
 booktitle = "Proceedings of the Prague Stringology Conference 2016",
 address = "Czech Technical University in Prague, Czech Republic",
 editor = "Jan Holub and Jan {\v{Z}}{\v{d}}{\'{a}}rek",
 isbn = "978-80-01-05996-8",
 year = 2016,
 pages = "99--111",
)

Troubleshooting

Problem with shared memory (shmget)? see the documentation

smart's People

Contributors

helias avatar nishihatapalmer avatar rurban avatar simonefaro avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Forkers

yodamaster

smart's Issues

const the search args?

Some algos may change the buffers, or m or n.
Don't do that.

E.g.

if (m > 32)
    m = 32;

caps the search space to 32, which must then be cross-checked with an exact algo for the rest. so we need to keep the old m.

randomly crashing algos

mostly with longer patterns, >=2048 in rand*. detected with the new error log. Some require a trailing zero, some even room after the text!
./test algo or ./test-asan algo do not detect these yet!

  • colussi, gg (Fixed preColussi with 5eeb75e)
  • hor (unrelated test problem, fixed by 15eceef)
  • gs, zt, om, ms, fs, ffs, bfs (unrelated test problem, fixed by 8ef2b56)
  • faoso4
  • tsw (m < n + 3, fixed with 46005a6)
  • graspm (fixed overflow and loops, but still fails)
  • bsdm
  • bsdm2
  • bsdm3
  • bsdm4
  • bsdm8
  • bxs (but fails now)
  • bxs1
  • bxs3
  • bxs4
  • bxs6
  • bxs8
  • sbndm
  • trf
  • tsa-q2 (fixed with ffbf3b8)
  • tso5
  • tvsbs
  • ufndmq4
  • ufndmq6
  • fsbsdm
  • fsbndm-w1
  • fsbndm-w2
  • fsbndm-w4
  • fsbndm-w6
  • fsbndm-w8
  • sbndm-w2
  • sbndm-w4
  • sbndm-w6
  • epsm
  • libc1

run it via make; ./select -all; (sleep 2s; ./kill-tests.sh &); ./smart -text all

missing algos

  • block 2-Block BM
  • BMH2 Boyer-Moore-Horspool with q-grams
  • BMH4 Boyer-Moore-Horspool with q-grams
  • SM Wu-Manber
  • SWM Simplified Wu-Manber
  • SBDM Succint Backward DAWG Matching
  • BSOM Backward Set Oracle Matching
  • Turbo_RF Crochemore_1992.pdf
  • Turbo_BDM Crochemore_1992.pdf
  • SOG Parallel Shift-Or
  • BM_BNDM
  • Turbo_BNDM
  • HG BNDM with q-grams (Salmela et al. 2006)
  • BG BNDM with q-grams (Salmela et al. 2006)
  • HPBM HP Parallel Boyer-Moore, Jeong et al 2015
  • doublehash Bicer, Zhang 2019
  • FT3 https://github.com/lecroq/goodsuff
  • MAG Multi AOSO on q-Gram (c++, https://github.com/rsusik/mag)
  • libc strstr
  • libc memmem
  • musl strstr
  • musl memmem
  • SSEKR https://github.com/WojciechMula/sse4-strstr/ (the generic variants only)
  • KR-SIMD https://mattsills.github.io/2024/03/02/rabin-karp/
  • QF53
  • QF72
  • QF82
  • BRAM3
  • BRAM5
  • BRAM7

static stack alloc with a cutoff (32), less dynamic allocations

I've put them into static globals, but this is not thread-safe. Rather put them into the search function.

  • askip.c
  • akc.c
  • ww.c
  • fdm.c
  • aut.c
  • blim.c
  • bndml.c
  • bom.c (static cells)
  • simon.c (static cells)
  • bom2.c
  • dfdm.c
  • ebom.c
  • epsm.c (too hard)
  • fbom.c
  • graspm.c (WIP, static linked list)
  • ildm1.c
  • ildm2.c
  • include/AUTOMATON.h
  • include/GRAPH.h
  • ldm.c
  • rf.c
  • sebom.c
  • sfbom.c
  • skip.c (static list)
  • skip2.c
  • skip3.c
  • skip4.c
  • skip5.c
  • skip6.c
  • skip7.c
  • skip8.c
  • ssef.c (linked list)
  • trf.c

missing OUTPUT(pos)

some algos do just count++
but this misses the abstraction to report the found index.
Add a DEBUG mode to print the found indices

TODO OUTPUT position MACROS for the following popcount algos:

  • epsm
  • ssecp
  • tsa
  • tsa-q2
  • tso5

document and fix all the constraints

so far algorithms.h lists the pattern minlen.
needed is also pattern and text maxlen for some algos.

and some algos perform the search only for the first 32byte of the pattern: inexact m>32.

also add them to the comments, and provide search_small fallbacks, so that all algos would pass all tests.

detect tty for the CI

eg. getenv("CI") to disable the tty artefacts in the output

  • smart
  • compilesm
  • textgen

micro devices compat (avr, armv6)

Check with avr-gcc, arm32, mips32, ... Add them to the CI.

make CC=avr-gcc BINDIR=\"bin/avr-gcc\"

  • avr
  • armv6
  • mips32
  • risc32
  • stm8-sdcc

avr compat builds are already supported and in the makefile, but we need a linkerscript still.

test: add timeout support observing limit

alarm(limit) SIGALRM and kill it then. see coreutils/src/timeout.c e.g.
change system() to fork/exec

also print the algo to stderr for the errorlog.txt

currently I have to use the following workaround:

kill-tests.sh

#!/bin/sh
smart=`ps xw | perl -anle'$F[4] eq "./smart" && print $F[0]'`
while [ $smart -gt 0 ]
do
    sleep 2s
    ps xw | perl -anle'$F[4]=~m{^\./bin/} && $F[5] eq "shared" && $F[3]!~/^0:0[0-5]/ && kill("TERM",$F[0])'
    smart=`ps xw | perl -anle'$F[4] eq "./smart" && print $F[0]'`
done

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.