grab - simple but very fast grep

This is my own experimental, parallel version of grep, written so I can test various strategies for speeding up access to large directory trees. On SSDs it can easily outperform common greps by up to 100%.

Options:


 -O     -- print file offset of match
 -l     -- do not print the matching line (Useful if you want
           to see _all_ offsets; if you also print the line, only
           the first match in the line counts)
 -s     -- single match; don't search the file further after the first match
           (similar to grep on a binary)
 -L     -- machine has low memory; halve the chunk size (default 1GB);
           may be used multiple times
 -I     -- enable highlighting of matches
 -c <n> -- use n cores in parallel (useless and even slower in most situations);
           n <= 1 uses a single core
 -r     -- recurse on directory
 -R     -- same as -r
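
The note on -l (only the first match in a line counts when the line is printed) can be illustrated with a small sketch. This is Python written for illustration only, not grab's code; the function name `match_offsets` and the `print_lines` flag are hypothetical, and the per-line bookkeeping is an assumption about how such behaviour could be implemented.

```python
import re


def match_offsets(data, pattern, print_lines):
    """Return match offsets in `data` (bytes).

    With print_lines=True, only the first match on each line is kept,
    mirroring the behaviour described for printing lines. With
    print_lines=False (as with -l), every offset is kept.
    """
    offsets = []
    line_end = -1  # end offset of the last line already reported
    for m in re.finditer(pattern, data):
        if print_lines:
            if m.start() < line_end:
                continue  # still inside a line that was already printed
            nl = data.find(b"\n", m.start())
            line_end = len(data) if nl == -1 else nl + 1
        offsets.append(m.start())
    return offsets
```

For the input `b"foo foo\nbar foo\n"` and the pattern `b"foo"`, the line-printing mode keeps one offset per line, while the offsets-only mode keeps all three.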

grab uses the PCRE library, so it is basically equivalent to grep -P -a.

Why is it faster?

grab uses mmap(2) and matches the whole file blob without counting newlines (which grep does even if there is no match). This is a lot faster than reading the file in chunks and counting the newlines. If available, grab also uses the PCRE JIT feature. However, speedups are only measurable on fast HDDs or SSDs. In the latter case, the speedup can be drastic (even up to 100%) when matching recursively. Storage is clearly the bottleneck, and parallelizing the search is in most cases even slower, since seeking takes more time than reading linearly, even on SSDs.
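
The idea of mapping a file and matching it as one blob can be sketched in a few lines. This is an illustration in Python, not grab's implementation: it uses Python's `re` module instead of PCRE, and the function name `find_offsets` and its `single_match` parameter are hypothetical.

```python
import mmap
import os
import re


def find_offsets(path, pattern, single_match=False):
    """Return the byte offset of every match of `pattern` (bytes) in a file."""
    rx = re.compile(pattern)
    offsets = []
    with open(path, "rb") as f:
        # An empty file cannot be mapped; treat it as having no matches.
        if os.fstat(f.fileno()).st_size == 0:
            return offsets
        # Map the whole file read-only. The regex engine scans the mapping
        # as one contiguous buffer; no line splitting or counting happens.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as blob:
            for m in rx.finditer(blob):
                offsets.append(m.start())
                if single_match:  # analogous to the -s option
                    break
    return offsets
```

Because the buffer is never split into lines, a pattern may also match across a newline, and the result is naturally a list of file offsets rather than line numbers.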

Additionally, grab skips files that are too small to contain a match for the regular expression. For longer regexes in a recursive search, this can skip a good number of files without even opening them.
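
A minimal sketch of this size filter, in Python for illustration only. The name `candidate_files` is hypothetical, and `min_match_len` is assumed to be supplied by the caller (with PCRE it would come from PCRE_INFO_MINLENGTH; this sketch does not compute it).

```python
import os
import stat


def candidate_files(root, min_match_len):
    """Yield regular files under `root` that are large enough to match.

    Only the directory entry metadata is consulted; files smaller than
    `min_match_len` bytes are skipped without ever being opened.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # vanished or unreadable entry
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices, ...
            if st.st_size < min_match_len:
                continue  # too small to contain any match
            yield path
```

The saving grows with the minimum match length: a pattern that needs at least 40 bytes rules out every file shorter than that with a single stat call each.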

A fairly recent PCRE library is required; on some older systems the build can fail due to PCRE_INFO_MINLENGTH and pcre_study().

Files are mmapped and matched in chunks of 1 GB. For larger files, the last 4096 bytes (1 page) of a chunk are overlapped, so that matches on a 1 GB boundary can be found. In this case, you see the match doubled (but with the same offset).

If you benchmark grep against grab, remember to drop the dentry and page caches between runs (as root): echo 3 > /proc/sys/vm/drop_caches

grab was made to quickly grep through large directory trees. The original grep has a far more complete option set. The speedup for a single-file match is very small, if there is any (stdin cannot be mmapped, and I am too lazy to add a pread() workaround just for this useless case).

For SSDs, the multicore option can make sense. For HDDs it does not, since the head has to be repositioned back and forth between the threads, which kills performance.

Contributors

stealth, grimreaper
