pyindex's Introduction

A leaner, faster backend for Whoosh

Whoosh is great, but it was a bit too slow for my purposes, which require fast access to postings on moderately large data (~200GB). Rather than jump ship to Lucene or another IR framework, I built this backend for Whoosh. In addition to fast postings access, it also turned out to be quite a bit faster than Whoosh's default backend.

Benefits of this backend

  • over 50% reduction in indexing time
  • roughly 50% or greater reduction in query time (48% to 61% in the benchmarks below)
  • substantial reduction in index size
  • supports the default BM25 scoring, and should play nice with much of Whoosh
  • codebase is small and should be easy to understand/modify

Limitations of this backend

  • very lean: only a static index is supported. Documents can be added but not deleted, and there is no support for segments, so adding an individual document requires rewriting the entire index (i.e., you should add documents in bulk).
  • no block quality support (TODO)
  • bare minimum testing (in the notebook)
  • the reduction in index size and a small part of the speed boost come from using this and this, via a Cython wrapper, for postings compression. This is very fast, but requires a recent Intel processor (e.g., Haswell). You may need to recompile the Cython extension for it to work in your environment (run python setup.py build_ext --inplace in the streamvbyte directory to build a compatible .so file). Without it, the backend falls back on pickle, which is not nearly as good (but still faster than default Whoosh).
  • built in Python 3.5 with no eye for backward compatibility, and will not work with Python 2 without modification
  • takes up a lot of memory! All stored data is held in memory, and entire postings lists are read into memory, so this uses a lot more memory than Whoosh (fixing this is a TODO)
  • lots of Whoosh features are not supported (e.g., term vectors, "unique" properties in the schema, etc.)
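
To give a feel for why compressing postings shrinks the index compared with the pickle fallback, here is a minimal pure-Python sketch. It is not the StreamVByte format and not this backend's code: it uses plain delta encoding plus a classic variable-byte (VByte) scheme, and the function names encode_postings and decode_postings are hypothetical, chosen only for illustration.

```python
import pickle


def encode_postings(doc_ids):
    """Delta-encode sorted doc ids, then pack each gap as variable-length bytes."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte, low bits first; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: rebuild the sorted doc id list."""
    doc_ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            doc_ids.append(prev)
            gap = 0
            shift = 0
    return doc_ids


# Illustrative postings list (made-up doc ids, sorted ascending).
postings = [3, 7, 8, 150, 1000, 100000]
packed = encode_postings(postings)
assert decode_postings(packed) == postings

# Compare the packed size with the size of the pickled list.
print(len(packed), len(pickle.dumps(postings)))
```

Small gaps between neighbouring doc ids fit in a single byte, which is where the savings come from. StreamVByte applies a related idea with a layout designed for SIMD decoding, which is why it depends on the processor.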

Todo

  • add block quality and stop storing all postings in memory

The IPython notebook has the benchmark calculations + shows how to use this backend with Whoosh.

Benchmarks

Datasets used are text collections from this site.

  • TCP-ECCO (170 MB uncompressed) can be downloaded here
  • Lincoln (700 KB uncompressed) can be downloaded here

Index time

Dataset                   | Whoosh  | Swhoosh | Time reduction
Lincoln                   | ~1.03s  | ~0.32s  | 69%
TCP-ECCO (single process) | ~175.1s | ~66.6s  | 62%
TCP-ECCO (multi process)  | ~147.7s | ~27.7s  | 81%
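
The percentages in the table above are reductions in elapsed time relative to Whoosh, not speedup factors. A short sketch of the arithmetic, using the timings from the table (the helper name percent_reduction is hypothetical):

```python
def percent_reduction(baseline, improved):
    """Share of the baseline that was saved, as a whole-number percentage."""
    return round(100 * (baseline - improved) / baseline)


# (Whoosh seconds, Swhoosh seconds), copied from the index time table.
index_times = {
    "Lincoln": (1.03, 0.32),
    "TCP-ECCO (single process)": (175.1, 66.6),
    "TCP-ECCO (multi process)": (147.7, 27.7),
}

for name, (whoosh_s, swhoosh_s) in index_times.items():
    reduction = percent_reduction(whoosh_s, swhoosh_s)
    factor = whoosh_s / swhoosh_s  # the same result expressed as "x times faster"
    print(f"{name}: {reduction}% less time, {factor:.1f}x faster")
```

Expressed as a factor, a 69% reduction corresponds to indexing roughly 3.2 times faster.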

Index Size

Dataset  | Whoosh | Swhoosh | Space saved
Lincoln  | 1.5 MB | 700 KB  | 53%
TCP-ECCO | 170 MB | 102 MB  | 40%

Query Time

All queries are disjunctive (OR) queries on TCP-ECCO, using the default BM25 scoring.

Query length | Whoosh   | Swhoosh  | Time reduction
3 words      | 9.07 ms  | 3.83 ms  | 58%
6 words      | 14.36 ms | 5.54 ms  | 61%
30 words     | 92.54 ms | 48.19 ms | 48%
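
For readers unfamiliar with what a disjunctive BM25 query involves, here is a minimal sketch over a tiny in-memory inverted index. It is a textbook Okapi BM25 formula in plain Python, not Whoosh's scoring code and not this backend's code. The documents, the bm25_or function name, and the parameter values K1 = 1.2 and B = 0.75 are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # assumed, commonly used BM25 parameter values

# Made-up toy collection.
docs = {
    0: "four score and seven years ago",
    1: "a house divided against itself cannot stand",
    2: "government of the people by the people for the people",
}

# Inverted index: term -> {doc_id: term frequency}.
postings = defaultdict(dict)
doc_len = {}
for doc_id, text in docs.items():
    tokens = text.split()
    doc_len[doc_id] = len(tokens)
    for term, tf in Counter(tokens).items():
        postings[term][doc_id] = tf

avg_len = sum(doc_len.values()) / len(doc_len)


def bm25_or(query):
    """Score every document that contains at least one query term (OR semantics)."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.split():
        plist = postings.get(term, {})
        if not plist:
            continue
        idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
        for doc_id, tf in plist.items():
            norm = tf + K1 * (1 - B + B * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (K1 + 1) / norm
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for doc_id, score in bm25_or("the people house"):
    print(doc_id, round(score, 3))
```

Each query term requires a walk over its full postings list, so query time grows with query length, and fast postings access matters more as queries get longer.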
