
search-index-adder's Introduction

search-index-adder's People

Contributors

fergiemcdowall avatar giladshoham avatar ihenshaw avatar ngpixel avatar nickclaw avatar


search-index-adder's Issues

Possible optimisation strategies

I'm opening this issue to discuss possible optimisations, either in the code or in how users operate their indices. This follows on from fergiemcdowall/search-index#261.

This is just my first batch of ideas - they might be terrible or wrong. Please feel free to criticise them, suggest better alternatives, etc.

Memory

  • Streaming: It looks like all the db instructions for a batch are kept in memory. If the steps in the pipeline were streamed together, memory could be freed as each item is processed.
  • Compressed indexing: A compressed index transform like the Burrows-Wheeler transform (BWT) could reduce the index size (potentially by a lot) with a potentially very small speed penalty. Both keys (at least the part after the prefix) and values can be compressed this way. Note that LevelDB already includes compression (Snappy by default), but it is highly generalised and handles arbitrary data. Because search-index deals with text, an alternative or additional compression could perform better. The BWT has performed extremely well in other sequence-search applications.

Speed

I think speed reductions will come mainly from reducing the number of db operations.

  • Reducing writes: Currently every item in a batch is processed synchronously while the items, and the resulting db instructions, are kept in memory. This could be made more efficient (by how much depends on the corpus) by collapsing the batches in memory before doing anything that touches the database. For example, before writing to the database, you could iterate over a batch and merge any operations that increment counts for the same term. You could even do the term counting with a streaming counter, such as a count-min sketch, and write the final sketch to the database all at once. This changes the number of writes (for the count) to the number of unique terms, rather than the total number of words in the corpus.
  • Reducing reads: It looks like for every insert, there's a query to see if the key already exists. Again this could be reduced by some within-batch collapsing. First look over all the TF or RI operations in the batch, then aggregate them to create a single update operation per key.
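The within-batch collapsing described above could look something like this. A minimal sketch, assuming a flat list of put-style increment operations; the 'TF#' key format is illustrative, not search-index's real key layout:

```javascript
// Hypothetical sketch of within-batch collapsing: merge all count
// increments that target the same key into a single write, so db
// operations scale with unique terms rather than total words.
function collapseBatch (ops) {
  const merged = new Map()
  for (const op of ops) {
    merged.set(op.key, (merged.get(op.key) || 0) + op.value)
  }
  return [...merged].map(([key, value]) => ({ type: 'put', key, value }))
}

// Three raw increments collapse into two writes; 'TF#fox' carries 2.
const ops = [
  { type: 'put', key: 'TF#fox', value: 1 },
  { type: 'put', key: 'TF#fox', value: 1 },
  { type: 'put', key: 'TF#the', value: 1 }
]
const collapsed = collapseBatch(ops)
```

The same pass could also merge read-then-write pairs: aggregate all operations per key first, then issue at most one read and one write per key.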

Final index size

  • Normalisation: There's currently no in-built stemming or other normalisation. Stemming, tense normalisation and case normalisation would all reduce the index size, possibly by a very large factor depending on the corpus. Although this might be out of scope for this module, it could be mentioned in the documentation to help users optimise their own indices.
  • Index filtering: Once an index has been constructed, it could be reduced in size by removing terms with a very low IDF. This amounts to automated stopword filtering: take the IDFs, look at their distribution, choose some cutoff below which the terms are probably just noise, and discard all those terms from the index.
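The IDF-based filtering could be sketched as follows. Assumptions: document frequencies are available per term, and the cutoff is chosen by hand; both the data shape and the cutoff value here are illustrative.

```javascript
// Hypothetical sketch of IDF-based index filtering: compute IDF for
// each term and drop terms whose IDF falls below a cutoff. Very
// common terms (low IDF) behave like stopwords and add little signal.
function idf (totalDocs, docFreq) {
  return Math.log(totalDocs / docFreq)
}

function filterIndex (termDocFreqs, totalDocs, cutoff) {
  const kept = {}
  for (const [term, df] of Object.entries(termDocFreqs)) {
    if (idf(totalDocs, df) >= cutoff) kept[term] = df
  }
  return kept
}

// 'the' appears in 95 of 100 docs (IDF ~ 0.05): dropped.
// 'fox' appears in 3 of 100 docs (IDF ~ 3.5): kept.
const kept = filterIndex({ the: 95, fox: 3 }, 100, 0.5)
```

In practice the cutoff could be picked by inspecting the IDF distribution rather than hard-coding a value.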

TypeError: Cannot read property 'info' of undefined

I'm getting this error when an indexing run finishes:

/home/.../node_modules/search-index/lib/siUtil.js:33
       siOptions.log.info('closed...')
                    ^

TypeError: Cannot read property 'info' of undefined
    at /home/.../node_modules/search-index/lib/siUtil.js:33:21
    at /home/.../node_modules/levelup/lib/levelup.js:141:18
    at /home/.../node_modules/abstract-leveldown/abstract-leveldown.js:65:7
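The crash happens because siOptions.log is undefined when the close handler tries to log. A possible workaround sketch, not search-index's actual patch: guard the logger, or fall back to a no-op logger, so close does not throw when no log option was supplied. The function and variable names here are hypothetical.

```javascript
// Hypothetical fix sketch: default to a no-op logger so close()
// survives when siOptions.log was never configured.
const noopLog = { info: () => {}, error: () => {} }

function safeClose (siOptions, db, done) {
  const log = (siOptions && siOptions.log) || noopLog
  db.close(err => {
    log.info('closed...') // no longer throws when log is undefined
    done(err)
  })
}

// Usage with no log option and a stub db: completes without throwing.
let closed = false
safeClose({}, { close: cb => cb(null) }, err => { closed = (err === null) })
```

Passing an explicit log option when opening the index would sidestep the crash in the meantime.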

Save the document vector in the index

At present doc vectors are split up and distributed around the index in sorted term vector arrays. This makes it easy to find out which documents have the highest term frequency for a given term. However, in cases where there are two or more search terms this approach alone becomes computationally less efficient, since every term vector array has to be traversed to work out scores.

It would be better if the document vector was saved on a per-document basis in the index. This would give the following advantages:

  1. Search would be (much) faster, since you only have to find one doc vector, which then gives you the term frequencies of all terms in the document.
  2. It would allow the large DELETE-DOCUMENT entries to be removed, since all index entries for a document can be derived from its vector.
  3. It opens up the possibility of custom scoring, and other operations that require a vector.
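The per-document vector idea can be sketched like this. A minimal illustration, assuming a Map stands in for the db and a 'DOC_VECTOR#' key prefix that is made up for this example:

```javascript
// Hypothetical sketch: store one term-frequency vector per document,
// so a multi-term query needs one read per candidate document rather
// than one traversal per term.
const index = new Map()

function putDocVector (docId, text) {
  const vector = {}
  for (const term of text.split(/\s+/)) {
    vector[term] = (vector[term] || 0) + 1
  }
  index.set('DOC_VECTOR#' + docId, vector)
}

// Score a multi-term query with a single lookup for the document.
function score (docId, queryTerms) {
  const vector = index.get('DOC_VECTOR#' + docId) || {}
  return queryTerms.reduce((sum, t) => sum + (vector[t] || 0), 0)
}

putDocVector('1', 'the quick brown fox jumps over the lazy dog')
// score('1', ['the', 'fox']) -> 3 (two occurrences of 'the' + one 'fox')
```

Deletion also becomes simple under this scheme: every index entry for a document can be re-derived from its stored vector, as point 2 above notes.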
