
search-index-adder's Introduction

search-index-adder's People

Contributors

fergiemcdowall avatar giladshoham avatar ihenshaw avatar ngpixel avatar nickclaw avatar


search-index-adder's Issues

Possible optimisation strategies

I'm opening this issue to discuss possible optimisations, either in the code or in how users operate their indices. This follows on from fergiemcdowall/search-index#261.

This is just my first batch of ideas - they might be terrible or wrong. Please feel free to criticise them, suggest better alternatives, etc.

Memory

  • Streaming: It looks like all the db instructions for a batch are kept in memory. If the steps in the pipeline were streamed together, memory could be freed as each item is processed.
  • Compressed indexing: A compressed index transform like the Burrows-Wheeler transform (BWT) could reduce the index size (potentially by a lot) with a potentially very small speed penalty. Both keys (at least the part after the prefix) and values can be compressed this way. Note that LevelDB already includes compression (Snappy by default), but it is highly generalised and handles arbitrary data. Because search-index deals with text, an alternative or additional compression could perform better. The BWT has performed extremely well in other sequence-search applications.

Speed

I think speed reductions will come mainly from reducing the number of db operations.

  • Reducing writes: Currently every item in a batch is processed synchronously while the items, and the resulting db instructions, are kept in memory. This could be made more efficient (by how much depends on the corpus) by collapsing the batches in memory before doing anything that touches the database. For example, before writing to the database, you could iterate over a batch and merge any operations that increment counts for the same term. You could even do the term counting with a streaming counter, such as a count-min sketch, and write the final sketch to the database all at once. This changes the number of writes (for the count) to the number of unique terms, rather than the total number of words in the corpus.
  • Reducing reads: It looks like for every insert, there's a query to see if the key already exists. Again this could be reduced by some within-batch collapsing. First look over all the TF or RI operations in the batch, then aggregate them to create a single update operation per key.
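The within-batch collapsing described above could look something like this. A minimal sketch, assuming a flat list of put-style increment operations; the 'TF#' key format is illustrative, not search-index's real key layout:

```javascript
// Hypothetical sketch of within-batch collapsing: merge all count
// increments that target the same key into a single write, so db
// operations scale with unique terms rather than total words.
function collapseBatch (ops) {
  const merged = new Map()
  for (const op of ops) {
    merged.set(op.key, (merged.get(op.key) || 0) + op.value)
  }
  return [...merged].map(([key, value]) => ({ type: 'put', key, value }))
}

// Three raw increments collapse into two writes; 'TF#fox' carries 2.
const ops = [
  { type: 'put', key: 'TF#fox', value: 1 },
  { type: 'put', key: 'TF#fox', value: 1 },
  { type: 'put', key: 'TF#the', value: 1 }
]
const collapsed = collapseBatch(ops)
```

The same pass could also merge read-then-write pairs: aggregate all operations per key first, then issue at most one read and one write per key.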

Final index size

  • Normalisation: There's currently no in-built stemming or other normalisation. Stemming, tense normalisation and case normalisation would all reduce the index size, possibly by a very large factor depending on the corpus. Although this might be out of scope for this module, it could be mentioned in the documentation to help users optimise their own indices.
  • Index filtering: Once an index has been constructed, it could be reduced in size by removing terms with a very low IDF. This amounts to automated stopword filtering: take the IDFs, look at their distribution, choose some cutoff below which the terms are probably just noise, and discard all those terms from the index.
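The IDF-based filtering could be sketched as follows. Assumptions: document frequencies are available per term, and the cutoff is chosen by hand; both the data shape and the cutoff value here are illustrative.

```javascript
// Hypothetical sketch of IDF-based index filtering: compute IDF for
// each term and drop terms whose IDF falls below a cutoff. Very
// common terms (low IDF) behave like stopwords and add little signal.
function idf (totalDocs, docFreq) {
  return Math.log(totalDocs / docFreq)
}

function filterIndex (termDocFreqs, totalDocs, cutoff) {
  const kept = {}
  for (const [term, df] of Object.entries(termDocFreqs)) {
    if (idf(totalDocs, df) >= cutoff) kept[term] = df
  }
  return kept
}

// 'the' appears in 95 of 100 docs (IDF ~ 0.05): dropped.
// 'fox' appears in 3 of 100 docs (IDF ~ 3.5): kept.
const kept = filterIndex({ the: 95, fox: 3 }, 100, 0.5)
```

In practice the cutoff could be picked by inspecting the IDF distribution rather than hard-coding a value.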

TypeError: Cannot read property 'info' of undefined

I'm getting this error when an indexing run finishes:

/home/.../node_modules/search-index/lib/siUtil.js:33
       siOptions.log.info('closed...')
                    ^

TypeError: Cannot read property 'info' of undefined
    at /home/.../node_modules/search-index/lib/siUtil.js:33:21
    at /home/.../node_modules/levelup/lib/levelup.js:141:18
    at /home/.../node_modules/abstract-leveldown/abstract-leveldown.js:65:7
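The crash happens because siOptions.log is undefined when the close handler tries to log. A possible workaround sketch, not search-index's actual patch: guard the logger, or fall back to a no-op logger, so close does not throw when no log option was supplied. The function and variable names here are hypothetical.

```javascript
// Hypothetical fix sketch: default to a no-op logger so close()
// survives when siOptions.log was never configured.
const noopLog = { info: () => {}, error: () => {} }

function safeClose (siOptions, db, done) {
  const log = (siOptions && siOptions.log) || noopLog
  db.close(err => {
    log.info('closed...') // no longer throws when log is undefined
    done(err)
  })
}

// Usage with no log option and a stub db: completes without throwing.
let closed = false
safeClose({}, { close: cb => cb(null) }, err => { closed = (err === null) })
```

Passing an explicit log option when opening the index would sidestep the crash in the meantime.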

Save the document vector in the index

At present doc vectors are split up and distributed around the index in sorted term vector arrays. This makes it easy to find out which documents have the highest term frequency for a given term. However, in cases where there are two or more search terms this approach alone becomes computationally less efficient, since every term vector array has to be traversed to work out scores.

It would be better if the document vector was saved on a per-document basis in the index. This would give the following advantages:

  1. Search would be (much) faster, since you only have to find one doc vector, which then gives you the term frequencies of all terms in the document.
  2. It would allow the large DELETE-DOCUMENT entries to be removed, since all index entries for a document can be derived from its vector.
  3. It opens up the possibility of custom scoring, and other operations that require a vector.
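The per-document vector idea can be sketched like this. A minimal illustration, assuming a Map stands in for the db and a 'DOC_VECTOR#' key prefix that is made up for this example:

```javascript
// Hypothetical sketch: store one term-frequency vector per document,
// so a multi-term query needs one read per candidate document rather
// than one traversal per term.
const index = new Map()

function putDocVector (docId, text) {
  const vector = {}
  for (const term of text.split(/\s+/)) {
    vector[term] = (vector[term] || 0) + 1
  }
  index.set('DOC_VECTOR#' + docId, vector)
}

// Score a multi-term query with a single lookup for the document.
function score (docId, queryTerms) {
  const vector = index.get('DOC_VECTOR#' + docId) || {}
  return queryTerms.reduce((sum, t) => sum + (vector[t] || 0), 0)
}

putDocVector('1', 'the quick brown fox jumps over the lazy dog')
// score('1', ['the', 'fox']) -> 3 (two occurrences of 'the' + one 'fox')
```

Deletion also becomes simple under this scheme: every index entry for a document can be re-derived from its stored vector, as point 2 above notes.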
