Elias-Fano Compression in Terrier 5

This package provides Elias-Fano compression for docids, frequencies and positions in Terrier 5. At its core, it is a refactoring of the Elias-Fano compression included in MG4J, a free full-text search engine for large document collections written in Java. The compression scheme is described in the paper:

@inproceedings{Vigna:2013:QI:2433396.2433409,
	author = {Vigna, Sebastiano},
	title = {Quasi-succinct Indices},
	booktitle = {Proceedings of the Sixth ACM International Conference on Web Search and Data Mining},
	series = {WSDM '13},
	year = {2013},
	isbn = {978-1-4503-1869-3},
	location = {Rome, Italy},
	pages = {83--92},
	numpages = {10},
	doi = {10.1145/2433396.2433409},
	acmid = {2433409},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {compressed indices, succinct data structures},
}
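
For readers unfamiliar with the encoding, the following toy sketch illustrates the basic Elias-Fano idea on a sorted list of docids. It is not the code used by this package or by MG4J, which rely on far more careful bit-level layouts and skipping structures; the class name EliasFanoSketch and its methods are made up for this illustration. Each value is split into l = floor(log2(u/n)) low bits, stored verbatim, and a high part, stored in unary, which gives roughly 2 + log2(u/n) bits per element.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy Elias-Fano codec for a sorted array of non-negative ints (illustration only). */
final class EliasFanoSketch {
    final int n;                        // number of encoded elements
    final int lowerBits;                // l: low bits stored verbatim per element
    final BitSet lower = new BitSet();  // element i occupies bits [i*l, (i+1)*l)
    final BitSet upper = new BitSet();  // bit (high_i + i) is set for element i

    /** sorted must be non-decreasing; upperBound must exceed its largest value. */
    EliasFanoSketch(int[] sorted, int upperBound) {
        n = sorted.length;
        long ratio = (n == 0) ? 0 : (long) upperBound / n;
        // l = floor(log2(u / n)), clamped at 0 when u / n < 2
        lowerBits = (ratio <= 1) ? 0 : 63 - Long.numberOfLeadingZeros(ratio);
        for (int i = 0; i < n; i++) {
            int x = sorted[i];
            for (int b = 0; b < lowerBits; b++) {
                if (((x >>> b) & 1) != 0) {
                    lower.set(i * lowerBits + b);
                }
            }
            // Unary-style coding of the high part: one set bit per element.
            upper.set((x >>> lowerBits) + i);
        }
    }

    /** Rebuilds the original sequence by scanning the set bits of the upper array. */
    int[] decode() {
        int[] out = new int[n];
        int pos = -1;
        for (int i = 0; i < n; i++) {
            pos = upper.nextSetBit(pos + 1);  // position of the i-th set bit
            int high = pos - i;
            int low = 0;
            for (int b = 0; b < lowerBits; b++) {
                if (lower.get(i * lowerBits + b)) {
                    low |= 1 << b;
                }
            }
            out[i] = (high << lowerBits) | low;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docids = {3, 4, 7, 13, 14, 15, 21, 43};
        EliasFanoSketch ef = new EliasFanoSketch(docids, 44);
        System.out.println("lower bits per element: " + ef.lowerBits);
        System.out.println("decoded: " + Arrays.toString(ef.decode()));
    }
}
```

The sketch only supports sequential decoding. The random access and skipping that make quasi-succinct indexes attractive for query processing are deliberately left out.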

This package is free software distributed under the GNU Lesser General Public License (LGPL) v3.

Pre-requisites

None.

Generating an Elias-Fano Inverted Index using CLITools

This package plugs the encoding-decoding procedures for quasi-succinct indexes implemented by MG4J into the Terrier index data structures.

Given a Terrier plain old index, the following steps can be used to generate a new quasi-succinct index compatible with Terrier 5 APIs.

If terrier-eliasfano is not already available to you (e.g. from Maven Central), git clone the repository and install it locally:

mvn -DskipTests clean install

Tell Terrier to load the plugin by appending the following line to the terrier.properties file of your Terrier distribution:

terrier.mvn.coords=it.cnr.isti.hpclab:terrier-eliasfano:1.5

Then, to convert an existing index:

bin/terrier ef-recompress /path/to/new/index cw09b

The output quasi-succinct index will have the prefix cw09b. You can change the source index using the -I option, e.g.,

bin/terrier ef-recompress -I /path/to/old/index/data.properties /path/to/new/index cw09b

The -p option sets the degree of parallelism, and the -b option controls whether block positions are compressed. To view the help information for ef-recompress, run:

bin/terrier help ef-recompress

Generating an Elias-Fano Inverted Index using scripts

This package plugs the encoding-decoding procedures for quasi-succinct indexes implemented by MG4J into the Terrier index data structures.

Given a Terrier plain old index, the following steps can be used to generate a new quasi-succinct index compatible with Terrier 5 APIs.

If terrier-eliasfano is not already available to you (e.g. from Maven Central), git clone the repository and build it:

mvn -DskipTests clean package appassembler:assemble

Then, to convert an existing index:

./target/bin/ef-convert -index /path/to/old/index/cw09b.properties -path /path/to/new/index/ -prefix cw09b.ef

The input index has the prefix cw09b. The output quasi-succinct index will have the prefix cw09b.ef.

The ef-convert tool accepts the following options.

-path [String] (required)

Path of the directory that will hold the output Terrier index.

-prefix [String] (required)

Prefix of the output Terrier index. If an index with the given prefix already exists, the execution will be aborted.

-index [String] (required)

Fully qualified filename of one of the files of an existing Terrier index. The parameter will be split automatically into a Terrier path and prefix.
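
As an illustration of the split described above, here is a minimal sketch that handles only the .properties file of an index. It is not the tool's actual parsing code; the class IndexNameSplitSketch and its split method are hypothetical names, and the rule of stripping the ".properties" suffix is an assumption made for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: splits an index .properties filename into path and prefix. */
final class IndexNameSplitSketch {
    private static final String SUFFIX = ".properties";

    /** Returns {directory, prefix}; assumes the argument names a .properties file. */
    static String[] split(String indexFile) {
        Path file = Paths.get(indexFile);
        String name = file.getFileName().toString();
        if (!name.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("Expected a " + SUFFIX + " file: " + indexFile);
        }
        // The prefix may itself contain dots (e.g. cw09b.ef), so only the suffix is stripped.
        String prefix = name.substring(0, name.length() - SUFFIX.length());
        Path parent = file.getParent();
        String dir = (parent == null) ? "." : parent.toString();
        return new String[] {dir, prefix};
    }

    public static void main(String[] args) {
        String[] parts = split("/path/to/old/index/cw09b.properties");
        System.out.println("path = " + parts[0] + ", prefix = " + parts[1]);
    }
}
```

Under this assumption, the -index value used in the example command would map to the path /path/to/old/index and the prefix cw09b.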

-b (optional)

Compress positions with Elias-Fano. Default: false

-p [Number] (optional)

Number of threads to use. The value is capped at the number of available cores. Default: 1.

Multi-threaded compression is experimental. Caution is advised, because the threads compete for available memory!
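
The cap on -p can be pictured with the following sketch. It is an assumption about the behaviour described above, not the tool's code: the class ThreadCountSketch and the method effectiveThreads are hypothetical, and the lower bound of one thread is added here only to keep the example well defined.

```java
/** Hypothetical illustration of capping a requested thread count at the core count. */
final class ThreadCountSketch {
    /** Clamps the requested thread count to the range [1, availableCores]. */
    static int effectiveThreads(int requested, int availableCores) {
        return Math.max(1, Math.min(requested, availableCores));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Asking for more threads than cores yields the core count.
        System.out.println("effective threads: " + effectiveThreads(64, cores));
    }
}
```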

Notes

  • supports (block) positions
  • does not support indices using fields

Credits

Developed by Nicola Tonellotto, ISTI-CNR. Contributions by Craig Macdonald, University of Glasgow, and Matteo Catena, ISTI-CNR.
