Giter Club home page Giter Club logo

halva's Introduction

halva

Front-compressed lexicon data structure.

Purpose

This is a set-like data structure that uses front-coding to compress a lexicon. Such a data structure supports the same operations as an ordered set (checking for the presence of a word, iterating over the lexicon in lexicographical order), but is much smaller.

The following table shows the size of a few dictionaries before and after compression. The decompressed column gives the size of the dictionary as encoded in a text file, one word per line. The compressed column gives the size of the corresponding compressed lexicon in memory.

dictionary      language    decompressed  compressed
---             ---         ---           ---
Unix            English     920K          384K
Corriere        Italian     404K          216K
Duden           German      2.5M          1.4M
Robert          French      2.0M          748K
Monier-Williams Sanskrit    2.5M          1.2M

This data structure supports ordered minimal perfect hashing: there is a one-to-one correspondence between a word and an ordinal representing its position in the lexicon. This allows finding a word given its ordinal, and, conversely, finding the ordinal corresponding to a word, as in a sorted array.

Finally, strings containing embedded zeroes are supported.

If you're interested in these matters, you might want to check my minimal acyclic finite-state automaton library, which implements functionalities similar to this one.

Building

There is no build process. Compile halva.c together with your source code, and use the interface described in halva.h. You'll need a C99 compiler, which means GCC or CLang on Unix.

A command-line tool halva is included. Compile and install it with the usual invocation:

$ make && sudo make install

A Lua binding is also available. See the file README.md in the lua directory for instructions about how to build and use it.

Usage

The C API is documented in halva.h. See the file example.c for a concrete example.

Details

References

This implementation draws from the chapters on inverted index compression in each of these books:

Encoding

Lexica contain three sections: a header, an array of bucket pointers, and a series of buckets of variable length.

The header contains the following fields, encoded as 32-bit integers, in network order:

byte offset   field
---           ---
0             magic identifier (the string "hlva")
4             data format version (currently, 1)
8             number of words in the lexicon
12            size in bytes of the buckets region

The bucket pointers array encodes the position, in the buckets region, of each nth word in the lexicon, n being hardcoded to the value HV_BLOCKING_FACTOR. Pointers are encoded as 32-bit integers, in network order.

The bucket region consists in a series of buckets. Each bucket (except, maybe, the last one) encodes HV_BLOCKING_FACTOR words. The first word of each bucket is prefixed with a single byte encoding its length. Remaining words are not written in full. The prefix a given word shares with the word that precedes it is replaced with one or two byte encoding the length of this prefix and the number of remaining bytes in the word. When possible, each of these numbers is stored into a nibble, otherwise a byte.

halva's People

Contributors

michaelnmmeyer avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.