Giter Club home page Giter Club logo

zini's Introduction

Zini

Zini (Zig + Mini) is a Zig library providing some succinct data structures.

The main contribution is zini.pthash which is an implementation of PTHash, a minimal perfect hash function construction algorithm. Given a set of n elements, with the only requirement being that you can hash them, it generates a hash function which maps each element to a distinct number between 0 and n - 1. The generated hash function is extremely small, typically consuming less than 4 bits per element, regardless of the size of the input type. The algorithm provides multiple parameters to tune making it possible to optimize for (small) size, (short) construction time, or (short) lookup time.

To give a practical example: In ~0.6 seconds Zini was able to create a hash function for /usr/share/dict/words containing 235886 words. The resulting hash function required in total 865682 bits in memory. This corresponds to 108.2 kB in total or 3.67 bits per word. In comparison, the original file was 2.49 MB and compressing it with gzip -9 only gets it down to 754 kB (which you can't use directly in memory without decompressing it). It should of course be noted that they don't store the equivalent data as you can't use the generated hash function to determine if a word is present or not in the list. The comparison is mainly useful to get a feeling of the magnitudes.

In addition, Zini provides various functionality for dealing with arrays of numbers:

  • zini.CompactArray stores n-bit numbers tightly packed, leaving no bits unused. If the largest value in an array is m then you actually only need n = log2(m) + 1 bits per element. E.g. if the largest value is 270, you will get 7x compression using CompactArray over []u64 as it stores each element using only 9 bits (and 64 divided by 9 is roughly 7).
  • zini.DictArray finds all distinct elements in the array, stores each once into a CompactArray (the dictionary), and creates a new CompactArray containing indexes into the dictionary. This will give excellent compression if there's a lot of repetition in the original array.
  • zini.EliasFano stores increasing 64-bit numbers in a compact manner.
  • zini.darray provides constant-time support for the select1(i) operation which returns the i-th set bit in a std.DynamicBitSetUnmanaged.

Usage

Zini is intended to be used as a library, but also ships the command-line tool zini-pthash. As the documentation is a bit lacking it might be useful to look through tools/zini-pthash/main.zig to understand how it's used.

Note that building zini-pthash depends on having parg cloned in ../parg. You may want to tweak this in build.zig.

USAGE
  ./zig-out/bin/zini-pthash [build | lookup] <options>

COMMAND: build
  Builds hash function for plain text file.

  -i, --input <file>
  -o, --output <file>
  -c <int>
  -a, --alpha <float>
  -s, --seed <int>

COMMAND: lookup

  -i, --input <file>
  -k, --key <key>
  -b, --benchmark

And here's an example run of using zini-pthash.

# Build zini-pthash:
$ zig build -Drelease-safe

# Build a hash function:
$ ./zig-out/bin/zini-pthash build -i /usr/share/dict/words -o words.pth
Reading /usr/share/dict/words...

Building hash function...

Successfully built hash function:
  seed: 12323441790160983030
  bits: 865554
  bits/n: 3.6693741892269993

Writing to words.pth

# Look up an index in the hash function:
$ ./zig-out/bin/zini-pthash lookup -i words.pth --key hello
Reading words.pth...

Successfully loaded hash function:
  seed: 12323441790160983030
  bits: 865554
  bits/n: 3.6693741892269993

Looking up key=hello:
112576

Acknowledgments

Zini is merely an implementation of existing algorithms and techniques already described in the literature:

  • The PTHash algorithm is described by Giulio Ermanno Pibiri and Roberto Trani in arXiv:2104.10402.
  • They also implemented PTHash as a C++ library in https://github.com/jermp/pthash under the MIT license. Zini uses no code directly from that repository, but it has been an invaluable resource for understanding how to implement PTHash in practice.

License

Zini is licensed under the 0BSD license.

zini's People

Contributors

judofyr avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.