Giter Club home page Giter Club logo

quantile-compression's Introduction

Quantile Compression

This rust library compresses and decompresses sequences of numerical data very well. It currently supports the following data types: i32, i64, u32, u64, f32, f64. Smaller data types like i16 can be efficiently compressed by casting to i32. Timestamp support may come soon in the future.

For natural data, it typically shrinks data to 25-40% smaller than what gzip -9 produces, compresses much faster, and decompresses equally quickly.

The intended use case for this algorithm is compressing columnar data, especially for use by Spark and other execution engines.

This IS:

  • lossless
  • order-preserving and bit-preserving (including NaN floats)
  • moderately fast

This is NOT:

  • optimal for sorted data or time series without first taking differences
  • competing for record-breaking decompression speed

For compression and decompression speed benchmarks, see benchmarks.md.

Usage

See the following basic usage. To run something right away, see the example.

use q_compress:{BitReader, I64Compressor, I64Decompressor};

fn main() {
  // your data
  let mut my_ints = Vec::new();
  for i in 0..100000 {
    my_ints.push(i as i64);
  }
  
  // compress
  // max_depth is basically compression level - 6 is generally good.
  // Max depths between 4 and 8 are reasonable.
  let max_depth = 6;
  let compressor = I64Compressor::train(&my_ints, max_depth).expect("failed to train");
  let bytes = compressor.compress(&my_ints).expect("out of range");
  println!("compressed down to {} bytes", bytes.len());
  
  // decompress
  let bit_reader = &mut BitReader::from(bytes);
  let decompressor = I64Decompressor::from_reader(bit_reader).expect("couldn't read compression scheme");
  let recovered = decompressor.decompress(bit_reader);
  println!("got back {} ints from {} to {}", recovered.len(), recovered[0], recovered.last().unwrap());
}

Method

This works by describing each number with a range and an offset. The range specifies an inclusive range [lower, upper] that the number might be in, and the offset specifies the exact position within that range. The compressor chooses a prefix for each range via Huffman codes.

For data sampled from a random distribution, this compression algorithm can reduce byte size to near the theoretical limit of the distribution's Shannon entropy. Ideally it encodes a number k in b bits if 2^-b ~= P(k). We can plot Q(k) = 2^-b to see how close quantile compression gets to the ideal in this example with max_depth=3:

The inefficiency of quantile compression in bits per number is the KL divergence from the approximated distribution Q to the true distribution P.

.qco File Format

Quantile-compressed files consist of a lightweight header (usually <1KB) and then very many short number blocks, each of which usually encodes a single number.

The header is expected to start with a magic sequence of 4 bytes for "qco!" in ascii. The next byte encodes the data type (e.g. i64). The next few bytes encode the count of numbers in the file, and then the count of ranges (or prefixes) used to compress. It then contains metadata for each range: lower and upper bound, the length of its prefix in bits, the prefix. The next bit for each range encodes whether it uses repetitions, which are only used for the most common range in a sparse distribution. If so, the file then encodes a "jumpstart", which is used in number blocks to describe how many repetitions of the range to use.

Each number block has just 2 or 3 parts. First comes the prefix, which indicates a range to use. If that range uses repetitions, a varint for the exact number of repetitions follows, leveraging the jumpstart from earlier. Then an offset (for each repetition if necessary) follows, specifying the exact value within the range.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.