Giter Club home page Giter Club logo

hyperloglog's Introduction

HyperLogLog Distinct Value Estimator

HyperLogLog (HLL) is a probabilistic estimator of the cardinality of a stream of values. Given a bounded amount of memory, it can estimate the cardinality of a stream with bounded relative error and it is possible to trade off memory usage for precision. Formally, the standard error for an HLL with n registers is less than 1.04/sqrt(n).

Example

var HyperLogLog = require('hyperloglog');
var hll = HyperLogLog(12);

// Insert three values, two of them distinct.
hll.add(HyperLogLog.hash("value1"));
hll.add(HyperLogLog.hash("value2"));
hll.add(HyperLogLog.hash("value1"));

assert(2 === hll.count());

API

HyperLogLog.hash(string)

In order to count items, they must first be hashed. The hash() function provides a suitable hash. Its output is an array of four 32 bit postive integers, which, taken together constitute the complete hash of the input string. Currently the implementation is MurmurHash3-128.

HyperLogLog(n)

Construct an HLL data structure with n bit indices into the register array. This implies that there will be 2^n registers. Typical values for n are around 12, which would use 4096 registers and yield less than 1.625% relative error. Higher values use more memory, but provide greater precision.

hll.add(hash)

Adds a hash to the HLL. The hash must be in the format emitted by hash(). If

hll.count()

Get the current estimate of the number of distinct values that have been added.

hll.relative_error()

Get the expected relative error, based on the number of registers. This will not change as values are added. The absolute standard error is the relative error multiplied by the estimated cardinality from count().

hll.output()

Return an external representation of the internal HLL state. This may be useful for serializing, storing, and migrating an HLL. The format returned is the same as that accepted by merge().

hll.merge(data)

Merge another HLL's state into this HLL. The data must be of the same form as that returned by output(). If the incoming data has fewer registers than this HLL, this one will be folded down to be the same size as the incoming data, with a corresponding loss of precision. If the incoming data has more registers, it will be folded down as it is merged. The result is that this HLL will be updated as though it had processed all values that were previously processed by either HLL.

hll1.add(hash1);
hll1.add(hash2);

hll2.add(hash2);
hll2.add(hash3);

hll1.merge(hll2.output());

assert(3 === hll1.count());

Possible Improvements

  • Make HLL use a compressed representation from Google's paper
  • Bit shift registers to use 6 bits per register instead of 8 since we only ever actually use 6 for up to 2^64 (2^2^6).
  • Go to 5 bits per register and add the high cardinality correction. Save 16% on storage for effectively the same standard error.

hyperloglog's People

Contributors

bollenberger avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hyperloglog's Issues

It does not work in node v4.0

Please, update dependencies in package.json. For example, murmurhash3 v0.3.3 and vows v0.8.x works correctly in node v4.0.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.