dynatrace-oss / hash4j Goto Github PK

Dynatrace hash library for Java

License: Apache License 2.0

Java 99.05% Python 0.20% C++ 0.73% Shell 0.01% Go 0.01%

murmur3 hash-functions hashing-algorithm non-cryptographic-hash-functions java hash wyhash hash-algorithm stream-processing hyperloglog

hash4j's Introduction

hash4j

hash4j is a Java library by Dynatrace that includes various non-cryptographic hash algorithms and data structures that are based on high-quality hash functions.

First steps

To add a dependency on hash4j using Maven, use the following:

<dependency>
  <groupId>com.dynatrace.hash4j</groupId>
  <artifactId>hash4j</artifactId>
  <version>0.17.0</version>
</dependency>

To add a dependency using Gradle:

implementation 'com.dynatrace.hash4j:hash4j:0.17.0'

Hash algorithms

hash4j currently implements the following hash algorithms:

Murmur3
- 32-bit
- 128-bit
Wyhash
- final version 3
- final version 4
Komihash
- version 4.3 (compatible with version 4.7)
- version 5.0 (compatible with version 5.10)
FarmHash
- farmhashna
- farmhashuo
PolymurHash 2.0

All hash functions are thoroughly tested against the native reference implementations and also other libraries like Guava Hashing, Zero-Allocation Hashing, Apache Commons Codec, or crypto (see CrossCheckTest.java).

Usage

The interface allows direct hashing of Java objects in a streaming fashion without first mapping them to byte arrays. This minimizes memory allocations and keeps the memory footprint of the hash algorithm constant regardless of the object size.

class TestClass { 
    int a = 42;
    long b = 1234567890L;
    String c = "Hello world!";
}

TestClass obj = new TestClass(); // create an instance of some test class
    
Hasher64 hasher = Hashing.komihash5_0(); // create a hasher instance

// variant 1: hash object by passing data into a hash stream
long hash1 = hasher.hashStream().putInt(obj.a).putLong(obj.b).putString(obj.c).getAsLong(); // gives 0x89a90f343c3d4862L

// variant 2: hash object by defining a funnel
HashFunnel<TestClass> funnel = (o, sink) -> sink.putInt(o.a).putLong(o.b).putString(o.c);
long hash2 = hasher.hashToLong(obj, funnel); // gives 0x90553fd9c675dfb2L

More examples can be found in HashingDemo.java.

Similarity hashing

Similarity hashing algorithms are able to compute hash signature of sets that allow estimation of set similarity without using the original sets. Following algorithms are currently available:

MinHash
SuperMinHash
SimHash
FastSimHash: A fast implementation of SimHash using a bit hack (see this blog post)

Usage

ToLongFunction<String> stringHashFunc = s -> Hashing.komihash5_0().hashCharsToLong(s);

Set<String> setA = IntStream.range(0, 90000).mapToObj(Integer::toString).collect(toSet());
Set<String> setB = IntStream.range(10000, 100000).mapToObj(Integer::toString).collect(toSet());
// intersection size = 80000, union size = 100000
// => exact Jaccard similarity of sets A and B is J = 80000 / 100000 = 0.8

int numberOfComponents = 1024;
int bitsPerComponent = 1;
// => each signature will take 1 * 1024 bits = 128 bytes

SimilarityHashPolicy policy =
SimilarityHashing.superMinHash(numberOfComponents, bitsPerComponent);
SimilarityHasher simHasher = policy.createHasher();

byte[] signatureA = simHasher.compute(ElementHashProvider.ofCollection(setA, stringHashFunc));
byte[] signatuerB = simHasher.compute(ElementHashProvider.ofCollection(setB, stringHashFunc));

double fractionOfEqualComponents = policy.getFractionOfEqualComponents(signatureA, signatuerB);

// this formula estimates the Jaccard similarity from the fraction of equal components
double estimatedJaccardSimilarity =
    (fractionOfEqualComponents - Math.pow(2., -bitsPerComponent))
        / (1. - Math.pow(2., -bitsPerComponent)); // gives a value close to 0.8

Approximate distinct counting

Counting the number of distinct elements exactly requires space that must increase linearly with the count. However, there are algorithms that require much less space by counting just approximately. The space-efficiency of those algorithms can be compared by means of the storage factor which is defined as the state size in bits multiplied by the squared relative standard error of the estimator

$\text{storage factor} := (\text{relative standard error})^2 \times (\text{state size})$.

This library implements two algorithms for approximate distinct counting:

HyperLogLog: This implementation uses 6-bit registers. The default estimator, which is an improved version of the original estimator, leads to an asymptotic storage factor of $18 \ln 2 - 6 = 6.477$. Using the definition of the storage factor, the corresponding relative standard error is roughly $\sqrt{\frac{6.477}{6 m}} = \frac{1.039}{\sqrt{m}}$. The state size is $6m = 6\cdot 2^p$ bits, where the precision parameter $p$ also defines the number of registers as $m = 2^p$. Alternatively, the maximum-likelihood estimator can be used, which achieves a slightly smaller asymptotic storage factor of $6\ln(2)/(\frac{\pi^2}{6}-1)\approx 6.449$ corresponding to a relative error of $\frac{1.037}{\sqrt{m}}$, but has a worse worst-case runtime performance. In case of non-distributed data streams, the martingale estimator can be used, which gives slightly better estimation results as the asymptotic storage factor is $6\ln 2 = 4.159$. This gives a relative standard error of $\sqrt{\frac{6\ln 2}{6m}} = \frac{0.833}{\sqrt{m}}$. The theoretically predicted estimation errors have been empirically confirmed by simulation results.
UltraLogLog: This algorithm is described in detail in this paper. Like for HyperLogLog, a precision parameter $p$ defines the number of registers $m = 2^p$. However, since UltraLogLog uses 8-bit registers to enable fast random accesses and updates of the registers, $m$ is also the state size in bytes. The default estimator leads to an asymptotic storage factor of 4.895, which corresponds to a 24% reduction compared to HyperLogLog and a relative standard error of $\frac{0.782}{\sqrt{m}}$. Alternatively, if performance is not an issue, the slower maximum-likelihood estimator can be used to obtain a storage factor of $8\ln(2)/\zeta(2,\frac{5}{4}) \approx 4.631$ corresponding to a 28% reduction and a relative error of $\frac{0.761}{\sqrt{m}}$. If the martingale estimator can be used, the storage factor will be just $5 \ln 2 = 3.466$ yielding an asymptotic relative standard error of $\frac{0.658}{\sqrt{m}}$. These theoretical formulas again agree well with the simulation results.

Both algorithms share the following properties:

Constant-time add-operations
Allocation-free updates
Idempotency, adding items already inserted before will never change the internal state
Mergeability, even for data structures initialized with different precision parameters
Final state is independent of order of add- and merge-operations
Fast estimation algorithm that is fully backed by theory and does not rely on magic constants

Usage

Hasher64 hasher = Hashing.komihash5_0(); // create a hasher instance

UltraLogLog sketch = UltraLogLog.create(12); // corresponds to a standard error of 1.2% and requires 4kB

sketch.add(hasher.hashCharsToLong("foo"));
sketch.add(hasher.hashCharsToLong("bar"));
sketch.add(hasher.hashCharsToLong("foo"));

double distinctCountEstimate = sketch.getDistinctCountEstimate(); // gives a value close to 2

Compatibility

HyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches with smaller precision parameter p using sketch.downsize(p). UltraLogLog sketches can be also transformed into HyperLogLog sketches with same precision parameter using HyperLogLog hyperLogLog = HyperLogLog.create(ultraLogLog); as demonstrated in ConversionDemo.java. HyperLogLog can be made compatible with implementations of other libraries which also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits are used to determine the number of leading (or trailing) zeros. Therefore, if the bits of the hash value are permuted accordingly, compatibility can be achieved.

File hashing

This library contains an implementation of Imohash that allows fast hashing of files. It is based on the idea of hashing only the beginning, a middle part and the end, of large files, which is usually sufficient to distinguish files. Unlike cryptographic hashing algorithms, this method is not suitable for verifying the integrity of files. However, this algorithm can be useful for file indexes, for example, to find identical files.

Usage

// create some file in the given path
File file = path.resolve("test.txt").toFile();
try (FileWriter fileWriter = new FileWriter(file, StandardCharsets.UTF_8)) {
    fileWriter.write("this is the file content");
}

// use ImoHash to hash that file
HashValue128 hash = FileHashing.imohash1_0_2().hashFileTo128Bits(file);
// returns 0xd317f2dad6ea7ae56ff7fdb517e33918

Consistent hashing

This library contains various algorithms for the distributed agreement on the assignment of hash values to a given number of buckets. In the naive approach, the hash values are assigned to the buckets with the modulo operation according to bucketIdx = abs(hash) % numBuckets. If the number of buckets is changed, the bucket index will change for most hash values. With a consistent hash algorithm, the above expression can be replaced by bucketIdx = consistentBucketHasher.getBucket(hash, numBuckets) to minimize the number of reassignments while still ensuring a fair distribution across all buckets.

The following consistent hashing algorithms are available:

JumpHash: This algorithm has a calculation time that scales logarithmically with the number of buckets
Improved Consistent Weighted Sampling: This algorithm is based on improved consistent weighted sampling with a constant computation time independent of the number of buckets. This algorithm is faster than JumpHash for a large number of buckets.
JumpBackHash: In contrast to JumpHash, which traverses "active indices" (see here for a definition) in ascending order, JumpBackHash does this in the opposite direction. In this way, floating-point operations can be completely avoided. Further optimizations minimize the number of random values that need to be generated to reach the largest "active index" within the given bucket range in amortized constant time. The largest "active index", defines the bucket assignment of the given hash value. In the worst case, this algorithm consumes an average of 5/3 = 1.667 64-bit random values.

Usage

// create a consistent bucket hasher
ConsistentBucketHasher consistentBucketHasher =
    ConsistentHashing.jumpBackHash(PseudoRandomGeneratorProvider.splitMix64_V1());

long[] hashValues = {9184114998275508886L, 7090183756869893925L, -8795772374088297157L};

// determine assignment of hash values to 2 buckets
Map<Integer, List<Long>> assignment2Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 2)));
// gives {0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}

// determine assignment of hash values to 3 buckets
Map<Integer, List<Long>> assignment3Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 3)));
// gives {0=[7090183756869893925], 1=[9184114998275508886], 2=[-8795772374088297157]}
// hash value -8795772374088297157 got reassigned from bucket 0 to bucket 2
// probability of reassignment is equal to 1/3

Benchmark results

Benchmark results for different revisions can be found here.

Contribution FAQ

Coding style

To ensure that your contribution adheres to our coding style, run the spotlessApply Gradle task.

Python

This project contains python code. We recommend using a python virtual environment in a .venv directory. If you are new, please follow the steps outlined in the official Python documentation for creation and activation. To install the required dependencies including black, please execute pip install -r requirements.txt.

Reference implementations

Reference implementations of hash algorithms are included as git submodules within the reference-implementations directory and can be fetched using git submodule update --init --recursive.

hash4j's People

Contributors

Stargazers

Watchers

Forkers

leicmi mrgreywater acruise thaingo lvjinbiao adedayoominiyi jofernmorais sullis deripas

hash4j's Issues

Add HashValue::toCharArray, HashValue::toString()

Is your feature request related to a problem? Please describe.
There is currently no easy way to get the HashValue result as a byte array. Also the toString function doesn't return a human-readable version of the hash (like it does for e.g. UUID). This is especially a problem for hashes with a size of 128Bit (or larger).

Describe the solution you'd like
A simple implementation of HashValue::toCharArray() and HashValue::toString() for all variants of HashValueXXX.

Describe alternatives you've considered
HashValue128::toCharArray():

ByteBuffer bb = ByteBuffer.wrap(new byte[16]);
bb.putLong(hash.getMostSignificantBits());
bb.putLong(hash.getLeastSignificantBits());
return bb.array();

HashValue128::toString():

new UUID(hash.getMostSignificantBits(), hash.getLeastSignificantBits()).toString()

`merge` methods for SimilarityHash variants

Problem

After adding UltraLogLog support to Apache Pinot I've been looking at adding some of the MinHash variants, but to do this I need a reliable way to merge them together when running SQL queries, or merging rows.

Solution

I'd like the SimilarityHasher interface to also have a merge method that takes two byte[] and returns a byte[] that represents the merged state.

Alternatives

I've tried implementing the merge functions myself, and run into problems like #169
I did consider a half way solution of just streaming hashes into it, but that's also not available in the current interface

Komihash5 tag references in README

May I suggest you to remove compatibility references to all versions beside 5.0, and add a reference to a recently released 5.10? I'd like to remove intermediate tags as to not propagate sub-optimal releases.

Benchmark publishing

Could you please add a topic on the mainpage that lists benchmark results of all zero-allocation hash functions available? I may be asking for a trouble for komihash, but it would be interesting to know anyway.

MinHash output is not mergeable with hash sizes under 64

Describe the bug
The output of MinHash.compute cannot be merged with another output on hash sizes of under 64

To Reproduce

import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;
import com.dynatrace.hash4j.similarity.ElementHashProvider;
import com.dynatrace.hash4j.similarity.SimilarityHasher;
import com.dynatrace.hash4j.similarity.SimilarityHashing;
import com.dynatrace.hash4j.util.PackedArray;

import java.util.Arrays;
import java.util.stream.IntStream;

public class MinHashBugRepro {

    public static byte[] merge(int components, int bits, byte[] left, byte[] right) {
        PackedArray.PackedArrayHandler pah = PackedArray.getHandler(bits);
        byte[] output = pah.create(components);
        IntStream.range(0, components).forEach(idx ->
            pah.set(output, idx, Math.min(pah.get(left, idx), pah.get(right, idx)))
        );
        return output;
    }

    public static void main(String[] args) {
        SimilarityHasher simHasher32 = SimilarityHashing.minHash(10, 32).createHasher();
        SimilarityHasher simHasher64 = SimilarityHashing.minHash(10, 64).createHasher();
        Hasher64 hasher = Hashing.wyhashFinal4();

        Long hello = hasher.hashCharsToLong("hello");
        Long world = hasher.hashCharsToLong("world");

        // when using a 32-bit hash, this merge function doesn't work
        byte[] one32 = simHasher32.compute(ElementHashProvider.ofValues(hello));
        byte[] two32 = simHasher32.compute(ElementHashProvider.ofValues(world));
        byte[] three32 = simHasher32.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged32 = merge(10, 32, one32, two32);
        System.out.println(Arrays.equals(merged32, three32));

        // when using a 64-bit hash this merge function does work
        byte[] one64 = simHasher64.compute(ElementHashProvider.ofValues(hello));
        byte[] two64 = simHasher64.compute(ElementHashProvider.ofValues(world));
        byte[] three64 = simHasher64.compute(ElementHashProvider.ofValues(hello, world));
        byte[] merged64 = merge(10, 64, one64, two64);
        System.out.println(Arrays.equals(merged64, three64));
    }
}

Expected behavior
I expect the data stored in the byte[]s to be mergeable, even if the library does not include a merge function

Additional context
I've tried writing different merge functions where I decode it signed, unsigned, swap the byte order, and have been unable to find a version that would work with the 32-bit output.

Possible fix
I am reasonably sure that this is because of this section:

          if (hash < work[i]) {
            work[i] = hash;
          }

and that it would work if it used an unsigned lessthan, rather than the signed one, or if it masked before comparing

Even more context
I ran across this when testing semilattice + distributive laws in https://github.com/andimiller/schrodinger
~~SuperMinHash merges fine with this merge method on all bit sizes, it does not work on SimHash or FastSimHash~~

Implement imohash or a similar sampling hash

Is your feature request related to a problem? Please describe.
When comparing large files for duplicates, usual hashing algorithms are limited by the speed of which the full file content can be read. This starts to be a problem when hashing large files. By only hashing parts of the data and its size you don't have to read all the data but get a hash that can indicate duplicates with sufficient confidence.

Describe the solution you'd like
Implement imohash or a similar sampling hash algorithm.

Describe alternatives you've considered
Implementing it on top of hash4j.

Discrete-Incremental Hashing

I've noticed that your Java implementation of komihash uses 4-byte values when hashing a string (hashCharsToLong). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please take a note that I've found and tested an "incremental" hashing approach with komihash, it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and thus it's quite a lot more efficient for strings, and even small values like char, int, long. Also note that for register usage efficiency it's preferrable to save r1l, r2l, r3l, r4l multiplication results into corresponding Seed1..4 values, and then XOR them with Seed5-8 values.

To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.

`UltraLogLog::estimate` does not match Algorithm 6

There may be two bugs in the normal- and large-range computations
(respectively) of UltraLogLog::estimate.

When evaluating the contribution of a register not in the small or large
range, UltraLogLog::estimate makes use of the precomputed register
contribution table¹ (due to equation 14). There are
$236=4w_{max}-12-4=4(65-p_{min})-16$ values in this table to cover the
range of register values encountered in practice ($p \geq p_{min}=3$).
Importantly, the table does not vary with respect to p. However, when
indexing into UltraLogLog::REGISTER_CONTRIBUTIONS, the index is taken
to be $r - 4p-4$², which varies with $p$. My assumption was that the
index would be $r - 12$ (when $r < 4w$) here.

Meanwhile, in the large-range calculation, it appears that register
values can only contribute to $c_{4w+k,0 \leq k < 4}$ if they are in the
range $[252,256)$³. In contrast to the register contribution table,
(valid) register values for the large-range contribution do vary with
$p$; I expected the algorithm to check r == 4 * (65 - p) + {0,1,2,3},
instead of fixed values (seemingly corresponding to $p=2$).

Please let me know if I have misunderstood anything in your paper⁴
or UltraLogLog::estimate. Thank you for your time.

Create Hash4j logo

JDK8 support?

Hey folks,

I need a JVM version of WyHash that operates in a compatible way with a particular version of upstream. I don't know offhand what version of upstream I need to match, I can get that with a bit of effort.

It turns out hash4j matches that specific implementation already, unlike any other Java implementation I can find, so it'd be nice to use it, but I need to target JDK8 at the moment.

Would there be interest in adding a JDK8-compatible operating mode to hash4j? I've spent a bit of time on this unfinished experimental approach to the problem, I could continue with this if there's interest.

Thanks!

note: there may be a license concern arising from my wip, don't merge it until that's dealt with

Request to publish JMH results

I'm interested in seeing (roughly) how fast Hash4J's hash functions are, and possibly how fast the comparable Guava/OpenHFT functions are too. I could run these on my own machine, but it would be nice as a quick approximation of what kind of performance numbers are reasonable.

In particular, the blog post describing the need for Hash4J mentions exact compatibility with the reference C++ implementation. I'm curious if the compatibility comes at a noticeable cost compared to the alternative implementations?

Ultraloglog and MinHash?

Hello Ertl,

Since ultraloglog uses more bits than hyperloglog for fast access and update. The idea of locality sensitive hashing (e.g. approximate MinHash to estimate Jaccard), based on setsektch, can it be also applied in the ultraloglog case? or it is not possible but we have to rely on inclusions and exclusion rule after obtaining the cardinality to approximate jaccard index, at the risk of losing locality sensitive hashing property?

Thanks,

Jianshu

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

dynatrace-oss / hash4j Goto Github PK

hash4j's Introduction

hash4j

Content

First steps

Hash algorithms

Usage

Similarity hashing

Usage

Approximate distinct counting

Usage

Compatibility

File hashing

Usage

Consistent hashing

Usage

Benchmark results

Contribution FAQ

Coding style

Python

Reference implementations

hash4j's People

Contributors

Stargazers

Watchers

Forkers

hash4j's Issues

Problem

Solution

Alternatives

Footnotes

Recommend Projects

Recommend Topics

Recommend Org