
PTHash

PTHash is a C++ library implementing fast and compact minimal perfect hash functions, as described in the papers [1, 2] (see References below).

Please cite these papers if you use PTHash.

NEWS:

  • The PHOBIC branch of PTHash introduces some algorithmic novelties to build smaller functions and accelerate construction.

Features

  • Minimal and Non-Minimal Perfect Hash Functions
  • Space/Time Efficiency: fast lookup within compressed space
  • External-Memory Scaling
  • Multi-Threaded Construction
  • Configurable: can offer different trade-offs (between construction time, lookup time, and space effectiveness)

Introduction

Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the first n natural numbers is called a minimal perfect hash function (MPHF) for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest. For instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f.
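In symbols, the definition above reads:

$$f : S \to \{0, 1, \dots, n-1\}, \qquad f(x) \neq f(y) \ \text{ for all } \ x \neq y \in S.$$

Since $|S| = n$, such an $f$ is necessarily a bijection: "perfect" because it has no collisions on S, and "minimal" because its codomain has exactly n values.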

PTHash is one such algorithm.

The following guide is meant to provide a brief overview of the library by illustrating its functionalities through some examples.

Integration

Integrating PTHash into your own project is very simple: just get the source code and include the header include/pthash.hpp in your code. No other configuration is needed.

If you use git, the easiest way to add PTHash is via git submodule add, as follows.

git submodule add https://github.com/jermp/pthash.git
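For instance, once the submodule is in place, a single-file program that includes the header could be compiled along these lines (a sketch: my_program.cpp is a placeholder, the library needs C++17 and threads, and you may need additional -I paths for the bundled dependencies):

g++ -std=c++17 -O3 -march=native -I pthash/include my_program.cpp -pthread -o my_program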

Compiling the Code

The code has been tested on Linux with gcc and on macOS with clang (on both Intel and ARM processors, e.g., Apple M1). To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/pthash.git

If you have cloned the repository without --recursive, be sure you pull the dependencies with the following command before compiling:

git submodule update --init --recursive

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make -j

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -D CMAKE_BUILD_TYPE=Debug -D PTHASH_USE_SANITIZERS=On
make

(NOTE: the software runs much more slowly when compiled in debug mode with sanitizers enabled. Use this build only for debugging, not for performance tests.)

Enable All Encoders

By default, you can choose between three encoders to compress the PTHash data structure: partitioned_compact, dictionary_dictionary, and elias_fano, respectively indicated with PC, D-D, and EF in our papers.
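In code, the encoder is selected via the second template argument of single_phf, with the same pattern used in src/example.cpp below. For instance, a sketch that trades some lookup speed for space by using the Elias-Fano encoder:

typedef single_phf<murmurhash2_64,  // base hasher
                   elias_fano,      // encoder type (EF: smaller, but slower lookup than D-D)
                   true             // minimal
                   > pthash_ef_type;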

If you want to test all the encoders we tested in the SIGIR paper [1], you can compile with

cmake .. -D PTHASH_ENABLE_ALL_ENCODERS=On

Enable Large Bucket-Id Type

By default, PTHash assumes there are fewer than $2^{32}$ buckets, hence 32-bit integers are used for bucket ids. To overcome this limit, you can either lower the value of c or recompile with

cmake .. -D PTHASH_ENABLE_LARGE_BUCKET_ID_TYPE=On

to use 64-bit integers for bucket ids.
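To estimate whether 32-bit bucket ids suffice for your input, recall that the papers allocate roughly $c \cdot n / \log_2 n$ buckets for n keys. A minimal sketch of this back-of-the-envelope check (the exact count used by the code may differ slightly):

#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    const double c = 7.0;
    const uint64_t n = 34121566250ULL;  // e.g., ~34 * 10^9 keys
    // Estimated number of buckets, as per the PTHash papers.
    const double buckets = std::ceil(c * n / std::log2(static_cast<double>(n)));
    std::cout << "estimated buckets: " << buckets << '\n';
    if (buckets >= std::pow(2.0, 32)) {
        std::cout << "lower c or recompile with PTHASH_ENABLE_LARGE_BUCKET_ID_TYPE=On\n";
    }
    return 0;
}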

Quick Start

For a quick start, see the source file src/example.cpp (reported below). The example shows how to set up a simple build configuration for PTHash (parameters, base hasher, and encoder).

After compilation, run this example with

./example

which will build a PTHash MPHF on 10M random 64-bit keys using c = 6.0 and alpha = 0.94. It also shows how to serialize the data structure on disk and re-load it for evaluation.

#include <iostream>
#include "../include/pthash.hpp"
#include "util.hpp"  // for functions distinct_keys and check

int main() {
    using namespace pthash;

    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    build_configuration config;
    config.c = 6.0;
    config.alpha = 0.94;
    config.minimal_output = true;  // mphf
    config.verbose_output = true;

    /* Declare the PTHash function. */
    typedef single_phf<murmurhash2_64,         // base hasher
                       dictionary_dictionary,  // encoder type
                       true                    // minimal
                       > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    /* Compute and print the number of bits spent per key. */
    double bits_per_key = static_cast<double>(f.num_bits()) / f.num_keys();
    std::cout << "function uses " << bits_per_key << " [bits/key]" << std::endl;

    /* Sanity check! */
    if (check(keys.begin(), keys.size(), f)) std::cout << "EVERYTHING OK!" << std::endl;

    /* Now evaluate f on some keys. */
    for (uint64_t i = 0; i != 10; ++i) {
        std::cout << "f(" << keys[i] << ") = " << f(keys[i]) << '\n';
    }

    /* Serialize the data structure to a file. */
    std::cout << "serializing the function to disk..." << std::endl;
    std::string output_filename("pthash.bin");
    essentials::save(f, output_filename.c_str());

    /* Now reload from disk and query. */
    pthash_type other;
    essentials::load(other, output_filename.c_str());
    for (uint64_t i = 0; i != 10; ++i) {
        std::cout << "f(" << keys[i] << ") = " << other(keys[i]) << '\n';
        assert(f(keys[i]) == other(keys[i]));
    }

    std::remove(output_filename.c_str());
    return 0;
}

Build Examples

All the examples below must be run from within the directory where the code was compiled (see the section Compiling the Code), using the driver program called build.

Running the command

./build --help

shows the usage of the driver program, as reported below.

Usage: ./build [-h,--help] [-n num_keys] [-c c] [-a alpha] [-e encoder_type] [-p num_partitions] [-s seed] [-t num_threads] [-i input_filename] [-o output_filename] [-d tmp_dir] [-m ram] [--minimal] [--external] [--verbose] [--check] [--lookup]

[-n num_keys]
REQUIRED: The size of the input.

[-c c]
REQUIRED: A constant that trades construction speed for space effectiveness. A reasonable value lies between 3.0 and 10.0.

[-a alpha]
REQUIRED: The table load factor. It must be a quantity > 0 and <= 1.

[-e encoder_type]
REQUIRED: The encoder type. Possible values are: 'compact', 'partitioned_compact', 'compact_compact', 'dictionary', 'dictionary_dictionary', 'elias_fano', 'dictionary_elias_fano', 'sdc', 'all'.
The 'all' type benchmarks all the encoders, and is useful for comparison purposes.

[-p num_partitions]
Number of partitions.

[-s seed]
Seed to use for construction.

[-t num_threads]
Number of threads to use for construction.

[-i input_filename]
A string input file name. If this is not provided, then num_keys random 64-bit keys will be used as input instead. If the filename is '-', the input is read from standard input.

[-o output_filename]
Output file name where the function will be serialized.

[-d tmp_dir]
Temporary directory used for building in external memory. Default is directory '.'.

[-m ram]
Number of gigabytes of RAM to use for construction in external memory.

[--minimal]
Build a minimal PHF.

[--external]
Build the function in external memory.

[--verbose]
Verbose output during construction.

[--check]
Check correctness after construction.

[--lookup]
Measure average lookup time after construction.

[-h,--help]
Print this help text and exit.

Example 1

./build -n 1000000 -c 4.5 -a 0.99 -e dictionary_dictionary -s 727369 --minimal --verbose --check --lookup -o mphf.bin

This example will build an MPHF over 1M random 64-bit keys (generated with seed 727369), using c = 4.5 and alpha = 0.99, and compressing the MPHF data structure with the encoder dictionary_dictionary.

The data structure will be serialized on a binary file named mphf.bin.

It will also check the correctness of the data structure (flag --check) and measure average lookup time (flag --lookup).

Construction will happen in internal memory, using a single processing thread. (Experimental setting of the SIGIR paper [1].)

Example 2

For the following example, we are going to use the strings from the UK-2005 URLs collection, which is also one of the datasets used in the paper [1].

The file is ~300 MB compressed using gzip (2.86 GB uncompressed).

After download, place the dataset in the build directory and run

gunzip uk-2005.urls.gz

to uncompress it. The file contains one string per line, for a total of 39,459,925 strings.

NOTE: Input files are read line by line (i.e., individual strings are assumed to be separated by the character \n). Be sure there are no blank lines.
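For example, you can verify both properties of the file with standard tools:

grep -c '^$' uk-2005.urls   # count of blank lines: should print 0
wc -l uk-2005.urls          # total number of lines: should print 39459925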

The following command will build a MPHF using the strings of the file as input keys, with c = 7.0, alpha = 0.94.

./build -n 39459925 -c 7.0 -a 0.94 -e dictionary_dictionary -s 1234567890 --minimal -i uk-2005.urls --verbose --check --lookup

Example 3

./build -n 39459925 -c 7.0 -a 0.94 -e dictionary_dictionary -s 1234567890 --minimal -i uk-2005.urls --verbose --check --lookup -p 128

This example will run the construction over the same input and parameters used in Example 2, but with 128 partitions. The resulting data structure will consume essentially the same space as that built in Example 2 and will be only slightly slower at lookup.

Example 4

./build -n 39459925 -c 7.0 -a 0.94 -e dictionary_dictionary -s 1234567890 -i uk-2005.urls --verbose --check --lookup --external

This example will run the construction over the same input and parameters used in Example 2, but using external memory. The resulting data structure will be exactly the same as that built in Example 2.

Enable Multi-Threading

You can always specify to use multiple threads for construction with -t. For example, just append -t 4 to any of the previous build commands to use 4 parallel threads. (Also consult our second paper [2] for more information about parallelism.)

Building Perfect Hash Functions (not Minimal)

Just do not specify the --minimal flag when using the build tool.

Reading Keys from Standard Input

You can make the build tool read the keys from standard input using bash pipelines (|) in combination with the option -i -. This is very useful when reading keys from compressed files.

Some examples follow.

for i in $(seq 1 1000000) ; do echo $i ; done > foo.txt
cat foo.txt | ./build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -m 1 -i - -o foo.mph --verbose --external

gzip foo.txt
zcat foo.txt.gz | ./build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -m 1 -i - -o foo.mph --verbose --external

gunzip foo.txt.gz
zstd foo.txt
zstdcat foo.txt.zst | ./build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -m 1 -i - -o foo.mph --verbose --external

Note: you may need to write zcat < foo.txt.gz | (...) on macOS.

One caveat of this approach is that it is not possible to use --check nor --lookup, because these two options need to iterate over the keys a second time, which a stream does not allow.

An Example Benchmark

The script script/run_benchmark.sh runs the 4 trade-off configurations (encoder, alpha, c) described in Section 5.2 of the paper [1] on 100M and 1000M keys.

C-C stands for "compact-compact" encoder; D-D for "dictionary-dictionary"; and EF for "Elias-Fano".

Be sure you run the benchmark after compiling with

cmake .. -D PTHASH_ENABLE_ALL_ENCODERS=On

From within the directory where the code has been compiled, just run

bash ../script/run_benchmark.sh 2> results.json

to reproduce the bottom part of Table 5 of the SIGIR 2021 paper [1]. (All constructions run in internal memory on a single core of the processor).

Below are the results of the benchmark across different processors and compilers. The code was compiled with -O3 and -march=native in all cases.

Intel i9-9900K @ 3.60 GHz, gcc 9.2.1, GNU/Linux 5.4.0-70-generic x86_64

Configuration                      100M keys                          1000M keys
                                   constr.   space      lookup       constr.   space      lookup
                                   (sec)     (bits/key) (ns/key)     (sec)     (bits/key) (ns/key)
(1) C-C, alpha = 0.99, c = 7.0     42        3.36       28           1042      3.23       37
(2) D-D, alpha = 0.88, c = 11.0    19        4.05       46           308       3.94       64
(3) EF,  alpha = 0.99, c = 6.0     45        2.26       49           1799      2.17       101
(4) D-D, alpha = 0.94, c = 7.0     26        3.23       37           689       2.99       55

Intel i7-7700 @ 3.60 GHz, gcc 9.3.0, GNU/Linux 5.4.0-66-generic x86_64

Configuration                      100M keys                          1000M keys
                                   constr.   space      lookup       constr.   space      lookup
                                   (sec)     (bits/key) (ns/key)     (sec)     (bits/key) (ns/key)
(1) C-C, alpha = 0.99, c = 7.0     59        3.36       35           1145      3.23       40
(2) D-D, alpha = 0.88, c = 11.0    27        4.05       57           357       3.94       69
(3) EF,  alpha = 0.99, c = 6.0     86        2.26       66           1918      2.17       110
(4) D-D, alpha = 0.94, c = 7.0     45        3.23       48           796       2.99       61

Intel i7-4790K @ 4.00GHz, gcc 8.3.0, GNU/Linux 5.0.0-27-generic x86_64

Configuration                      100M keys                          1000M keys
                                   constr.   space      lookup       constr.   space      lookup
                                   (sec)     (bits/key) (ns/key)     (sec)     (bits/key) (ns/key)
(1) C-C, alpha = 0.99, c = 7.0     55        3.36       41           1156      3.23       51
(2) D-D, alpha = 0.88, c = 11.0    26        4.05       55           422       3.94       69
(3) EF,  alpha = 0.99, c = 6.0     81        2.26       69           1921      2.17       147
(4) D-D, alpha = 0.94, c = 7.0     42        3.23       47           812       2.99       60

Authors

  • Giulio Ermanno Pibiri
  • Roberto Trani

References

[1] Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH Minimal Perfect Hashing. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021.

[2] Giulio Ermanno Pibiri and Roberto Trani. Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2024.


Issues

The program falls into an infinite loop

Hello!
In the function "search_sequential", when execution reaches "continue; // in-bucket collision detected, try next pilot" (search.hpp, line 155), it falls into an infinite loop.
There are many groups of keys I need to build hashes for. When this happens, I record the group size and push a std::numeric_limits::max() value, and then it works. But this is not a friendly method, because I need to modify the code, update the program, and run it again.
So, is there a better way to solve the problem, or am I doing something wrong when using pthash?

C API

Hello!

Very nice project and research (I don't fully understand it, but it seems very interesting). I wanted to port this to Rust, but it seems like it would be a lot of work.

I thought it might be easier if it had a C API, so I could call it via FFI from Rust.

Is this possible to do?

I can work on it if you give me some direction.
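For reference, a minimal sketch of what such a wrapper could look like (hypothetical names, not part of PTHash; it is compiled as C++ and exposes a plain C interface that FFI can bind to):

// pthash_c.cpp -- hypothetical C wrapper around one fixed PTHash instantiation.
#include <cstdint>
#include "pthash.hpp"  // adjust the include path to your checkout

using mphf_t = pthash::single_phf<pthash::murmurhash2_64,         // base hasher
                                  pthash::dictionary_dictionary,  // encoder
                                  true>;                          // minimal

extern "C" {

// Load a previously serialized function from disk (see essentials::save).
void* pthash_load(char const* filename) {
    auto* f = new mphf_t();
    essentials::load(*f, filename);
    return f;
}

// Evaluate the function on a 64-bit key.
uint64_t pthash_lookup(void* handle, uint64_t key) {
    return (*static_cast<mphf_t*>(handle))(key);
}

// Release the function.
void pthash_free(void* handle) {
    delete static_cast<mphf_t*>(handle);
}

}  // extern "C"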

possibility to adapt builders for partitions with single elements

As part of unit testing, I've been creating hash tables with very few keys (e.g., making an sshash dictionary for a sequence set with a single super-k-mer), frequently triggering this:

unknown file: Failure
C++ exception with description "each partition must contain more than one key: use less partitions" thrown in the test body.

I was wondering if it would be too much trouble to add support for single-element partitions to fix this?

distinct_keys doesn't actually remove duplicates

The following line is meant to remove the duplicates, but it does not shrink the vector, so the duplicates that std::unique moved to its end are still there:

std::unique(keys.begin(), keys.end()); // remove the duplicates

The line should probably be:

keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
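For completeness, the usual sort-unique-erase idiom (std::unique only collapses adjacent duplicates, and erase is what actually shrinks the vector):

#include <algorithm>
#include <cstdint>
#include <vector>

void deduplicate(std::vector<uint64_t>& keys) {
    std::sort(keys.begin(), keys.end());                            // bring duplicates together
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());  // drop the tail left by unique
}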

Use pthash with unordered_map

A previous issue says that pthash has no map implementation, but it seems that we can use pthash as the hash function of std::unordered_map.

std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);

To my surprise, pt_umap performs even worse than the default hash function. Is there a problem with the use of the above code?
Here is my benchmark code:

#include <iostream>
#include <string>
#include <set>
#include <vector>
#include <algorithm>
#include <unordered_set>
#include <unordered_map>
#include <map>
#include <type_traits>
#include <cstdlib>
#include <unistd.h>
#include <thread>
#include <typeindex>
#include <functional>
#include <chrono>

#include "pthash/include/pthash.hpp"
#include "pthash/src/util.hpp"

using namespace std::chrono;

int main()
{
    /* Generate 10M random 64-bit keys as input data. */
    static const uint64_t num_keys = 10000000;
    static const uint64_t seed = 1234567890;
    std::cout << "generating input data..." << std::endl;
    std::vector<uint64_t> keys = pthash::distinct_keys<uint64_t>(num_keys, seed);
    assert(keys.size() == num_keys);

    /* Set up a build configuration. */
    pthash::build_configuration config;
    config.c = 7.0;
    config.alpha = 0.99;
    config.minimal_output = true;  // mphf
    config.verbose_output = false;

    /* Declare the PTHash function. */
    typedef pthash::single_phf<pthash::murmurhash2_64,         // base hasher
            pthash::compact_compact,  // encoder type
            true                    // minimal
    > pthash_type;
    pthash_type f;

    /* Build the function in internal memory. */
    std::cout << "building the function..." << std::endl;
    f.build_in_internal_memory(keys.begin(), keys.size(), config);

    // normal, use the default hash function
    std::unordered_map<uint64_t, uint64_t> normal;
    for (uint64_t x : keys) {
        normal[x] = x;
    }

    int collision = 0;
    for (size_t bucket = 0; bucket < normal.bucket_count(); bucket++) {
        if (normal.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "normal collision=" << collision << ", total=" << normal.bucket_count() << std::endl;

    // vector with pthash
    std::vector<uint64_t> pt_map(keys.size());
    for (uint64_t x : keys) {
        pt_map[f(x)] = x;
    }

    // unordered_map with pthash
    std::unordered_map<uint64_t, uint64_t, pthash_type> pt_umap(keys.size(), f);
    for(uint64_t x : keys) {
        pt_umap[x] = x;
    }

    collision = 0;
    for (size_t bucket = 0; bucket < pt_umap.bucket_count(); bucket++) {
        if (pt_umap.bucket_size(bucket) > 1) {
            collision ++;
        }
    }
    std::cout << "ptu collision=" << collision << ", total=" << pt_umap.bucket_count() << std::endl;

    // correctness
    for (size_t i = 0; i < keys.size(); i += 100) {
        uint64_t key = keys[i];
        assert(pt_map[f(key)] == normal[key]);
        assert(pt_umap[key] == normal[key]);
    }

    // performance (durations are measured in microseconds)
    auto start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = normal[keys[i]];
        (void)tmp;  // prevent unused-variable warnings
    }
    std::cout << "normal cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_map[f(keys[i])];
        (void)tmp;
    }
    std::cout << "pt cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;

    start = system_clock::now();
    for (size_t i = 0; i < keys.size(); i += 100) {
        auto tmp = pt_umap[keys[i]];
        (void)tmp;
    }
    std::cout << "ptu cost=" << duration_cast<microseconds>(system_clock::now() - start).count() << "us" << std::endl;
}

The results are:

generating input data...
building the function...
normal collision=2327031, total=13169977
ptu collision=0, total=10000019
normal cost=59043us
pt cost=28636us
ptu cost=60522us

Is there a map implementation of PTHash?

PTHash has great performance and is header-only, so it's easy to use. However, it currently only provides the hash function: is there a map implementation based on PTHash, for example a map with put() and get()?
Thank you.

"std::runtime_error ... using too many buckets: change bucket_id_type to uint64_t or use a smaller " with c >= 5 and 34*10^9 keys

Hello, I'm trying to use pthash on ~34*10^9 strings (~50 bytes each) and I'm hitting the runtime error in the title when using the main executable like this:

$ ./build --minimal -c 5 -a 0.99 -e dictionary_dictionary -n 34121566250 -i graph.nodes.csv -o graph.nodes.mph -t 48 --external -m 2048 --verbose --check --lookup
2023-11-25 16:39:19: construction starts
terminate called after throwing an instance of 'std::runtime_error'
  what():  using too many buckets: change bucket_id_type to uint64_t or use a smaller c

It happens for c >= 5, but not with c = 4.

Does one need to recompile with the suggested type change to use higher c on such an input set?

Thanks a lot for pthash, it's amazing :-)

the program crashed

hello, I have a question about the function map_parallel in internal_memory_builder_single_phf.hpp:

uint64_t local_num_keys = (tid != config.num_threads - 1) ? num_keys_per_thread : (num_keys - tid * num_keys_per_thread);
local_pairs.resize(local_num_keys);

(num_keys - tid * num_keys_per_thread) may be negative (and, being unsigned, wraps around to a huge value), so a crash happens when executing local_pairs.resize(local_num_keys).

Why isn't there a check to avoid the crash? Just now, my program crashed here: local_num_keys = (457 - 31 * 15) = -8.
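For illustration, here is one way to split n keys among t threads that cannot underflow (a sketch, not the library's code): the first n % t threads each take one extra key.

#include <cstdint>

// Thread `tid` (0-based) processes keys in [begin, end).
void thread_range(uint64_t n, uint64_t t, uint64_t tid,
                  uint64_t& begin, uint64_t& end) {
    const uint64_t base = n / t;
    const uint64_t rem = n % t;  // first `rem` threads take base + 1 keys
    begin = tid * base + (tid < rem ? tid : rem);
    end = begin + base + (tid < rem ? 1 : 0);
}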

runtime error "blank line detected after reading X non-empty lines"

I'm getting the following runtime error with the main build executable:

 == partitioned 0% keysterminate called after throwing an instance of 'std::runtime_error'
  what():  blank line detected after reading 0 non-empty lines

This is a real dataset where there is indeed a valid key which happens to be the empty string.

Is there any particular reason for barfing on empty string?
I've used other MPHs (specifically: GOV) on this dataset in the past and I don't see why it shouldn't be possible to use PTHash.

Thanks in advance,
Cheers

missing include

Hi @jermp ,
I think pthash/include/encoders/util.hpp is missing the cstdint and cassert headers.
g++ 12.3 complains about uintX_t definitions and asserts.

Encoders must be documented in the help text

I'm trying to build maps using build, but the process of choosing the encoder is extremely frustrating. The help text suggests looking at the source file, which contains no documentation. I tried structure names by trial and error (e.g., elias_fano), but, for example, compact_compact doesn't work. The possible values of -e should be documented.

C++14 support

With C++14, I encountered the following error while compiling the example code:

pthash/include/builders/util.hpp:148:24: error: use of class template 'payload_iterator' requires template arguments
                       payload_iterator(pairs.begin() + i - bucket_size));
                       ^

I fixed this error by using C++17.
Is there any way to use pthash under the C++14 standard? Thanks in advance!

Seed not working

Hello @jermp @ByteHamster

I tried pthash on artificial data and it worked. However, when I moved to the hash values of k-mers, it raises an error related to the seed (see below).

c = 6
alpha = 0.94
num_keys = 21817
table_size = 23209
num_buckets = 9083
== map+sort took: 0.002 seconds
== merged 0% pairs
terminate called after throwing an instance of 'pthash::seed_runtime_error'
  what():  seed did not work

I tried different values of the seed, but none of them work.

What could be the problem?

undefined reference to `pthread_create'

Under Ubuntu 18.04.6 and gcc 7.5/11/13, I got an undefined reference to pthread_create error, caused by target_compile_options(PTHASH INTERFACE -pthread), which does not seem to work. Re-adding the original set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -pthread") fixes it.

Why is that?

CC. @ByteHamster any ideas?

-i can't read from a non-mmap-friendly input file when using --external

It is handy to pass "streaming" content as input to build -i, e.g., by using shell process substitution like this: build -i <(zstdcat foo.txt.zst).

Unfortunately that does not work when --external is used.
Apparently in that case (and only in that case) the input file passed to -i is mmap-ed, which obviously fails when the input file is not a regular file (a FIFO, in the case of process substitution).

Would it be possible to either avoid mmap-ing the input file even when --external is used or, alternatively, to have an option to inhibit that selectively?

Thanks!

pthash gets stuck while creating an MPHF

Hello again @jermp

What could be the reason for pthash getting stuck while creating an MPHF (see the log below)?

c = 6
alpha = 0.94
num_keys = 19
table_size = 20
num_buckets = 27
== map+sort took: 0 seconds
== merged 100% pairs
== merge+check took: 0 seconds
== mapping+ordering took 0 seconds
== max bucket size = 6
== lambda = 0.707988
== lambda_1 = 1.41598
== lambda_2 = 0.404565
== num_buckets of size 6 = 1 (estimated with Poisson = 0)
== num_buckets of size 5 = 0 (estimated with Poisson = 0)
== num_buckets of size 4 = 0 (estimated with Poisson = 0)
== num_buckets of size 3 = 0 (estimated with Poisson = 1)
== num_buckets of size 2 = 5 (estimated with Poisson = 3)
== num_buckets of size 1 = 3 (estimated with Poisson = 7)
2023-10-16 15:05:45: search starts

It could be the configuration for a small number of keys. If that is the case, what do you recommend?
Thanks

Support process substitution to read keys from stdin

As suggested by @zacchiro,
we should support process substitution to read keys from stdin.

Once done, we can specify, e.g., -i <(cat file) or -i <(zcat file.gz), etc.

@roberto-trani suggested to use the same trick as used in grep (https://man7.org/linux/man-pages/man1/grep.1.html)
where

     -f FILE, --file=FILE

          Obtain patterns from FILE, one per line.  If this option
          is used multiple times or is combined with the -e
          (--regexp) option, search for all patterns given.  The
          empty file contains zero patterns, and therefore matches
          nothing.  If FILE is - , read patterns from standard
          input.

The only caveat is that reading from stdin prevents iterating twice over the stream, hence we can only build the
function but cannot check its correctness nor benchmark lookup (because this requires iterating again over the keys from
the stream).

Compile issue in gcc 10.2.1

The following line in util.hpp causes a compile error with gcc 10.2.1:

#define MAX_BUCKET_SIZE static_cast<bucket_size_type>(100)

The error message is: expected unqualified-id before ‘static_cast’.

I found that replacing that line with the following fix the problem:
constexpr bucket_size_type MAX_BUCKET_SIZE = 100;

runtime error "seed did not work" when -i points to a non-regular file

With a pthash clone at commit 0d09102 I'm now getting the following error:

$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 34121566250 \
    -p 3412 -t 96 --external -m 1024 -i <(zstdcat graph.nodes.csv.zst) -o graph.nodes.mph \
    -d ./tmp.pthash-nodes.IH4bXCZpg9 \
    --verbose
2023-11-29 07:01:49: construction starts
num_partitions 3412
using 1024 GB of RAM
 == partitioned 100% keys
processing 3412/3412 partitions...
terminate called after throwing an instance of 'pthash::seed_runtime_error'
  what():  seed did not work
$

I've tried two different datasets, encountering the same error.

I've also tried not using a process substitution for -i, i.e., passing a regular file to it rather than zstdcat, and it does not fail in that case, which is quite weird. Note that I did not use the new --mmap flag, so I'm not sure what is different between the two cases from the point of view of ./build.
