Giter Club home page Giter Club logo

libcuckoofilter's Introduction

Cuckoo Filter Library

Similar to a Bloom filter, a Cuckoo filter provides a space-efficient data structure designed to answer approximate set-membership queries (e.g. "is item x contained in this set?") Unlike standard Bloom filters, however, Cuckoo filters support deletion. Likewise, Cuckoo filters are more optimal than Bloom variants which support deletion, such as counting Bloom filters, in both space and time.

Cuckoo filters are based on cuckoo hashing. A Cuckoo filter is essentially a cuckoo hash table which stores each key's fingerprint. As Cuckoo hash tables are highly compact, a cuckoo filter often requires less space than conventional Bloom filters for applications that require low false positive rates (< 3%).

Implementation Details

This library was designed to provide a target false positive probability of ~P(0.001) and was hard-coded to use sixteen bits per item and four nests per bucket. As it uses two hashes, it's a (2, 4)-cuckoo filter.

Interface

A Cuckoo filter supports following operations:

  • cuckoo_filter_new(filter, max_key_count, max_kick_attempts, seed): creates a filter
  • cuckoo_filter_free(filter): destroys a filter
  • cuckoo_filter_add(filter, item, item_length_in_bytes): add an item to the filter
  • cuckoo_filter_remove(filter, item, item_length_in_bytes): remove an item from the filter
  • cuckoo_filter_contains(filter, item, item_length_in_bytes): test for approximate membership of an item in the filter

Repository structure

  • example/example.c: an example demonstrating the use of the filter
  • include/cuckoo_filter.h: the public header file
  • src/cuckoo_filter.c: a C-based implementation of a (2, 4)-cuckoo filter
  • tests/test.c: unit tests

Usage

To build this example:

$ make test

Authors

Jonah H. Harris [email protected]

License

The MIT License

References

  • "Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen, and Michael Kaminsky

libcuckoofilter's People

Contributors

jonahharris avatar peko avatar samuelmarks avatar solardiz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

libcuckoofilter's Issues

SIGSERV at cuckoo_filter_lookup

I experimented a bit with your implementation of the filter, and faced with the following error.
Program received signal SIGSEGV, Segmentation fault. 0x0000000000401075 in cuckoo_filter_lookup (filter=0x6039e0, result=0x7fffffffe990, key=0x603010, key_length_in_bytes=22) at cuckoo_filter.c: 217

Pull request with test

Conditions

I create about 1000 small filters to store 100-500 values in each (values 22 bytes long). When I try to call cuckoo_filter_add or cuckoo_filter_check, I'm occasionally get an SIGSERV. The error does not appear immediately, but for some filters. When I test with shorter values - error appears less frequently.

Compilation trivial gcc -O0 -g3 -Wall ...

Tested with clang version 3.7.1 (tags/RELEASE_371/final) and gcc (GCC) 5.3.0

Code

214│   for (size_t ii = 0; ii < filter->nests_per_bucket; ++ii) {
215│     cuckoo_nest_t *n1 =
216│       &filter->bucket[((h1 - 1) * filter->nests_per_bucket) + ii];
217├>    if (fingerprint == n1->fingerprint) {
218│       result->was_found = true;
219│       break;
220│     }

Backtrace

#0  0x0000000000401075 in cuckoo_filter_lookup (
filter=0x6039e0, 
result=0x7fffffffe990, 
key=0x603010, 
key_length_in_bytes=22) 
at ./cuckoo/libcuckoofilter/src/cuckoo_filter.c:217

#1  0x000000000040124f in cuckoo_filter_contains (
filter=0x6039e0, 
key=0x603010, 
key_length_in_bytes=22) 
at ./cuckoo/libcuckoofilter/src/cuckoo_filter.c:305
...

Stack

(gdb) p *result
$2 = {was_found = false, item = {fingerprint = 0, h1 = 0, h2 = 0, padding = 1660944384}}

(gdb) p *filter
$3 = {bucket_count = 16, nests_per_bucket = 4, mask = 65535, max_kick_attempts = 100, seed = 1456275313, padding = 0, victim = {fingerprint = 0, h1 = 0, h2 = 0, padding = 0}, last_victim = 0x0, bucket = {{fingerprint = 0}}}

(gdb) info locals
n1 = 0x200603a08
n2 = 0x0
ii = 0
fingerprint = 0
h1 = 0
h2 = 9

False negatives

I wrapped this as a Python extension and tested it with a range of inputs. What I observed was that I would get false negatives if I added hundreds of thousands of items and then checked if they were found. It turns out this is because the victim support is not implemented and this seems to corrupt the filter past a certain point (maybe dropped victims).

I implemented the victim support and this fixed the false negatives. However, what I then experienced was that if I tried to iterate through the added items and remove them, I encountered false negatives due to filter corruption. It is possible that this was due to my victim handling in the removal code but who knows at this point.

I suggest you alter the readme to indicate this implementation requires more work before it is not suitable for use. You can find my altered code here.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.