
sparkey's Introduction

Sparkey is a simple constant key/value storage library. It is mostly suited for read-heavy systems with infrequent large bulk inserts. It includes both a C library for working with Sparkey index and log files (libsparkey), and a command line utility for getting info about and reading values from a Sparkey index/log (sparkey).

Travis

Continuous integration with Travis.

Dependencies

  • GNU build system (autoconf, automake, libtool)
  • Snappy

Optional

  • Doxygen (for generating the API documentation)

Building

autoreconf --install
./configure
make
make check

API documentation can be generated with doxygen.

Installing

sudo make install && sudo ldconfig

Related projects

  • sparkey-java - a Java implementation of Sparkey (mentioned under Limitations below)

Description

Sparkey is an extremely simple persistent key-value store. You could think of it as a read-only hashtable on disk and you wouldn't be far off. It is designed and optimized for some server-side use cases at Spotify, but it is written to be completely generic and makes no assumptions about what kind of data is stored.

Some key characteristics:

  • Supports data sizes up to 2^63 - 1 bytes.
  • Supports iteration, get, put and delete.
  • Optimized for bulk writes.
  • Immutable hash table.
  • Any number of concurrent independent readers.
  • Only one writer at a time per storage unit.
  • Cross-platform storage files.
  • Low overhead per entry.
  • Constant read startup cost.
  • Low number of disk seeks per read.
  • Support for block-level compression.
  • Data agnostic: it just maps byte arrays to byte arrays.

What it's not:

  • It's not a distributed key value store - it's just a hash table on disk.
  • It's not a compacted data store, but that can be implemented on top of it, if needed.
  • It's not robust against data corruption.

The use case we have for it at Spotify is serving data that rarely gets updated to users or other services. The fast and efficient bulk writes make it feasible to periodically rebuild the data, and the fast random access reads make it suitable for high throughput, low latency services. For some services we have been able to saturate network interfaces while keeping CPU usage very low.

Limitations

The hash writing process requires a memory allocation of num_entries * 16 * 1.3 bytes. This means that you may run out of memory if you try to write a hash index for too many entries. For instance, with 16 GB of available RAM you can write at most about 825 million entries.

This limitation has been removed in sparkey-java, but it has not yet been implemented in this version.

Usage

Sparkey is meant to be used as a library embedded in other software. Take a look at the API documentation, which gives examples of how to use it.
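To give a rough feel for the flow, here is a minimal sketch (not an authoritative example from the project; it only uses calls that also appear in the issue test cases further down this page, so consult the generated API documentation for exact signatures). It writes a small log, builds the hash index, and reads the first key back through a log iterator. Random lookups against the hash file go through sparkey_hash_open and sparkey_hash_get, which are mentioned in the issues below.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sparkey/sparkey.h>

/* Abort on any non-success return code. */
static void check(int rc) {
  if (rc != SPARKEY_SUCCESS) {
    fprintf(stderr, "sparkey error: %s\n", sparkey_errstring(rc));
    exit(1);
  }
}

int main(void) {
  /* 1. Bulk-append key/value pairs to an uncompressed log file. */
  sparkey_logwriter *writer = NULL;
  check(sparkey_logwriter_create(&writer, "example.spl", SPARKEY_COMPRESSION_NONE, 0));
  check(sparkey_logwriter_put(writer, 4, (uint8_t *) "key1", 6, (uint8_t *) "value1"));
  check(sparkey_logwriter_put(writer, 4, (uint8_t *) "key2", 6, (uint8_t *) "value2"));
  check(sparkey_logwriter_close(&writer));

  /* 2. Build the immutable hash index from the log (4-byte hash values,
   *    as in the sparkey_hash_write test case further down this page). */
  check(sparkey_hash_write("example.spi", "example.spl", 4));

  /* 3. Open the log and read back the first key with an iterator. */
  sparkey_logreader *reader = NULL;
  sparkey_logiter *iter = NULL;
  check(sparkey_logreader_open(&reader, "example.spl"));
  check(sparkey_logiter_create(&iter, reader));
  check(sparkey_logiter_next(iter, reader));

  uint64_t keylen = sparkey_logiter_keylen(iter);
  uint8_t *buf = malloc(keylen + 1);
  uint64_t actual = 0;
  check(sparkey_logiter_fill_key(iter, reader, keylen, buf, &actual));
  buf[actual] = '\0';
  printf("first key: %s\n", buf); /* prints "key1" */

  free(buf);
  sparkey_logiter_close(&iter);
  sparkey_logreader_close(&reader);
  return 0;
}

Compile it the same way as the test cases in the issues below, for example: gcc example.c -o example -lsparkey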

License

Apache License, Version 2.0

Design

Sparkey uses two files on disk to store its data. The first is the sparkey log file (.spl), which is simply a sequence of key value pairs. This is an append-only file. You're not allowed to modify it in the middle, and you can't use more than one writer to append to it.

The other file is the sparkey index file (.spi), which is just a hashtable pointing at entries in the log. This is an immutable file, so you would typically only update it once you're done with your bulk appends.

Doing a random lookup involves first finding the proper entry in the hashtable, and then seeking to the right offset in the log file. On average, this means two disk seeks per access for a cold disk cache. If you mlock the index file, it goes down to one seek. For some of our use cases, the total data set is smaller than the available RAM, so it makes sense to mlock everything.
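As an aside on the mlock point: pinning the index file into RAM happens outside of Sparkey, with plain POSIX calls. A minimal sketch (a hypothetical helper, not part of the Sparkey API; the file name at the end is made up) could look like this:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and lock its pages in RAM, so lookups in it never
 * have to touch the disk. Returns the mapping, or NULL on failure. */
static void *pin_file(const char *path, size_t *len_out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) { perror("open"); return NULL; }
  struct stat st;
  if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return NULL; }
  void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd); /* the mapping stays valid after the descriptor is closed */
  if (p == MAP_FAILED) { perror("mmap"); return NULL; }
  if (mlock(p, st.st_size) != 0)
    perror("mlock"); /* may require raising RLIMIT_MEMLOCK */
  *len_out = (size_t) st.st_size;
  return p;
}

/* Usage (hypothetical file name): size_t len; pin_file("data.spi", &len); */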

The advantage of having two files instead of just one (another solution would be to append the hash table to the end of the log) is that it's trivial to mlock one of the files and not the other. It also lets us append more data to an existing log file, even after it's already in use.

History

Sparkey is the product of hackdays at Spotify, where our developers get to spend some of their time on anything they think is interesting.

We have several use cases where we need to serve large amounts of static data with high throughput and low latency. To do this, we've built our own services, backed by various storage systems. Our flow consists of first generating large static storage files in our offline systems, which then get pushed out to the user-facing services that serve the data.

The storage solutions we used for that have all served us well for a time, but they had limitations that became problematic.

  • We used to rely a lot on CDB (which is a really great piece of software). It performed blazingly fast and produced compact files. We only stopped using it when our data started growing close to its 4 GB limit.
  • We also used (and still use) Tokyo Cabinet for a bunch of use cases. It performs really well for reading, but write throughput suffers badly once you can no longer keep the entire dataset in memory, and there were issues with opening the same file multiple times from the same process.

We needed a key-value store with the following characteristics:

  • random read throughput comparable to Tokyo Cabinet and CDB.
  • high throughput bulk writes.
  • low overhead.
  • high limit on data size.

We started hacking on a new key-value store during those hackdays, and the result is this project.

Performance

A very simple benchmark program is included - see src/bench.c. The program is designed to be easily extended to measure other key-value stores if anyone wants to. Running it on a production-like server (Intel(R) Xeon(R) CPU L5630 @ 2.13GHz) gives the following results:

Testing bulk insert of 1000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     0.00
    creation time (cpu):      0.00
    throughput (puts/cpusec): 1098272.88
    file size:                28384
    lookup time (wall):          0.50
    lookup time (cpu):           0.58
    throughput (lookups/cpusec): 1724692.62

Testing bulk insert of 1000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     0.50
    creation time (cpu):      0.69
    throughput (puts/cpusec): 1448618.25
    file size:                34177984
    lookup time (wall):          1.00
    lookup time (cpu):           0.78
    throughput (lookups/cpusec): 1284477.75

Testing bulk insert of 10.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     7.50
    creation time (cpu):      7.73
    throughput (puts/cpusec): 1294209.62
    file size:                413777988
    lookup time (wall):          1.00
    lookup time (cpu):           0.99
    throughput (lookups/cpusec): 1014608.94

Testing bulk insert of 100.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey
    creation time (wall):     82.00
    creation time (cpu):      81.58
    throughput (puts/cpusec): 1225726.75
    file size:                4337777988
    lookup time (wall):          2.00
    lookup time (cpu):           1.98
    throughput (lookups/cpusec): 503818.84

Testing bulk insert of 1000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     0.00
    creation time (cpu):      0.00
    throughput (puts/cpusec): 1101445.38
    file size:                19085
    lookup time (wall):          3.50
    lookup time (cpu):           3.30
    throughput (lookups/cpusec): 303335.78

Testing bulk insert of 1000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     0.50
    creation time (cpu):      0.75
    throughput (puts/cpusec): 1333903.25
    file size:                19168683
    lookup time (wall):          3.00
    lookup time (cpu):           2.91
    throughput (lookups/cpusec): 343833.28

Testing bulk insert of 10.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     8.50
    creation time (cpu):      8.50
    throughput (puts/cpusec): 1176634.25
    file size:                311872187
    lookup time (wall):          3.00
    lookup time (cpu):           2.99
    throughput (lookups/cpusec): 334490.22

Testing bulk insert of 100.000.000 elements and 1000.000 random lookups
  Candidate: Sparkey compressed(1024)
    creation time (wall):     90.50
    creation time (cpu):      90.46
    throughput (puts/cpusec): 1105412.00
    file size:                3162865465
    lookup time (wall):          3.50
    lookup time (cpu):           3.60
    throughput (lookups/cpusec): 277477.41

File format details

Log file format

The log file starts with a constant-size header describing some metadata about the log file. After that comes a sequence of entries, where each entry consists of a type, a key and a value.

Each entry begins with two Variable Length Quantity (VLQ) non-negative integers, A and B. The type is determined by A: if A = 0, the entry is a DELETE and B is the length of the key to delete. If A > 0, the entry is a PUT, the key length is A - 1, and the value length is B.

(It gets slightly more complex if block level compression is used, but we'll ignore that for now.)
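To make the entry header concrete, here is a decoding sketch. The exact on-disk VLQ byte layout is an assumption of the sketch (7 payload bits per byte, least significant group first, high bit as continuation marker); consult the source for the authoritative encoding.

#include <stdint.h>

/* Assumed VLQ layout: 7 payload bits per byte, least significant group
 * first, high bit set on every byte except the last. */
static uint64_t read_vlq(const uint8_t **p) {
  uint64_t value = 0;
  int shift = 0;
  uint8_t byte;
  do {
    byte = *(*p)++;
    value |= (uint64_t) (byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  return value;
}

/* Interpret the two leading VLQ numbers A and B of an uncompressed entry,
 * exactly as described above. */
typedef struct {
  int is_delete;
  uint64_t keylen;
  uint64_t valuelen;
} entry_header;

static entry_header read_entry_header(const uint8_t **p) {
  entry_header h;
  uint64_t a = read_vlq(p);
  uint64_t b = read_vlq(p);
  if (a == 0) {          /* A == 0: DELETE, B is the length of the key */
    h.is_delete = 1;
    h.keylen = b;
    h.valuelen = 0;
  } else {               /* A > 0: PUT, key length is A - 1, value length is B */
    h.is_delete = 0;
    h.keylen = a - 1;
    h.valuelen = b;
  }
  return h;
}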

Hash file format

The contents of the hash file start with a constant-size header, similar to the log file. The rest of the file is a hash table, represented as capacity * slotsize bytes. The capacity is simply an upper bound on the number of live entries multiplied by a hash density factor > 1.0; the default implementation uses a density factor of 1.3.

Each slot consists of two parts: the hash value and the address. The size of the hash value is either 4 or 8 bytes, depending on the hash algorithm - murmurhash32 is used if the number of entries is small, and a 64-bit truncation of murmurhash128 if it is large. The address is simply a reference into the log file, again either 4 or 8 bytes depending on the size of the log file. That means the slotsize is usually 16 bytes for any reasonably large set of entries. By storing the hash value itself in each slot we waste some space, but in return we can expect to avoid visiting the log file in most cases.
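As an illustration of the common large-file case (8-byte hash plus 8-byte address), a slot can be pictured like this; the struct is a sketch for explanation, not a definition taken from the source:

#include <stdint.h>

/* One 16-byte slot in the large-file case: the (truncated) 64-bit hash of
 * the key plus an 8-byte offset into the log file. Smaller files use
 * 4-byte fields for one or both parts instead. */
typedef struct {
  uint64_t hash;       /* stored so that most misses never touch the log file */
  uint64_t log_offset; /* where the corresponding entry starts in the .spl file */
} hash_slot;

/* The table that follows the header occupies capacity * sizeof(hash_slot)
 * bytes, where capacity = number of live entries * 1.3 (the default density). */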

Hash lookup algorithm

One of the few non-trivial parts of Sparkey is the way it does hash lookups. With hashtables there is always a risk of collisions: even if the hash values themselves don't collide, the assigned slots may.

(It recently came to my attention that the method described below is basically the same thing as Robin Hood hashing with backward shift deletion)

Let's define displacement as the distance from the calculated optimal slot for a given hash to the slot it's actually placed in. Distance in this case is defined as the number of steps you need to move forward from your optimal slot to reach the actual slot.

The naive solution is to start with an empty hash table, go through the entries and put each one in the first available slot, starting from its optimal slot. That is almost what we do.

If we consider the average displacement, we can't really do better than that. We can however minimize the maximum displacement, which gives us some nice properties:

  • We can store the maximum displacement in the header, so we have an upper bound on traversals. We could possibly even use this information to binary search for the entry.
  • As soon as we reach an entry with higher displacement than the thing we're looking for, we can abort the lookup.

Setting up the hash table like this is easy; we just need to insert into slots instead of appending. As soon as we reach a slot whose entry has a smaller displacement than our own, we shift that entry and all following entries (up to the first empty slot) forward one step and insert our own element in its place.
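In rough C over the illustrative slot layout from the previous section (treating a zero log offset as an empty slot, which is an assumption of the sketch, not a statement about the file format), the insertion step looks like this; the worked example that follows traces the same procedure by hand.

/* Illustrative sketch of the insertion described above; reuses the
 * hash_slot illustration from the "Hash file format" section. */
static uint64_t displacement_of(uint64_t optimal, uint64_t slot, uint64_t capacity) {
  /* Forward distance from the optimal slot to the actual slot, with wrap-around. */
  return (slot + capacity - optimal) % capacity;
}

static void insert_slot(hash_slot *table, uint64_t capacity,
                        uint64_t hash, uint64_t log_offset) {
  uint64_t slot = hash % capacity; /* optimal slot */
  uint64_t disp = 0;

  for (;;) {
    if (table[slot].log_offset == 0) { /* empty slot: just place the entry */
      table[slot].hash = hash;
      table[slot].log_offset = log_offset;
      return;
    }
    uint64_t occupant_disp =
        displacement_of(table[slot].hash % capacity, slot, capacity);
    if (occupant_disp < disp) {
      /* The occupant is closer to its optimal slot than we are: shift it and
       * every following entry (up to the first empty slot) forward one step,
       * then claim this slot. */
      uint64_t empty = slot;
      while (table[empty].log_offset != 0)
        empty = (empty + 1) % capacity;
      while (empty != slot) {
        uint64_t prev = (empty + capacity - 1) % capacity;
        table[empty] = table[prev];
        empty = prev;
      }
      table[slot].hash = hash;
      table[slot].log_offset = log_offset;
      return;
    }
    slot = (slot + 1) % capacity; /* keep probing forward */
    disp++;
  }
}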

Let's illustrate it with an example - let's start off with an empty hash table with a capacity of 7:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 |            |            |
       +------------+------------+
slot 4 |            |            |
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

We add the key "key0" which happens to end up in slot 3, h("key0") % 7 == 3. The slot is empty, so this is trivial:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 |            |            |
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Now we add "key1" which happens to end up in slot 4:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key1)    | 11         |  4               0
       +------------+------------+
slot 5 |            |            |
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Now we add "key2" which also wants to be in slot 3. This is a conflict, so we skip forward until we found a slot which has a lower displacement than our current displacement. When we find that slot, all following entries until the next empty slot move down one step:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key1)    | 11         |  4               1
       +------------+------------+
slot 6 |            |            |
       +------------+------------+

Let's add "key3" which maps to slot 5. We can't push down key1, because it already has displacement 1 and our current displacement for key3 is 0, so we have to move forward:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 |            |            |
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key1)    | 11         |  4               1
       +------------+------------+
slot 6 | h(key3)    | 31         |  5               1
       +------------+------------+

Adding "key4" for slot 3. It ends up in slot 5 with displacement 2 and key3 loops around to slot 0:

         hash value   log offset    optimal slot    displacement
       +------------+------------+
slot 0 | h(key3)    | 31         |  5               2
       +------------+------------+
slot 1 |            |            |
       +------------+------------+
slot 2 |            |            |
       +------------+------------+
slot 3 | h(key0)    | 1          |  3               0
       +------------+------------+
slot 4 | h(key2)    | 21         |  3               1
       +------------+------------+
slot 5 | h(key4)    | 41         |  3               2
       +------------+------------+
slot 6 | h(key1)    | 11         |  4               2
       +------------+------------+

Now, if we search for key123 which maps to slot 3 (but doesn't exist!), we can stop scanning as soon as we reach slot 6, because then the current displacement (3) is higher than the displacement of the entry at the current slot (2).
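The corresponding lookup, with the early abort, looks roughly like this (same illustrative slot layout and displacement_of helper as the insertion sketch; a real lookup also has to compare the key against the log entry, since different keys can share a hash value).

/* Returns the log offset of a candidate entry, or 0 when the key is
 * certainly absent (using the same "zero offset means empty" assumption). */
static uint64_t lookup_slot(const hash_slot *table, uint64_t capacity, uint64_t hash) {
  uint64_t slot = hash % capacity; /* optimal slot */
  uint64_t disp = 0;

  for (;;) {
    if (table[slot].log_offset == 0)
      return 0;                      /* empty slot: the key was never inserted */
    uint64_t occupant_disp =
        displacement_of(table[slot].hash % capacity, slot, capacity);
    if (occupant_disp < disp)
      return 0;                      /* early abort: our key would have been placed here already */
    if (table[slot].hash == hash)
      return table[slot].log_offset; /* candidate hit: verify against the log */
    slot = (slot + 1) % capacity;
    disp++;
  }
}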

Compression

Sparkey also supports block-level compression using Google Snappy. You select a block size, which is used to split the contents of the log into blocks, and each block is compressed independently with Snappy. This can be useful if your bottleneck is file size and there is a lot of redundant data across adjacent entries. The downside is that during lookups at least one block needs to be decompressed. The larger the blocks, the better compression you may get, but also the higher the lookup cost. This is a tradeoff that needs to be evaluated empirically for each use case.
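The block size is passed when the log writer is created. The sketch below assumes SPARKEY_COMPRESSION_SNAPPY as the compressed counterpart of the SPARKEY_COMPRESSION_NONE constant used in the test cases further down this page, and uses the 1024-byte block size from the benchmark above.

#include <stdint.h>
#include <sparkey/sparkey.h>

/* Sketch: write one entry to a Snappy-compressed log with 1024-byte blocks. */
static int write_compressed_log(const char *path) {
  sparkey_logwriter *writer = NULL;
  if (sparkey_logwriter_create(&writer, path, SPARKEY_COMPRESSION_SNAPPY, 1024)
      != SPARKEY_SUCCESS)
    return -1;
  if (sparkey_logwriter_put(writer, 3, (uint8_t *) "key", 5, (uint8_t *) "value")
      != SPARKEY_SUCCESS)
    return -1;
  return sparkey_logwriter_close(&writer) == SPARKEY_SUCCESS ? 0 : -1;
}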

sparkey's People

Contributors

kirang89, noj, nresare, rohansingh, rschildmeijer, spkrka, thedrow, tiegz, vasi, wbolster

sparkey's Issues

OS X automake failure.

Hello,

Thanks for the very interesting library. However, "clock_gettime" is not available on OS X, so configuration fails on that system. I think it is missing on Windows too. The "clock_gettime" API could be emulated using gettimeofday or mach_absolute_time. Maybe building the benchmark app could be made optional?

I'd also suggest moving this project to CMake, as this would simplify the "configure" step and make integration into other projects easier.

make failure

Under Ubuntu 12.04 + gcc 4.6.3.
When I ran make, I got the following message:

bench.c:253:3: error: format ‘%ld’ expects argument of type ‘long int’, but argument 2 has type ‘size_t’ [-Werror=format]

Changing

printf("    file size:                %ld\n", total_file_size(c->files()));

to

printf("    file size:                %ld\n", (long int) total_file_size(c->files()));

fixes it.

libsparkey.so.0: cannot open shared object file

I just installed sparkey (commit ec80630) on my local machine (Ubuntu 12.04) as described in the README file:

$ git clone git@github.com:spotify/sparkey.git
$ cd sparkey
$ autoreconf --install
$ ./configure
$ make
$ make check
$ sudo make install

But when I try to execute sparkey, I get the following error:

$ sparkey
sparkey: error while loading shared libraries: libsparkey.so.0:
  cannot open shared object file: No such file or directory

Can someone please tell me where sparkey is searching for the file? It is available in /usr/local/lib:

$ ls -la /usr/local/lib/ | grep sparkey
-rw-r--r-- 1 root root  79842 Sep  3 16:02 libsparkey.a
-rwxr-xr-x 1 root root    974 Sep  3 16:02 libsparkey.la
lrwxrwxrwx 1 root root     19 Sep  3 16:02 libsparkey.so -> libsparkey.so.0.0.0
lrwxrwxrwx 1 root root     19 Sep  3 16:02 libsparkey.so.0 -> libsparkey.so.0.0.0
-rwxr-xr-x 1 root root  50900 Sep  3 16:02 libsparkey.so.0.0.0

I also tried to copy/symlink the files to /usr/lib but this didn't work.

Non-exact seeks?

Assume I have two keys in my store: a and c. I can get exact keys, but would it be possible to seek to the next match?

Example:

sparkey_hash_seek(myreader, (uint8_t*)"a", 1, myiter); // exact match, same as sparkey_hash_get
sparkey_hash_seek(myreader, (uint8_t*)"b", 1, myiter); // cannot find "b", so advance to next live key: "c" 

I am not sure if/how this could work with the slot traversal algorithm, but this is quite a useful feature when e.g. looking up ranges, etc.

Thanks

Does sparkey support opening a store for writing subsequent times?

After creating an index file with a SparkeyWriter and closing it, am I able to open that same index file again and append new data to it?
My source data comes from many large files. I read the large files one by one and write to several different index files; each new large file should append data to the existing index files.

For example, file1:

absssss,1111
acsssss,1222

When indexing the original data, I plan to route by the first two characters of the key: all keys with the prefix ab are grouped into the ab.spi file.

file2:

abssss,23444
cdssss,34444

In the end there are 3 index files: ab.spi, ac.spi and cd.spi. When I query abssss I only have to query ab.spi, and when I query cdssss I only query cd.spi.

This is essentially query routing, to reduce each file's query load. Our data has nearly 100 billion entries; that's why I want to split it across different index files.

One way is to keep the SparkeyWriters in memory in a map, like

ab->SparkeyWriter1
ac->SparkeyWriter2
cd->SparkeyWriter3

and read all the original files just once: read a line, take the first two characters of the key, get the corresponding SparkeyWriter from the map, and write the data to that index.

But keeping all the SparkeyWriters open until all the work is done seems too slow, and memory may be insufficient. So I want to know whether sparkey supports opening a store for writing subsequent times.

When I checked the PalDB project by LinkedIn, it says:

Can you open a store for writing subsequent times?  
No, the final binary file is created when StoreWriter.close() is called.

So: does sparkey support this feature?
Or maybe I don't need to split the index files at all: would a single sparkey store with all 100 billion entries still support sufficiently fast queries?

Tag a Release

Hey @spkrka,

I know that you've not worked on this repo for a while and are working on a full Java version now, but I was wondering if you could tag a release at whatever latest point you feel is reasonable. I'd like to use this as a dependency for a project, but I'm hesitant (out of habit) to simply refer to HEAD or a specific commit hash.

Thanks!

File descriptor leak when hash/log files are corrupt or not matching

In our case we had hash/log files that were not corrupt but had different file_identifier fields due to problems in a writing pipeline. The result is sparkey_hash_open() will (correctly) return -305 (SPARKEY_FILE_IDENTIFIER_MISMATCH), but the log resources will not be cleaned up, because:

  • in sparkey_hash_open, reader->open_status is not set when there's a file identifier mismatch (ditto any of the checks sparkey_hash_open does)
  • sparkey_hash_close does nothing if reader->open_status is not set

The fallout is that munmap and close are never called on the log file, leading to leaked file descriptors and the OS refusing to reclaim those bytes from the filesystem until the process is killed.

I intend to fix this problem, but figured I'd document it first!

sparkey_logiter_reset doesn't reset

I may have misinterpreted the documentation, but it doesn't look like sparkey_logiter_reset actually resets the iterator.

Test case:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sparkey/sparkey.h>
#include <assert.h>

#define SPARKEY_ASSERT(rc) ({  \
  if (SPARKEY_SUCCESS != rc) { \
    fprintf(                   \
        stderr                 \
      , "error: %s (%d)\n"     \
      , sparkey_errstring(rc)  \
      , __LINE__               \
    );                         \
    exit(1);                   \
  }                            \
});

int
main(void) {
  sparkey_logwriter *writer = NULL;
  sparkey_logreader *reader = NULL;
  sparkey_logiter *iterator = NULL;
  const char *key1 = "key1";
  const char *value1 = "value1";
  size_t key1size = strlen(key1);
  size_t value1size = strlen(value1);
  const char *key2 = "key2";
  const char *value2 = "value2";
  size_t key2size = strlen(key2);
  size_t value2size = strlen(value2);
  uint64_t wanted;
  uint64_t actual;
  uint8_t *buffer = NULL;

  // create a log
  SPARKEY_ASSERT(sparkey_logwriter_create(
      &writer
    , "test.spl"
    , SPARKEY_COMPRESSION_NONE
    , 0
  ));

  // write some stuff
  SPARKEY_ASSERT(sparkey_logwriter_put(
      writer
    , key1size
    , (uint8_t *) key1
    , value1size
    , (uint8_t *) value1
  ));
  SPARKEY_ASSERT(sparkey_logwriter_put(
      writer
    , key2size
    , (uint8_t *) key2
    , value2size
    , (uint8_t *) value2
  ));

  SPARKEY_ASSERT(sparkey_logwriter_close(&writer));

  SPARKEY_ASSERT(sparkey_logreader_open(&reader, "test.spl"));
  SPARKEY_ASSERT(sparkey_logiter_create(&iterator, reader));

  // get first key
  SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
  wanted = sparkey_logiter_keylen(iterator);
  assert((buffer = malloc(wanted)));
  SPARKEY_ASSERT(sparkey_logiter_fill_key(
      iterator
    , reader
    , wanted
    , buffer
    , &actual
  ));

  printf("buffer: %s\n", buffer);
  assert(0 == strcmp("key1", (char *) buffer));
  free(buffer);

  // reset iterator
  SPARKEY_ASSERT(sparkey_logiter_reset(iterator, reader));

  // get key again
  SPARKEY_ASSERT(sparkey_logiter_next(iterator, reader));
  wanted = sparkey_logiter_keylen(iterator);
  assert((buffer = malloc(wanted)));
  SPARKEY_ASSERT(sparkey_logiter_fill_key(
      iterator
    , reader
    , wanted
    , buffer
    , &actual
  ));

  printf("buffer: %s (after reset)\n", buffer);
  assert(0 == strcmp("key1", (char *) buffer));
  free(buffer);

  // cleanup
  sparkey_logiter_close(&iterator);
  sparkey_logreader_close(&reader);

  return 0;
}

Yields:

$ gcc -lsparkey reset.c -o reset -Wall -Wextra
$ ./reset
buffer: key1
buffer: key2 (after reset)
Assertion failed: (0 == strcmp("key1", (char *) buffer)), function main, file reset.c, line 98.
Abort trap: 6

Improve hash algorithm description in README

I downloaded the source code and read it today, but I still have some difficulty understanding how the hash algorithm works. Could you add a more detailed description of the algorithm and its design to help us learn it faster? Thanks a lot.

Problem creating hash indices larger than available RAM

Currently, writing the hash requires a malloc of the entire index size. If this is larger than the available memory, the operation fails.

We have a use case requiring more than 1.65 billion entries, which results in a 32 GB+ hash index. This is larger than the available memory on the machine that builds it.

We could replace the malloc with mmap-ing the file itself, but that would lead to a lot of random reads and writes from disk, which would be horribly slow.

I have an idea to reduce this limitation, but no time to do it at the moment.

  1. Write all hash entries to some sort of log file (64 bit hash value, 64 bit data offset).
  2. Calculate hash capacity (num_entries * factor)
  3. Sort the entries in the file by hash value % capacity. (This can be done efficiently without having everything in RAM)
  4. Mmap the large index file
  5. Iterate over the sorted hash entries and write them to the mmapped index. This should be efficient since the mmapped writes will happen with good locality and in sequential order. Except for the last values which may wrap around to the beginning of the file, but that's a small part of the data.

For sorting the large set of entries, split it into reasonably small chunks and sort those in memory. Then run a tree of sequential file merges.
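As an illustration of step 3, the in-memory part of the chunked sort could be as simple as ordering the (hash value, log offset) records by their target slot before the sequential merges. The sketch below uses made-up names and omits the chunking and merging.

#include <stdint.h>
#include <stdlib.h>

/* Sketch of step 3's in-memory phase: each record is the (hash value,
 * log offset) pair written in step 1; a chunk is sorted by target slot
 * (hash % capacity) before the on-disk merges. Names are illustrative. */
typedef struct {
  uint64_t hash;
  uint64_t offset;
} hash_record;

static uint64_t capacity; /* num_entries * factor, from step 2 */

static int by_target_slot(const void *a, const void *b) {
  uint64_t sa = ((const hash_record *) a)->hash % capacity;
  uint64_t sb = ((const hash_record *) b)->hash % capacity;
  return (sa > sb) - (sa < sb);
}

static void sort_chunk(hash_record *records, size_t n) {
  qsort(records, n, sizeof(hash_record), by_target_slot);
}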

REPL

Thoughts on adding a REPL to sparkey(1)? I'd be happy to put it together if you're interested.

sparkey_hash_write returns successfully when given an invalid hash size

I know this is a bit edge-casey, but when clobbering an existing hash and providing an invalid hash size, sparkey_hash_write incorrectly returns SPARKEY_SUCCESS.

Test case:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sparkey/sparkey.h>
#include <assert.h>

#define check(rc) ({  \
  if (SPARKEY_SUCCESS != rc) { \
    fprintf(stderr, "error: %s (L#%d)\n", sparkey_errstring(rc), __LINE__); \
    exit(1); \
  } \
});

int
main(void) {
  sparkey_logwriter *writer = NULL;
  check(sparkey_logwriter_create(&writer, "test.spl", SPARKEY_COMPRESSION_NONE, 0));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key1", 7, (uint8_t *) "value1"));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key2", 7, (uint8_t *) "value2"));
  check(sparkey_logwriter_put(writer, 5, (uint8_t *) "key3", 7, (uint8_t *) "value3"));

  // write a hash with size 4
  check(sparkey_hash_write("test.spi", "test.spl", 4));
  // write a hash with invalid size
  check(sparkey_hash_write("test.spi", "test.spl", 3456789));

  assert(0 && "wrote test.spi with an invalid size :(");

  return 0;
}

Converting multiple log-files into a hash-file in bulk

I'm looking to do parallel bulk writes.

Since log-files can only be written sequentially, a solution might be to write a log-file per concurrent process and then merge them. Merging the log-files into one can be done quite efficiently, but it might make more sense to allow converting multiple log-files into a hash-file directly at sparkey_hash_write.

It would save the extra work of merging, which can only be done with limited parallelism, plus it would for instance allow spreading the log files across multiple disks to improve write speeds (and read speeds when building the hash table).

If more than one log file were passed to sparkey_hash_write, might that also allow doing part of the process in parallel? At least reading the log files and (re-)hashing the keys.

I'd like to hear the maintainer's opinion on this.

EDIT: I see that the hash table doesn't contain the key/value data but only addresses into the log file. That means that multiple log files (on multiple disks) might also improve general read performance.

Comparison to NuDB?

I've been polishing up my NuDB key/value database library, which we're using in rippled (https://github.com/ripple/rippled), so I'm looking at other key-value systems for comparison. We were very unhappy with the performance of RocksDB for our use case (which admittedly is rather unusual). Interestingly, sparkey comes the closest in terms of implementation to NuDB. They both use an append-only data file, and a key file which implements an on-disk hash table. One difference is that NuDB uses linear hashing so that buckets reach 50% occupancy. When a bucket gets full, it is "spilled" over into the data file (that bucket then becomes a linked list of buckets).

I'd be very interested in your feedback on NuDB's design in comparison to sparkey. I'd like to hear about problems you encountered with other key value stores and the data access pattern of your software. I'm curious to know how NuDB compares for your use case, or if there are any techniques in sparkey that I could adopt. The repo is here:
https://github.com/vinniefalco/NuDB

Thanks!
