Comments (4)

spkrka commented on July 21, 2024

It sounds like a good idea, but it would require changes to the hash file format so it is not a trivial change to make.
I haven't thought this fully through yet, so this may not be 100% correct.

Currently the hash header contains:

uint32_t file_identifier;
uint64_t data_end;

That would need to be extended to support multiple identifiers and data_end values, one for each log file. That would also make the header size dynamic, which would likely have other side effects on the code base.
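
As a rough sketch, a multi-log header could look something like this (all names here are illustrative, not the actual sparkey format):

#include <stdint.h>

/* Hypothetical extended header: one (identifier, data_end) pair per
 * log file. num_log_files is what makes the header size dynamic. */
typedef struct {
  uint32_t num_log_files;
  struct {
    uint32_t file_identifier;
    uint64_t data_end;
  } log_files[];  /* flexible array member, one entry per log file */
} multi_log_header;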

Furthermore, each hash entry would need to be extended to also reference which log file the entry is found in.
This would have an impact on the maximum log file size.
Fortunately the limit now is fairly large - 64 bits of addressing space, minus some bits if block compression is enabled.

As an example, if the block with the most entries has 1000 entries, that would imply 10 bits for the entry index.
If we have 1000 log files, that would also imply 10 bits for the file index.
That would leave 64 - 10 - 10 = 44 bits for addressing within each log file, which is about 17 terabytes (2^44 bytes), so it's probably ok.
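
To make that arithmetic concrete, here is a minimal sketch of packing such an entry into 64 bits, using the example widths above (the bit widths and names are illustrative, not a real format):

#include <stdint.h>

#define ENTRY_BITS  10  /* entry index within a compressed block */
#define FILE_BITS   10  /* which log file */
#define OFFSET_BITS (64 - ENTRY_BITS - FILE_BITS)  /* 44 bits left */

/* 2^44 bytes is about 17.6 terabytes of addressable space per log file. */
static uint64_t pack_address(uint64_t entry_index, uint64_t file_index,
                             uint64_t offset) {
  return (entry_index << (FILE_BITS + OFFSET_BITS))
       | (file_index << OFFSET_BITS)
       | (offset & ((UINT64_C(1) << OFFSET_BITS) - 1));
}

static void unpack_address(uint64_t packed, uint64_t *entry_index,
                           uint64_t *file_index, uint64_t *offset) {
  *offset      = packed & ((UINT64_C(1) << OFFSET_BITS) - 1);
  *file_index  = (packed >> OFFSET_BITS) & ((UINT64_C(1) << FILE_BITS) - 1);
  *entry_index = packed >> (FILE_BITS + OFFSET_BITS);
}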

Since you mentioned having log files on separate disks, you would also need some way for the hash file to find the log files. Currently you either specify both files explicitly, or the log file is implicitly assumed to be in the same directory, following a specific naming convention.

All in all, I don't think this is a trivial change to make, but it could work and it could be useful.
I don't have time myself to work on it, but I could likely find time to review PRs unless they grow too complicated.


oersted commented on July 21, 2024

Thank you for considering the proposal seriously. I might look into doing a PR, but I managed to implement an alternative solution that is good enough for now (using lmdb), so it's no longer a priority.


spkrka commented on July 21, 2024

An alternative solution that would not require any code changes would be to manually shard the data into multiple sparkey files.
Use a custom hash to map each key-value pair to the correct shard, and use the same logic for both writing and reading.

We actually do exactly this in some of our data processing workflows:
https://github.com/spotify/scio/blob/main/scio-extra/src/main/scala/com/spotify/scio/extra/sparkey/instances/ShardedSparkeyReader.scala
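
For a homegrown C version, the key piece is a deterministic shard function shared by the write and read paths. A minimal sketch (FNV-1a is just a placeholder hash, and NUM_SHARDS is arbitrary):

#include <stddef.h>
#include <stdint.h>

#define NUM_SHARDS 16

/* FNV-1a; any stable hash works, as long as both paths use the same one. */
static uint64_t fnv1a(const uint8_t *key, size_t len) {
  uint64_t h = UINT64_C(14695981039346656037);
  for (size_t i = 0; i < len; i++) {
    h ^= key[i];
    h *= UINT64_C(1099511628211);
  }
  return h;
}

static unsigned shard_for_key(const uint8_t *key, size_t len) {
  return (unsigned)(fnv1a(key, len) % NUM_SHARDS);
}

The writer keeps one sparkey writer per shard and routes each put through shard_for_key; the reader opens one reader per shard and routes each get the same way.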


oersted commented on July 21, 2024

It's a good point, thanks for the contribution. I also built such an architecture with RocksDB a while back.

The sharded version you are describing works well as long as you pre-partition the data by hash-mod, so that each process writes to a single shard. This can be quite costly for some workloads, and may be imbalanced if the distribution of keys is not uniform. I suppose you could always use locking to write to all shards from all processes without pre-partitioning; with enough shards and a well-balanced key space, it could be performant.

We also had a similar alternative where each key could be in any of the shards. Each process writes all its key-value pairs to a single shard, and reads then need to query every shard for each get. This clearly sacrifices read speed, but it is quite simple and has very good write speed and flexibility.
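
A minimal sketch of the read path for that variant, assuming some per-shard lookup function (the lookup callback here is a stand-in, not a real sparkey call):

#include <stddef.h>
#include <stdint.h>

/* Returns 0 and fills *value/*valuelen if the key is in the shard. */
typedef int (*shard_lookup_fn)(unsigned shard, const uint8_t *key,
                               size_t keylen, uint8_t **value,
                               size_t *valuelen);

static int get_any_shard(unsigned num_shards, shard_lookup_fn lookup,
                         const uint8_t *key, size_t keylen,
                         uint8_t **value, size_t *valuelen) {
  /* Keys can live in any shard, so probe them all until one hits. */
  for (unsigned s = 0; s < num_shards; s++) {
    if (lookup(s, key, keylen, value, valuelen) == 0)
      return 0;  /* found */
  }
  return -1;  /* not present in any shard */
}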

In the case of sparkey, I think that creating one log-file per process and then concatenating them before creating the hash-file is the best solution. The concatenation can be done in a single process or, if necessary, parallelized as in a parallel merge-sort or LSM tree compaction. The proposed change of supporting multiple files wouldn't require this concatenation work, but I understand it's not trivial.
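
A very rough single-process sketch of the concatenation step, assuming you can skip each source log's header and byte-append its entries (the real sparkey log header bookkeeping, e.g. updating data_end, is deliberately omitted; header_size is something the caller would have to know):

#include <stdio.h>

/* Append the entry bytes of one per-process log onto dst, skipping
 * header_size bytes at the start of src. */
static int append_log(FILE *dst, const char *src_path, long header_size) {
  FILE *src = fopen(src_path, "rb");
  if (!src) return -1;
  if (fseek(src, header_size, SEEK_SET) != 0) { fclose(src); return -1; }
  char buf[1 << 16];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
    if (fwrite(buf, 1, n, dst) != n) { fclose(src); return -1; }
  }
  int err = ferror(src);
  fclose(src);
  return err ? -1 : 0;
}

After appending all the per-process logs, a single hash file would be written over the combined log.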

