Comments (4)

spkrka commented on July 21, 2024

It sounds like a good idea, but it would require changes to the hash file format so it is not a trivial change to make.
I haven't thought this fully through yet, so this may not be 100% correct.

Currently the hash header contains:

uint32_t file_identifier;
uint64_t data_end;

That would need to be extended to support multiple identifiers and data_end values, one for each log file. That would also make the header size dynamic, which would likely have other side effects on the code base.
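
As a rough sketch, a multi-log header could look something like this (all names here are illustrative, not the actual sparkey format):

#include <stdint.h>

/* Hypothetical extended header: one (identifier, data_end) pair per
 * log file. num_log_files is what makes the header size dynamic. */
typedef struct {
  uint32_t num_log_files;
  struct {
    uint32_t file_identifier;
    uint64_t data_end;
  } log_files[];  /* flexible array member, one entry per log file */
} multi_log_header;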

Furthermore, each hash entry would need to be extended to also reference which log file the entry is found in.
This would have an impact on the maximum log file size.
Fortunately the limit now is fairly large - 64 bits of addressing space, minus some bits if block compression is enabled.

As an example, if the block with the most entries has 1000 entries, that would imply 10 bits for the entry index.
If we have 1000 log files, that would also imply 10 bits for the file index.
That would leave 64 - 10 - 10 = 44 bits for addressing within each log file, which is about 17 terabytes (2^44 bytes), so it's probably ok.
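
To make that arithmetic concrete, here is a minimal sketch of packing such an entry into 64 bits, using the example widths above (the bit widths and names are illustrative, not a real format):

#include <stdint.h>

#define ENTRY_BITS  10  /* entry index within a compressed block */
#define FILE_BITS   10  /* which log file */
#define OFFSET_BITS (64 - ENTRY_BITS - FILE_BITS)  /* 44 bits left */

/* 2^44 bytes is about 17.6 terabytes of addressable space per log file. */
static uint64_t pack_address(uint64_t entry_index, uint64_t file_index,
                             uint64_t offset) {
  return (entry_index << (FILE_BITS + OFFSET_BITS))
       | (file_index << OFFSET_BITS)
       | (offset & ((UINT64_C(1) << OFFSET_BITS) - 1));
}

static void unpack_address(uint64_t packed, uint64_t *entry_index,
                           uint64_t *file_index, uint64_t *offset) {
  *offset      = packed & ((UINT64_C(1) << OFFSET_BITS) - 1);
  *file_index  = (packed >> OFFSET_BITS) & ((UINT64_C(1) << FILE_BITS) - 1);
  *entry_index = packed >> (FILE_BITS + OFFSET_BITS);
}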

Since you mentioned having log files on separate disks, you would also need some way for the hash file to find the log files. Currently you either specify both files explicitly, or the log file is implicitly assumed to be in the same directory, following a specific naming convention.

All in all, I don't think this is a trivial change to make, but it could work and it could be useful.
I don't have time myself to work on it, but I could likely find time to review PRs unless they grow too complicated.


oersted commented on July 21, 2024

Thank you for considering the proposal seriously. I might look into doing a PR, but I managed to implement an alternative solution that is good enough for now (using lmdb), so it's no longer a priority.


spkrka commented on July 21, 2024

An alternative solution that would not require any code changes would be to manually shard the data into multiple sparkey files.
Use a custom hash to map each key-value pair to the correct shard, and use the same logic for both writing and reading.

We actually do exactly this in some of our data processing workflows:
https://github.com/spotify/scio/blob/main/scio-extra/src/main/scala/com/spotify/scio/extra/sparkey/instances/ShardedSparkeyReader.scala
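
For a homegrown C version, the key piece is a deterministic shard function shared by the write and read paths. A minimal sketch (FNV-1a is just a placeholder hash, and NUM_SHARDS is arbitrary):

#include <stddef.h>
#include <stdint.h>

#define NUM_SHARDS 16

/* FNV-1a; any stable hash works, as long as both paths use the same one. */
static uint64_t fnv1a(const uint8_t *key, size_t len) {
  uint64_t h = UINT64_C(14695981039346656037);
  for (size_t i = 0; i < len; i++) {
    h ^= key[i];
    h *= UINT64_C(1099511628211);
  }
  return h;
}

static unsigned shard_for_key(const uint8_t *key, size_t len) {
  return (unsigned)(fnv1a(key, len) % NUM_SHARDS);
}

The writer keeps one sparkey writer per shard and routes each put through shard_for_key; the reader opens one reader per shard and routes each get the same way.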


oersted commented on July 21, 2024

It's a good point, thanks for the contribution. I also built such an architecture with RocksDB a while back.

The sharded version you are describing works well as long as you pre-partition the data by hash-mod, so that each process writes to a single shard. This can be quite costly for some workloads, and may be imbalanced if the distribution of keys is not uniform. I suppose you could always use locking to write to all shards from all processes without pre-partitioning; with enough shards and a well-balanced key space, it could be performant.

We also had a similar alternative where each key could be in any of the shards. Each process writes all its key-value pairs to a single shard, and reads then need to query every shard for each get. This clearly sacrifices read speed, but it is quite simple and has very good write speed and flexibility.
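
A minimal sketch of the read path for that variant, assuming some per-shard lookup function (the lookup callback here is a stand-in, not a real sparkey call):

#include <stddef.h>
#include <stdint.h>

/* Returns 0 and fills *value/*valuelen if the key is in the shard. */
typedef int (*shard_lookup_fn)(unsigned shard, const uint8_t *key,
                               size_t keylen, uint8_t **value,
                               size_t *valuelen);

static int get_any_shard(unsigned num_shards, shard_lookup_fn lookup,
                         const uint8_t *key, size_t keylen,
                         uint8_t **value, size_t *valuelen) {
  /* Keys can live in any shard, so probe them all until one hits. */
  for (unsigned s = 0; s < num_shards; s++) {
    if (lookup(s, key, keylen, value, valuelen) == 0)
      return 0;  /* found */
  }
  return -1;  /* not present in any shard */
}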

In the case of sparkey, I think that creating one log-file per process and then concatenating them before creating the hash-file is the best solution. The concatenation can be done in a single process or, if necessary, parallelized as in a parallel merge-sort or LSM tree compaction. The proposed change of supporting multiple files wouldn't require this concatenation work, but I understand it's not trivial.
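
A very rough single-process sketch of the concatenation step, assuming you can skip each source log's header and byte-append its entries (the real sparkey log header bookkeeping, e.g. updating data_end, is deliberately omitted; header_size is something the caller would have to know):

#include <stdio.h>

/* Append the entry bytes of one per-process log onto dst, skipping
 * header_size bytes at the start of src. */
static int append_log(FILE *dst, const char *src_path, long header_size) {
  FILE *src = fopen(src_path, "rb");
  if (!src) return -1;
  if (fseek(src, header_size, SEEK_SET) != 0) { fclose(src); return -1; }
  char buf[1 << 16];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
    if (fwrite(buf, 1, n, dst) != n) { fclose(src); return -1; }
  }
  int err = ferror(src);
  fclose(src);
  return err ? -1 : 0;
}

After appending all the per-process logs, a single hash file would be written over the combined log.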

