utsaslab / pebblesdb
The PebblesDB write-optimized key-value store (SOSP 17)
License: BSD 3-Clause "New" or "Revised" License
I went through db_bench.cc and noticed that it ships with some code related to YCSB. What is the status of this part of the code? --benchmarks doesn't expose a ycsb option, and when I manually enable it, a core dump happens.
Also, db_bench.cc seems somewhat broken when I run the benchmark with its defaults:
$ ./db_bench
LevelDB: version 1.17
Date: Tue Apr 17 18:12:07 2018
CPU: 1 * Intel(R) Core(TM) i5-2435M CPU @ 2.40GHz
CPUCache: 3072 KB
Keys: 16 bytes each
Values: 1024 bytes each (512 bytes after compression)
Entries: 1000000
RawSize: 991.8 MB (estimated)
FileSize: 503.5 MB (estimated)
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
fillseq : 197.156 micros/op; 5.0 MB/s
fillsync : 4347.880 micros/op; 0.2 MB/s (1000 ops)
fillrandom : 234.785 micros/op; 4.2 MB/s
overwrite : 285.580 micros/op; 3.5 MB/s
readrandom : 61.021 micros/op; (1000000 of 1000000 found)
readrandom : 58.686 micros/op; (1000000 of 1000000 found)
readseq : 7.205 micros/op; 137.7 MB/s
readreverse : 16.426 micros/op; 60.4 MB/s
lt-db_bench: ./db/dbformat.h:111: leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&): Assertion `internal_key.size() >= 8' failed.
Aborted (core dumped)
I'm playing around with the source code and I'm curious: does the source code support write amplification calculation out of the box? If so, can you point to which part does the calculation?
Thanks!
Verified in:
What happened:
After experiencing a power failure while adding values to PebblesDB with the verify_checksums and paranoid_checks parameters set to true, the database gets corrupted. After applying the recovery method suggested in https://github.com/google/leveldb/blob/main/doc/index.md (using RepairDB), a value that was only partially persisted is present.
The root cause of the problem is that some writes to the log file exceed the size of a page in the page cache. This can result in a "torn write", where only part of the write's payload is persisted, because page-cache pages can be flushed out of order. There are several references about this problem:
This problem was already reported in leveldb google/leveldb#251 and does not exist in the latest release (1.23).
How to reproduce
This issue can be replicated using LazyFS, a file system capable of simulating power failures and the OS behavior mentioned above, i.e., file-system pages being persisted to disk out of order.
The main problem is a write to the file 000003.log
which is 12288 bytes long. LazyFS will persist portions (in sizes of 4096 bytes) of this write out of order and will crash, simulating a power failure.
To reproduce this problem, one can follow these steps (the mentioned files, such as write_test.cpp, are in this zip: pebblesdb_test.zip):
Assuming the mount directory is /home/pebblesdb/data and the root directory is /home/pebblesdb/data-r, add the following lines to the default configuration file (config/default.toml):
[[injection]]
type="split_write"
file="/home/pebblesdb/data-r/000003.log"
persist=[1,3]
parts=3
occurrence=4
These lines define a fault to be injected: a power failure will be simulated after a write to the /home/pebblesdb/data-r/000003.log file. Since this write is large (12288 bytes), it is split into 3 parts (each 4096 bytes), and only the first and third parts will be persisted. The occurrence parameter specifies that the fault applies to the fourth write issued to this file.
Start LazyFS with the following command:
./scripts/mount-lazyfs.sh -c config/default.toml -m /home/pebblesdb/data -r /home/pebblesdb/data-r -f
Compile and execute the write_test.cpp file, which adds 4 key-value pairs to PebblesDB; the third pair is the only one that exceeds the size of a page in the page cache.
Immediately after this step, PebblesDB will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued until the moment of the fault. Here is a simplified version of the log:
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '262144', 'off': '0'}
{'syscall': 'read', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '131072', 'off': '0'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '12288', 'off': '0'}
fusermount -uz /home/pebblesdb/data
Compile and execute the repair.cpp file, which recovers the database.
Compile and execute the read_test.cpp file, which reads and checks the values previously inserted. The value for the key k3 is only part of the initial value.
Note that when paranoid_checks and verify_checksums are set to false, PebblesDB does not fail on restart and discards the partial value of the key k3 (it reports that the key does not exist).
I'd like PebblesDB to be accessible from Python and installable via pip. Currently, PebblesDB works well in C++, and somewhat less well with Java (using the LevelDB JNI binding).
Implement Java JNI Wrapper so that Java applications can access PebblesDB.
PebblesDB crashes when running large key-value pairs under the YCSB benchmark. I have been trying to make it run, but it keeps crashing when I feed it KV pairs larger than a few tens of KB. If PebblesDB does not support large KV pairs, please mention that here; if it does, how can I work around this?
PebblesDB currently uses bloom filters. Change PebblesDB so that it uses SuRF filters from CMU (https://github.com/efficient/SuRF) instead. This should help both point-query and range-query performance.
Right now, PebblesDB uses a lot of memory for the TableCache (caching metadata) and for the bloom filters used for each sstable.
We want to add a command line option for PebblesDB which would limit the total amount of memory used by PebblesDB for the TableCache and bloom filters.
When allocating the specified amount of memory, preference should be given first to the table cache, and then to bloom filters for the upper levels (level 0, level 1).
Why not add all files at the level directly to guards_compaction_add_all_files, instead of taking the intersection of complete_guards and guards? I'm a little confused by this code, so I'd appreciate an explanation.
```cpp
int guard_index_iter = 0;
for (size_t i = 0; i < complete_guards.size(); i++) {
  GuardMetaData* cg = complete_guards[i];
  int guard_index = -1;
  Slice guard_key = cg->guard_key.user_key(), next_guard_key;
  if (i + 1 < complete_guards.size()) {
    next_guard_key = complete_guards[i+1]->guard_key.user_key();
  }
  for (; guard_index_iter < guards.size(); guard_index_iter++) {
    int compare = icmp_.user_comparator()->Compare(
        guards[guard_index_iter]->guard_key.user_key(), guard_key);
    if (compare == 0) {
      guard_index = guard_index_iter;
      guard_index_iter++;
      break;
    } else if (compare > 0) {
      break;
    } else {
      // Ideally it should never reach here since there are no duplicates in
      // complete_guards and complete_guards is a superset of guards
    }
  }
  if (guard_index == -1) {  // If guard is not found for this complete guard
    continue;
  }
  GuardMetaData* g = guards[guard_index];
  bool guard_added = false;
  for (unsigned j = 0; j < g->files.size(); j++) {
    FileMetaData* file = g->file_metas[j];
    Slice file_smallest = file->smallest.user_key();
    Slice file_largest = file->largest.user_key();
    if ((i < complete_guards.size()-1  // If it is not the last guard, checking for smallest and largest to fit in the range
         && (icmp_.user_comparator()->Compare(file_smallest, guard_key) < 0
             || icmp_.user_comparator()->Compare(file_largest, next_guard_key) >= 0))
        || (i == complete_guards.size()-1  // If it is the last guard, checking for the smallest to fit in the guard
            && icmp_.user_comparator()->Compare(file_smallest, guard_key) < 0)) {
      guards_to_add_to_compaction.push_back(g);
      guards_compaction_add_all_files.push_back(true);
      guard_added = true;
      break;  // No need to check other files
    }
  }
  if (!guard_added && which == 0 &&
      (force_compact || v->guard_compaction_scores_[current_level][guard_index] >= 1.0)) {
    guards_to_add_to_compaction.push_back(g);
    guards_compaction_add_all_files.push_back(false);
    continue;
  }
}
```
How to Reproduce
Run db_test many times, then observe DBTest.MultiThreaded crashing occasionally.
Stack trace
==== Test DBTest.MultiThreaded
[New Thread 0x7fff4cfd9700 (LWP 23220)]
[New Thread 0x7fff4c7d8700 (LWP 23221)]
[New Thread 0x7fff4bfd7700 (LWP 23222)]
... starting thread 0
[New Thread 0x7fff4b7d6700 (LWP 23223)]
... starting thread 1
[New Thread 0x7fff4afd5700 (LWP 23224)]
... starting thread 2
[New Thread 0x7fff4a7d4700 (LWP 23225)]
... starting thread 3
Thread 307 "db_test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4c7d8700 (LWP 23221)]
0x00007ffff7afe106 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) where
#0 0x00007ffff7afe106 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00005555555d46e3 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#2 0x00005555555d2919 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#3 0x00005555555d0a64 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#4 0x00005555555cd5e9 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#5 0x00005555555c8e3c in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase(unsigned long const&) ()
#6 0x00005555555c6273 in std::map<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase(unsigned long const&) ()
#7 0x00005555555b9358 in leveldb::VersionSet::RemoveFileLevelBloomFilterInfo(unsigned long) ()
#8 0x0000555555591e7f in leveldb::DBImpl::DeleteObsoleteFiles() ()
#9 0x0000555555595600 in leveldb::DBImpl::BackgroundCompactionGuards(leveldb::FileLevelFilterBuilder*) ()
#10 0x0000555555594dc3 in leveldb::DBImpl::CompactLevelThread() ()
#11 0x000055555559e395 in leveldb::DBImpl::CompactLevelWrapper(void*) ()
#12 0x00005555555e7403 in leveldb::(anonymous namespace)::StartThreadWrapper(void*) ()
#13 0x00007ffff7326494 in start_thread (arg=0x7fff4c7d8700) at pthread_create.c:333
#14 0x00007ffff7068acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Is PebblesDB completely compatible with leveldb? For example, if I have an existing leveldb database, will I get an error if I use PebblesDB to open it?
PebblesDB has a known memory leak when used with a large number of small key-value pairs. For some reason this doesn't appear when PebblesDB is used with large key-value pairs (the default). It doesn't affect default behavior, but we would like to fix it going forward.