utsaslab / pebblesdb
The PebblesDB write-optimized key-value store (SOSP 17)
License: BSD 3-Clause "New" or "Revised" License
I went through db_bench.cc and noticed that it ships with some code related to YCSB. What is the status of this part of the code? --benchmarks doesn't expose a ycsb option, and when I manually enable it, a core dump happens.
Also, db_bench.cc seems somewhat broken when I run the benchmark with its defaults:
$ ./db_bench
LevelDB: version 1.17
Date: Tue Apr 17 18:12:07 2018
CPU: 1 * Intel(R) Core(TM) i5-2435M CPU @ 2.40GHz
CPUCache: 3072 KB
Keys: 16 bytes each
Values: 1024 bytes each (512 bytes after compression)
Entries: 1000000
RawSize: 991.8 MB (estimated)
FileSize: 503.5 MB (estimated)
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
fillseq : 197.156 micros/op; 5.0 MB/s
fillsync : 4347.880 micros/op; 0.2 MB/s (1000 ops)
fillrandom : 234.785 micros/op; 4.2 MB/s
overwrite : 285.580 micros/op; 3.5 MB/s
readrandom : 61.021 micros/op; (1000000 of 1000000 found)
readrandom : 58.686 micros/op; (1000000 of 1000000 found)
readseq : 7.205 micros/op; 137.7 MB/s
readreverse : 16.426 micros/op; 60.4 MB/s
lt-db_bench: ./db/dbformat.h:111: leveldb::Slice leveldb::ExtractUserKey(const leveldb::Slice&): Assertion `internal_key.size() >= 8' failed.
Aborted (core dumped)
I'm playing around with the source code and I'm curious: does the source code support write amplification calculation out of the box? If so, can you point to which part does the calculation?
Thanks!
Verified in:
What happened:
After experiencing a power failure while adding values to PebblesDB with the verify_checksums and paranoid_checks parameters set to true, the database gets corrupted. After applying the recovery method suggested in https://github.com/google/leveldb/blob/main/doc/index.md (using RepairDB), a value that was only partially persisted is present.
The root cause of the problem is that some writes to the log file exceed the size of a page in the page cache. This can result in a "torn write", where only part of the write's payload is persisted, because page-cache pages can be flushed out of order. There are several references about this problem:
This problem was already reported in leveldb google/leveldb#251 and does not exist in the latest release (1.23).
How to reproduce
This issue can be replicated using LazyFS, a file system capable of simulating power failures and the OS behavior mentioned above, i.e., file-system pages being persisted to disk out of order.
The main problem is a write to the file 000003.log
which is 12288 bytes long. LazyFS will persist portions (in sizes of 4096 bytes) of this write out of order and will crash, simulating a power failure.
To reproduce this problem, one can follow these steps (the mentioned files, such as write_test.cpp, are in this zip: pebblesdb_test.zip):
Assuming the mount directory is /home/pebblesdb/data and the root directory is /home/pebblesdb/data-r, add the following lines to the default configuration file (config/default.toml):
[[injection]]
type="split_write"
file="/home/pebblesdb/data-r/000003.log"
persist=[1,3]
parts=3
occurrence=4
These lines define a fault to be injected: a power failure will be simulated after a write to the /home/pebblesdb/data-r/000003.log file. Since this write is large (12288 bytes), it is split into 3 parts (each 4096 bytes), and only the first and third parts will be persisted. The occurrence parameter specifies that the fault applies to the fourth write issued to this file.
Start LazyFS with the following command:
./scripts/mount-lazyfs.sh -c config/default.toml -m /home/pebblesdb/data -r /home/pebblesdb/data-r -f
Compile and execute the write_test.cpp file, which adds 4 key-value pairs to PebblesDB; the third pair is the only one that exceeds the size of a page in the page cache.
Immediately after this step, PebblesDB will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued until the moment of the fault. Here is a simplified version of the log:
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '262144', 'off': '0'}
{'syscall': 'read', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '131072', 'off': '0'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '4096', 'off': '0'}
{'syscall': 'fsync', 'path': '/home/pebblesdb/data-r/000003.log'}
{'syscall': 'write', 'path': '/home/pebblesdb/data-r/000003.log', 'size': '12288', 'off': '0'}
fusermount -uz /home/pebblesdb/data
Compile and execute the repair.cpp file, which recovers the database.
Compile and execute the read_test.cpp file, which reads and checks the values previously inserted. The value for the key k3 is only part of the initial value.
Note that when paranoid_checks and verify_checksums are set to false, PebblesDB does not fail on restart and discards the partial value of the key k3 (it reports that the key does not exist).
I'd like PebblesDB to be accessible from Python and installable via pip. Currently, PebblesDB works well in C++, and somewhat less well with Java (using the LevelDB JNI binding).
Implement Java JNI Wrapper so that Java applications can access PebblesDB.
PebblesDB crashes when running large key-value pairs under the YCSB benchmark. I have been trying to make it run, but it keeps crashing when I feed it KV pairs larger than a few tens of KB. If PebblesDB does not support large KV pairs, please mention that here; if it does, how can I work around this?
PebblesDB currently uses bloom filters. Change PebblesDB so that it uses SuRF filters from CMU (https://github.com/efficient/SuRF) instead. This should help both point-query and range-query performance.
Right now, PebblesDB uses a lot of memory for the TableCache (caching metadata) and for the bloom filters used for each sstable.
We want to add a command line option for PebblesDB which would limit the total amount of memory used by PebblesDB for the TableCache and bloom filters.
When allocating the specified amount of memory, preference should be given first to the table cache, and then to bloom filters for the upper levels (level 0, level 1).
Why not add all files at the level directly to guards_compaction_add_all_files, instead of taking the intersection of complete_guards and guards? I'm a little confused by this code, so I'd appreciate an explanation.
```cpp
int guard_index_iter = 0;
for (size_t i = 0; i < complete_guards.size(); i++) {
  GuardMetaData* cg = complete_guards[i];
  int guard_index = -1;
  Slice guard_key = cg->guard_key.user_key(), next_guard_key;
  if (i + 1 < complete_guards.size()) {
    next_guard_key = complete_guards[i+1]->guard_key.user_key();
  }
  for (; guard_index_iter < guards.size(); guard_index_iter++) {
    int compare = icmp_.user_comparator()->Compare(
        guards[guard_index_iter]->guard_key.user_key(), guard_key);
    if (compare == 0) {
      guard_index = guard_index_iter;
      guard_index_iter++;
      break;
    } else if (compare > 0) {
      break;
    } else {
      // Ideally it should never reach here since there are no duplicates in
      // complete_guards and complete_guards is a superset of guards
    }
  }
  if (guard_index == -1) {  // If guard is not found for this complete guard
    continue;
  }
  GuardMetaData* g = guards[guard_index];
  bool guard_added = false;
  for (unsigned j = 0; j < g->files.size(); j++) {
    FileMetaData* file = g->file_metas[j];
    Slice file_smallest = file->smallest.user_key();
    Slice file_largest = file->largest.user_key();
    if ((i < complete_guards.size()-1  // If it is not the last guard, checking for smallest and largest to fit in the range
         && (icmp_.user_comparator()->Compare(file_smallest, guard_key) < 0
             || icmp_.user_comparator()->Compare(file_largest, next_guard_key) >= 0))
        || (i == complete_guards.size()-1  // If it is the last guard, checking for the smallest to fit in the guard
            && icmp_.user_comparator()->Compare(file_smallest, guard_key) < 0)) {
      guards_to_add_to_compaction.push_back(g);
      guards_compaction_add_all_files.push_back(true);
      guard_added = true;
      break;  // No need to check other files
    }
  }
  if (!guard_added && which == 0 &&
      (force_compact || v->guard_compaction_scores_[current_level][guard_index] >= 1.0)) {
    guards_to_add_to_compaction.push_back(g);
    guards_compaction_add_all_files.push_back(false);
    continue;
  }
}
```
How to Reproduce
Run db_test many times, then observe DBTest.MultiThreaded crashing occasionally.
Stack trace
==== Test DBTest.MultiThreaded
[New Thread 0x7fff4cfd9700 (LWP 23220)]
[New Thread 0x7fff4c7d8700 (LWP 23221)]
[New Thread 0x7fff4bfd7700 (LWP 23222)]
... starting thread 0
[New Thread 0x7fff4b7d6700 (LWP 23223)]
... starting thread 1
[New Thread 0x7fff4afd5700 (LWP 23224)]
... starting thread 2
[New Thread 0x7fff4a7d4700 (LWP 23225)]
... starting thread 3
Thread 307 "db_test" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff4c7d8700 (LWP 23221)]
0x00007ffff7afe106 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) where
#0 0x00007ffff7afe106 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1 0x00005555555d46e3 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#2 0x00005555555d2919 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#3 0x00005555555d0a64 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#4 0x00005555555cd5e9 in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase[abi:cxx11](std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::_Rb_tree_const_iterator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >) ()
#5 0x00005555555c8e3c in std::_Rb_tree<unsigned long, std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, std::_Select1st<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase(unsigned long const&) ()
#6 0x00005555555c6273 in std::map<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*> > >::erase(unsigned long const&) ()
#7 0x00005555555b9358 in leveldb::VersionSet::RemoveFileLevelBloomFilterInfo(unsigned long) ()
#8 0x0000555555591e7f in leveldb::DBImpl::DeleteObsoleteFiles() ()
#9 0x0000555555595600 in leveldb::DBImpl::BackgroundCompactionGuards(leveldb::FileLevelFilterBuilder*) ()
#10 0x0000555555594dc3 in leveldb::DBImpl::CompactLevelThread() ()
#11 0x000055555559e395 in leveldb::DBImpl::CompactLevelWrapper(void*) ()
#12 0x00005555555e7403 in leveldb::(anonymous namespace)::StartThreadWrapper(void*) ()
#13 0x00007ffff7326494 in start_thread (arg=0x7fff4c7d8700) at pthread_create.c:333
#14 0x00007ffff7068acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Is PebblesDB completely compatible with leveldb? For example, if I have an existing leveldb database, will I get an error if I use PebblesDB to open it?
PebblesDB has a known memory leak when used with a large number of small key-value pairs. For some reason this doesn't appear when PebblesDB is used with large key-value pairs (the default). It doesn't affect default behavior, but we would like to fix it going forward.