skyzh / mini-lsm

A tutorial of building an LSM-Tree storage engine in a week!

Home Page: https://skyzh.github.io/mini-lsm/

License: Apache License 2.0

Rust 99.93% CSS 0.03% Shell 0.04%

database lsm-tree rust storage tutorial key-value-store kv-store lsm

mini-lsm's Introduction

LSM in a Week

Build a simple key-value storage engine in a week, then extend your LSM engine in the second and third weeks.

Tutorial

The Mini-LSM book is available at https://skyzh.github.io/mini-lsm. You may follow this guide and implement the Mini-LSM storage engine. The tutorial has 3 weeks (parts), each consisting of 7 days (chapters).

Community

You may join skyzh's Discord server and study with the mini-lsm community.

Add Your Solution

If you have finished at least one full week of this tutorial, you can add your solution to the community solution list at SOLUTIONS.md. You can submit a pull request, and we might do a quick review of your code in return for your hard work.

Development

For Students

You should modify code in the mini-lsm-starter directory.

cargo x install-tools
cargo x copy-test --week 1 --day 1
cargo x scheck
cargo run --bin mini-lsm-cli
cargo run --bin compaction-simulator

For Course Developers

You should modify mini-lsm and mini-lsm-mvcc.

cargo x install-tools
cargo x check
cargo x book

If you changed the public API in the reference solution, you might also need to synchronize it to the starter crate. To do this, use cargo x sync.

Code Structure

mini-lsm: the final solution code for <= week 2
mini-lsm-mvcc: the final solution code for week 3 MVCC
mini-lsm-starter: the starter code
mini-lsm-book: the tutorial

We have another repo mini-lsm-solution-checkpoint at https://github.com/skyzh/mini-lsm-solution-checkpoint. In this repo, each commit corresponds to a chapter in the tutorial. We will not update the solution checkpoint very often.

Demo

You can run the reference solution by yourself to gain an overview of the system before you start.

cargo run --bin mini-lsm-cli-ref
cargo run --bin mini-lsm-cli-mvcc-ref

We also have a compaction simulator for experimenting with your compaction algorithm implementation:

cargo run --bin compaction-simulator-ref
cargo run --bin compaction-simulator-mvcc-ref

Tutorial Structure

We have 3 weeks + 1 extra week (in progress) for this tutorial.

Week 1: Storage Format + Engine Skeleton
Week 2: Compaction and Persistence
Week 3: Multi-Version Concurrency Control
The Extra Week / Rest of Your Life: Optimizations (unlikely to be available in 2024...)

Week + Chapter	Topic
1.1	Memtable
1.2	Merge Iterator
1.3	Block
1.4	Sorted String Table (SST)
1.5	Read Path
1.6	Write Path
1.7	SST Optimizations: Prefix Key Encoding + Bloom Filters
2.1	Compaction Implementation
2.2	Simple Compaction Strategy (Traditional Leveled Compaction)
2.3	Tiered Compaction Strategy (RocksDB Universal Compaction)
2.4	Leveled Compaction Strategy (RocksDB Leveled Compaction)
2.5	Manifest
2.6	Write-Ahead Log (WAL)
2.7	Batch Write and Checksums
3.1	Timestamp Key Encoding
3.2	Snapshot Read - Memtables and Timestamps
3.3	Snapshot Read - Transaction API
3.4	Watermark and Garbage Collection
3.5	Transactions and Optimistic Concurrency Control
3.6	Serializable Snapshot Isolation
3.7	Compaction Filters

License

The Mini-LSM starter code and solution are under Apache 2.0 license. The author reserves the full copyright of the tutorial materials (markdown files and figures).

mini-lsm's People


yaoyaoio zrealshadow sgzerolc zhouwfang chsjiang parsedark cycloidzzz qiaolin-yu isgasho jancd wbtlb shscy sharath-naik0 junzhigediao iyueyong corrigentia mahinshaw gopherj sulina9 jiabai98 xiangjiao9 crt-fork iamljd loote2 zhoush41 saliou33 ss18 amyblur 1920853199 rafikmojr ns11 asdlei99 hu00yan pravinshahi0007 weykon hawkingrei zouyishan dousir9 duytruongpham warmchang wathenjiang gobraves zeminzhou dashjay karlvenk wenyxu rohankumardubey 67jason sstamoulis rustacean-workshop ndrsllwngr dlhxzb shichao-wang leboucetmistere ggethan stevelauc lordworms mooleetzi zmshahaha stopire spenserzc spl0i7 hrh007 tianion loloxwg hypenzou duanfuxiang0 felixduan0 ywy2090 keliwang xuxian73 wilbertharriman clslaid yutiansut yaaawww yoeight moxiney stamp711 ponsecis engkaubabi2 v1siuol xuxiaotuan centersakshi flaymes eveningcafe kardusenor lu-kuan-lpk leager-zju hustjieke jubilee101 jblueberry szupzj18 ehhuts xzhseh zen0fpy chpaek ting-kkk congee littleli-star7 andrewrong

mini-lsm's Issues

simple_leveled compaction apply

In this file: https://github.com/skyzh/mini-lsm/blob/main/mini-lsm/src/compact/simple_leveled.rs

let new_l0_sstables = snapshot
    .l0_sstables
    .iter()
    .copied()
    .filter(|x| !l0_ssts_compacted.remove(x))
    .collect::<Vec<_>>();
assert!(l0_ssts_compacted.is_empty());
snapshot.l0_sstables = new_l0_sstables;

snapshot.l0_sstables is the latest state. The L0 tables may already have been compacted by a full compaction task, so l0_sstables may be empty. I'm not sure this assert is right.
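To make the question concrete, here is a minimal standalone sketch of the pattern in the snippet above. The function name `remove_compacted` and the plain `usize` IDs are assumptions for illustration, not the repository's API. It shows that the assert only holds when every compacted ID is still present in the L0 list at apply time.

```rust
use std::collections::HashSet;

/// Hypothetical helper: drops every ID in `compacted` from `l0` and returns
/// the survivors. Panics if some compacted ID was not present in `l0`,
/// mirroring the `assert!` in the snippet above.
fn remove_compacted(l0: &[usize], compacted: &[usize]) -> Vec<usize> {
    let mut pending: HashSet<usize> = compacted.iter().copied().collect();
    let remaining: Vec<usize> = l0
        .iter()
        .copied()
        // `HashSet::remove` returns true only when the ID was still pending,
        // so the closure filters the entry out and marks it as accounted for.
        .filter(|id| !pending.remove(id))
        .collect();
    assert!(
        pending.is_empty(),
        "compacted SSTs missing from L0: {pending:?}"
    );
    remaining
}
```

If another task had already removed the same L0 tables, `pending` would not drain and the assert would fire, which is the situation the question describes.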

Why remove the `memtable` from `imm_memtables` in `LsmStorage.sync()`?

mini-lsm/mini-lsm/src/lsm_storage.rs

Lines 165 to 177 in 130b47b

// Add the flushed L0 table to the list.
{
    let mut guard = self.inner.write();
    let mut snapshot = guard.as_ref().clone();
    // Remove the memtable from the immutable memtables.
    snapshot.imm_memtables.pop();
    // Add L0 table
    snapshot.l0_sstables.push(sst);
    // Update SST ID
    snapshot.next_sst_id += 1;
    // Update the snapshot.
    *guard = Arc::new(snapshot);
}

In sync() we first push the memtable to imm_memtables, but then remove it when flushing to an SSTable. What is the reason for removing the memtable from imm_memtables? If we did not remove it, we could still serve key-value pairs from imm_memtables, which reside in memory.

And what is the role of imm_memtables in LsmStorage? It's unclear to me.
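One way to picture the role of imm_memtables is the toy model below. It is a sketch under assumptions, not the reference code: `ToyState`, `freeze`, and `flush_oldest` are hypothetical names, and all tiers are in-memory maps. The ordering (newest immutable memtable first, newest L0 table last) is chosen to match the `pop()` and `push(sst)` calls in the snippet above.

```rust
use std::collections::BTreeMap;

type Table = BTreeMap<String, String>;

/// Toy state: every tier is an in-memory map here; in a real engine the
/// L0 tables would live on disk.
#[derive(Default)]
struct ToyState {
    memtable: Table,           // receives writes
    imm_memtables: Vec<Table>, // frozen, waiting to be flushed (newest first)
    l0_sstables: Vec<Table>,   // flushed tables (newest last)
}

impl ToyState {
    /// Freeze the current memtable so a fresh one can accept writes.
    fn freeze(&mut self) {
        let full = std::mem::take(&mut self.memtable);
        self.imm_memtables.insert(0, full);
    }

    /// Move the oldest frozen memtable to L0. Once it is an SST, keeping the
    /// in-memory copy would only duplicate data and hold on to memory.
    fn flush_oldest(&mut self) {
        if let Some(oldest) = self.imm_memtables.pop() {
            self.l0_sstables.push(oldest);
        }
    }

    /// Reads probe newer data first: memtable, frozen memtables, then L0.
    fn get(&self, key: &str) -> Option<&String> {
        self.memtable
            .get(key)
            .or_else(|| self.imm_memtables.iter().find_map(|t| t.get(key)))
            .or_else(|| self.l0_sstables.iter().rev().find_map(|t| t.get(key)))
    }
}
```

In this model the frozen memtables are a staging area: they keep data readable between freezing and flushing, and after the flush the same key is still found through the L0 table.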

documentation error in day1 task2

https://github.com/skyzh/mini-lsm/blob/210a1c66c17423654ec82046dbdf794dc4f05bfe/mini-lsm-book/src/week1-01-memtable.md?plain=1#L42C1-L42C17

It seems that task2 does not require changing the mem_table.rs

Tracking: Proof Reading, v1 release

Is there anything else besides just writing?

Feedback after complete mini-lsm.

First of all, thank you for such a great tutorial on LSM-Tree storage engine.

After finishing the 3-week project, I have a lot to share, including some questions and suggestions.

1. StorageIterator specification

Is there an implicit StorageIterator specification?

In the week 1 project, there are several iterators that implement StorageIterator.

As I implemented these iterators, I often asked myself, "Have I implemented these iterators in a coherent way?"

It turns out that I wrote the same assertions again and again, so I feel there is some implicit specification that I should apply.

The spec I summarized is listed below:

fn key(&self) -> KeyType<'_>
- This function may be called only when the iterator is valid; calling it on an invalid iterator may panic.
- The return value must be a valid key, that is, a non-empty key.
fn value(&self) -> &[u8]
- This function may be called only when the iterator is valid; calling it on an invalid iterator may panic.
fn is_valid(&self) -> bool
- This function returns false iff the underlying storage cannot produce a new element.
- For FuseIterator, it also returns false after next has returned an error.
fn num_active_iterators(&self) -> usize
- For an invalid iterator, it should return 0.
- For a valid iterator, it should not return 0.
- For FuseIterator, when the iterator is valid, it should not return 0. When the iterator is invalid, the return value is unspecified.
fn next(&mut self) -> Result<()>
- For an invalid iterator, it is a no-op.
- For a valid iterator, if it returns Ok(()):
  - If the underlying storage cannot produce a new element, the iterator becomes invalid.
  - Otherwise, the iterator stores the new element.
  - The key of the new element should be strictly greater than the old key.
- For a valid iterator, if it returns Err(e):
  - The key, value, and is_valid calls should return the same values as before, that is, the call acts like a no-op.
  - For compound iterators like MergeIterator, SstConcatIterator and TwoMergeIterator, it should remove the underlying iterator that caused the error. (I'm not sure whether this is sound for SstConcatIterator and TwoMergeIterator.)
  - For FuseIterator, it should turn into an invalid iterator.

I'm not sure if it covers all situations.
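As an illustration of the spec above, here is a minimal sketch of an iterator that follows it. The trait is a simplified stand-in for the tutorial's StorageIterator (plain `&[u8]` keys, `String` errors, no num_active_iterators), and `VecIterator` is a hypothetical type that exists only for this example.

```rust
/// Simplified stand-in for the tutorial's trait; the signatures here are
/// assumptions made to keep the example self-contained.
trait StorageIterator {
    fn key(&self) -> &[u8];
    fn value(&self) -> &[u8];
    fn is_valid(&self) -> bool;
    fn next(&mut self) -> Result<(), String>;
}

/// Hypothetical iterator over an in-memory, key-sorted list of entries.
struct VecIterator {
    entries: Vec<(Vec<u8>, Vec<u8>)>,
    idx: usize,
}

impl VecIterator {
    fn new(entries: Vec<(Vec<u8>, Vec<u8>)>) -> Self {
        // Spec: keys are non-empty and strictly increasing.
        debug_assert!(entries.iter().all(|(k, _)| !k.is_empty()));
        debug_assert!(entries.windows(2).all(|w| w[0].0 < w[1].0));
        Self { entries, idx: 0 }
    }
}

impl StorageIterator for VecIterator {
    fn key(&self) -> &[u8] {
        // Spec: calling key() on an invalid iterator may panic.
        assert!(self.is_valid(), "key() called on an invalid iterator");
        &self.entries[self.idx].0
    }

    fn value(&self) -> &[u8] {
        assert!(self.is_valid(), "value() called on an invalid iterator");
        &self.entries[self.idx].1
    }

    fn is_valid(&self) -> bool {
        // False iff the underlying storage has no further element.
        self.idx < self.entries.len()
    }

    fn next(&mut self) -> Result<(), String> {
        // Spec: next() on an invalid iterator is a no-op.
        if self.is_valid() {
            self.idx += 1;
        }
        Ok(())
    }
}
```

This iterator cannot fail, so the Err(e) branch of the spec is not exercised here.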

2. Inconsistent function signature

As mentioned in #72, there is an inconsistent function signature in decode_block_meta.

mini-lsm/mini-lsm-starter/src/table.rs

Lines 45 to 46 in dd333ca

/// Decode block meta from a buffer.
pub fn decode_block_meta(buf: impl Buf) -> Vec<BlockMeta> {

mini-lsm/mini-lsm/src/table.rs

Lines 63 to 64 in dd333ca

/// Decode block meta from a buffer.
pub fn decode_block_meta(mut buf: &[u8]) -> Result<Vec<BlockMeta>> {

Until week 2 day 7, the signature is fine, but it turns out that this is not sufficient for checksum functionality.

The original signature implies that this function is infallible. But a checksum mismatch is an error, so the return type should be wrapped in Result. Also, &[u8] is more convenient than impl Buf when implementing checksum functionality.
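The sketch below illustrates why a slice input and a Result return type fit checksum verification. It is not the tutorial's format: `toy_checksum` is a deliberately simple stand-in (a real engine would use CRC32 or similar), the trailing big-endian layout is an assumption, and both function names are hypothetical.

```rust
/// Hypothetical checksum: a wrapping byte sum, used only to keep the
/// example dependency-free.
fn toy_checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, b| acc.wrapping_add(*b as u32))
}

/// Appends a big-endian checksum after the payload (assumed layout).
fn encode_with_checksum(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&toy_checksum(payload).to_be_bytes());
    out
}

/// Splits off the trailing checksum and verifies it. With a `&[u8]` input
/// the payload can be sliced and checksummed without copying, and the
/// mismatch case has somewhere to go because the function returns Result.
fn decode_with_checksum(buf: &[u8]) -> Result<&[u8], String> {
    if buf.len() < 4 {
        return Err("buffer too short to hold a checksum".to_string());
    }
    let (payload, tail) = buf.split_at(buf.len() - 4);
    let mut stored = [0u8; 4];
    stored.copy_from_slice(tail);
    let stored = u32::from_be_bytes(stored);
    let actual = toy_checksum(payload);
    if stored != actual {
        return Err(format!("checksum mismatch: stored {stored}, computed {actual}"));
    }
    Ok(payload)
}
```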

3. key always non-empty

It seems that an empty key is considered invalid. We could enforce this by adding a ParsedKey type, for example with the nutype crate.

But it may not be necessary for an educational project, since it is convenient to use an empty key instead of Option wrapper.
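For reference, a hand-written version of this idea could look like the sketch below. It does not use the nutype crate; `ParsedKey` and its methods are hypothetical and only show how a constructor can reject empty keys once, so the rest of the code never has to re-check.

```rust
/// Hypothetical newtype: a key that is guaranteed to be non-empty.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct ParsedKey(Vec<u8>);

impl ParsedKey {
    /// The only way to build a ParsedKey; returns None for an empty key.
    fn new(raw: Vec<u8>) -> Option<Self> {
        if raw.is_empty() {
            None
        } else {
            Some(Self(raw))
        }
    }

    /// Read-only access to the underlying bytes.
    fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}
```

The trade-off mentioned above still applies: code that currently uses an empty key as an "invalid" marker would have to switch to Option<ParsedKey>.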

4. `apply_result` does not point out what it should return

mini-lsm/mini-lsm-starter/src/compact/simple_leveled.rs

Lines 41 to 48 in dd333ca

/// Apply the compaction result.
///
/// The compactor will call this function with the compaction task and the list of SST ids generated. This function applies the
/// result and generates a new LSM state. The functions should only change `l0_sstables` and `levels` without changing memtables
/// and `sstables` hash map. Though there should only be one thread running compaction jobs, you should think about the case
/// where an L0 SST gets flushed while the compactor generates new SSTs, and with that in mind, you should do some sanity checks
/// in your implementation.
pub fn apply_compaction_result(

The documentation does point out that the returned LsmStorageState should be the new state, but it does not say what the returned Vec<usize> means.
Only after examining the solution code did I figure out that it should be the IDs of the SSTs that need to be removed.

5. Seedable level compaction simulator

The level compaction simulator uses rand to generate the key range, so each run produces different output, making it difficult to verify that the output is identical to the reference solution with command line tools such as diff or cmp.

One solution is to add the seed flag to the simulator, and use rand::SeedableRng to generate random numbers.
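The property being asked for is that the same seed always yields the same key ranges. The sketch below demonstrates it with a small hand-written SplitMix64 generator so that it runs without dependencies; it is a stand-in for a seeded generator from the rand crate, and `key_ranges` is a hypothetical helper, not part of the simulator.

```rust
/// SplitMix64, used here only as a dependency-free seeded generator.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: generates `n` key ranges (first, last) inside
/// `0..key_space`. `key_space` must be non-zero. The output depends only on
/// `seed`, so two runs with the same seed can be compared with diff.
fn key_ranges(seed: u64, n: usize, key_space: u64) -> Vec<(u64, u64)> {
    let mut rng = SplitMix64(seed);
    (0..n)
        .map(|_| {
            let a = rng.next_u64() % key_space;
            let b = rng.next_u64() % key_space;
            (a.min(b), a.max(b))
        })
        .collect()
}
```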

6. `println!()` to log

In the reference solution, the implementation of apply_result contains some println!() calls that log information, making it difficult to verify that the output is identical to the reference solution with command line tools such as diff or cmp.

I think it should be replaced with eprintln!(), which will output to stderr and does not lose logging functionality.

7. Encourage to add test

In a Rust project, adding tests is pretty easy, and adding tests is one of the easiest ways to contribute to a repository.

For example, when implementing manifest functionality, it is good to test the serde_json library first.
So I can quickly write a related test within manifest.rs.

#[cfg(test)]
#[test]
fn test_record_serde() {
    let record1 = ManifestRecord::Flush(1);
    let json = serde_json::to_vec(&record1).unwrap();
    let record2 = serde_json::from_slice(&json).unwrap();
    assert_eq!(record1, record2);
}

Most IDEs can run Rust tests directly, which is different from C/C++ or Java, where you need some sort of test framework or have to use the main function to run a single simple test.

8. extract apply result logic to bind specific impl

When implementing the apply_result function, I used many Vec methods in my solution,
e.g. Vec::splice and Vec::drain. But representing levels as Vec<(usize, Vec<usize>)> seems inefficient for compaction.

We could generalize the levels implementation as levels: impl LevelsImpl, create a type alias type Levels = Vec<(usize, Vec<usize>)>;, and implement the LevelsImpl trait for Vec<(usize, Vec<usize>)>. That would make it easy to switch to another implementation.

But it is of low priority, since this project is just for educational purposes.
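A rough sketch of the trait idea follows. The method set (`ssts_in`, `replace_level`, `remove_ssts`) is invented for illustration; only the `Levels` alias and the `LevelsImpl` name come from the suggestion above.

```rust
/// Alias for the current representation: (level number, SST IDs in that level).
type Levels = Vec<(usize, Vec<usize>)>;

/// Hypothetical trait; the methods are assumptions about what compaction
/// code would need from a levels container.
trait LevelsImpl {
    fn ssts_in(&self, level: usize) -> &[usize];
    fn replace_level(&mut self, level: usize, new_ssts: Vec<usize>);
    fn remove_ssts(&mut self, level: usize, ids: &[usize]);
}

impl LevelsImpl for Levels {
    fn ssts_in(&self, level: usize) -> &[usize] {
        self.iter()
            .find(|entry| entry.0 == level)
            .map(|entry| entry.1.as_slice())
            .unwrap_or(&[])
    }

    fn replace_level(&mut self, level: usize, new_ssts: Vec<usize>) {
        match self.iter().position(|entry| entry.0 == level) {
            Some(i) => self[i].1 = new_ssts,
            None => self.push((level, new_ssts)),
        }
    }

    fn remove_ssts(&mut self, level: usize, ids: &[usize]) {
        if let Some(i) = self.iter().position(|entry| entry.0 == level) {
            self[i].1.retain(|id| !ids.contains(id));
        }
    }
}
```

Compaction code written against the trait would not need to change if the Vec were later replaced by a different structure.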

9. Immutable memtable is not immutable

Due to the interior mutability of MemTable, there is no way to forbid programmers from modifying immutable memtables in Vec<Arc<MemTable>>.

This may, with low probability, cause a nasty bug, although it can be avoided with a little care.

The solution is quite simple: Use the Newtype pattern and expose only the read-only interface.
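A minimal sketch of that newtype approach is shown below. `MemTable` here is a toy with interior mutability, not the tutorial's type, and `ImmMemTable` and `freeze` are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Toy memtable: `put` takes `&self`, so anyone holding a reference can write.
struct MemTable {
    map: RwLock<BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl MemTable {
    fn new() -> Self {
        Self { map: RwLock::new(BTreeMap::new()) }
    }

    fn put(&self, key: &[u8], value: &[u8]) {
        self.map.write().unwrap().insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.read().unwrap().get(key).cloned()
    }
}

/// Hypothetical wrapper for a frozen memtable. It holds the same Arc but
/// re-exports only the read path, so `put` is not reachable through it.
struct ImmMemTable(Arc<MemTable>);

impl ImmMemTable {
    fn freeze(inner: Arc<MemTable>) -> Self {
        Self(inner)
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key)
    }
}
```

Storing Vec<ImmMemTable> instead of Vec<Arc<MemTable>> would then turn an accidental write into a compile error. The wrapper does not stop writes through other clones of the same Arc, so the freeze step should consume the last writable handle.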

10. Migration to mvcc version is considered painful

It is kind of dirty work, and frustrating, to refactor code to use the key+ts representation.

The code that needs to be changed is scattered across many files.

Asking students to write unsafe code in Rust is awkward.

Learning from code that does not compile is quite inefficient.

11. Use `Arc::new_cyclic` to store a weak pointer in LsmMvccInner

pub fn new_txn(&self, inner: Arc<LsmStorageInner>, serializable: bool) -> Arc<Transaction>

Strictly speaking, LsmMvccInner::new_txn can only be called with the LsmStorageInner that created it, but its semantics do not restrict this.

For example, if there are two different LsmStorageInner named storageA and storageB with associated LsmMvccInner named mvccA and mvccB,
nothing prevents us from calling mvccB.new_txn(storageA) or mvccA.new_txn(storageB), which is obviously a logical error.

However, as project developers, we know that there is only one LsmStorageInner instance, so this won't be a critical issue.

But ideally, we can do better by using Arc::new_cyclic to pass a weak pointer directly into LsmMvccInner::new.
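The standard library function for this is `std::sync::Arc::new_cyclic`. The sketch below shows the shape of the idea with toy types; `Storage`, `Mvcc`, and `Txn` are simplified stand-ins, not the tutorial's LsmStorageInner and LsmMvccInner.

```rust
use std::sync::{Arc, Weak};

/// Stand-in for LsmMvccInner: remembers the storage that owns it.
struct Mvcc {
    storage: Weak<Storage>,
}

/// Stand-in for LsmStorageInner.
struct Storage {
    name: String,
    mvcc: Mvcc,
}

struct Txn {
    storage: Arc<Storage>,
}

impl Storage {
    fn open(name: &str) -> Arc<Self> {
        // `new_cyclic` hands the closure a Weak pointer to the Arc that is
        // being constructed, so the child can point back at its owner.
        Arc::new_cyclic(|weak: &Weak<Storage>| Storage {
            name: name.to_string(),
            mvcc: Mvcc { storage: weak.clone() },
        })
    }
}

impl Mvcc {
    /// No storage argument: the transaction is always bound to the storage
    /// that created this Mvcc, so mixing up A and B is no longer expressible.
    fn new_txn(&self) -> Option<Txn> {
        self.storage.upgrade().map(|storage| Txn { storage })
    }
}
```

`upgrade` returns None only if the storage has already been dropped, which is why `new_txn` returns an Option in this sketch.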

12. TxnIterator and LsmIterator are not compatible with the non-mvcc version

Currently TxnIterator uses TwoMergeIterator<TxnLocalIterator, LsmIterator> to iterate over the underlying storage.

When switching to the non-mvcc version, creating an empty transaction without mvcc can be tricky.

It might be better to use Option<A> in TwoMergeIterator and Option<Arc<Transaction>> in TxnIterator to handle this situation.

Again, thank you for such a great tutorial on LSM-Tree storage engine.

failed to load manifest for workspace member

When I run the command cargo x install-tools, I get the following error:

$ cargo x install-tools

error: failed to load manifest for workspace member `/Users/wjjiang/rustproject/mini-lsm/mini-lsm`

Caused by:
  failed to parse manifest at `/Users/wjjiang/rustproject/mini-lsm/mini-lsm/Cargo.toml`

Caused by:
  invalid type: map, expected a string for key `package.edition`

I don't know how to fix it, could someone help?

Feedback after coding day 1

Hey,
I discovered this project recently and find it super interesting. I had a very superficial understanding of LSM engines from having worked a little with Cassandra, but I never had the time or need to dig in and understand how they work in detail. This tutorial sparked the motivation to do so, and so far I find it really interesting! Thanks for taking the time to build such interesting learning resources.

I just finished the day 1 part for now but I feel that I have a bit of feedback to give:

- It took me some time to understand that it is normal and expected that the BlockBuilder::add method is not responsible for sorting the entries of the block, and that the caller is instead responsible for inserting entries in sorted order. After thinking about it for a while it finally made sense, because blocks (and SSTables) are built from the in-memory tables, so all the data is available when building a block. But this was not clear in the text, and I spent some time wondering why we were implementing a binary search in the block iterator while we didn't enforce any sorting in the block builder.
- Some things don't feel really "rusty". I'm thinking about the iterator interface you designed for BlockIterator, which looks really different from the iterators we are used to in Rust. In particular, the idea of "invalidating" the iterator by using empty vecs as key and value once the iterator is consumed looks really different from the pattern of the next method returning an Option<T>, with None marking the end of iteration. This is not bad per se, since this custom iterator has different requirements, but it's a bit surprising.
- I think your unit tests could have more coverage, and could also be split into smaller tests that each test a single property. For instance, when the tutorial specifies a property like "Key length and value length are 2B, which means their maximum length is 65536.", I expect to see a unit test that covers this property and makes sure it is enforced. I did some of this in my day 1 work in my fork if you want to check it out (LeBoucEtMistere#1).
- I think the tutorial steps could provide clearer entry points in the code, i.e. "to solve this task, you will need to implement function in XXX".
- I noticed you liberally use as conversions in your solution. They tend to be risky, especially when you convert to a more restrictive type, e.g. usize to u16. In that case it's best to use try_into to handle the case in which the conversion fails. That's not a big deal, but since this is a tutorial it's always nice to leave a note about this to teach people about the risks of as conversions :)
- I had never used the bytes crate before, so I had to spend some time reading the docs and understanding how it works by looking at the solution code. I would have loved a note in the tutorial mentioning that it would be wise to leverage this crate to build the solution (it is not mentioned anywhere except in the solution, so I suspect many people might not know it's available), and maybe also a very quick example in the tutorial of how it helps when working with bytes.
- In the part about the encoding of blocks, you give great schemas of the format, but I would also have loved to see a concrete example with actual example data to make it clearer.
- I would have loved to see some pointers to crates/algorithms in the extra tasks. I'm not very familiar with algorithms for compression/checksums, especially in the context of LSM engines, and it felt quite hard to explore this topic without a few pointers.
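To illustrate the point about as conversions in the list above: `as` silently truncates when narrowing, while `u16::try_from` reports the overflow. The helper `encode_len` is hypothetical, and the big-endian byte order is an assumption made for the example.

```rust
use std::num::TryFromIntError;

/// Hypothetical helper: encodes a length as 2 bytes, failing instead of
/// silently wrapping when the length does not fit into a u16.
fn encode_len(len: usize) -> Result<[u8; 2], TryFromIntError> {
    let len = u16::try_from(len)?;
    Ok(len.to_be_bytes())
}

/// The same conversion with `as`: values above u16::MAX wrap around.
fn encode_len_lossy(len: usize) -> [u8; 2] {
    (len as u16).to_be_bytes()
}
```

With `as`, a 65536-byte key would be recorded with length 0, which corrupts the block without any error at write time.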

These are just a bunch of suggestions from my personal experience but I hope it helps you make your tutorial even better. Thanks for the great work, looking forward to working on day 2 tasks :D 🚀 !

Confusing SST structure figure in W1D4

The SST structure is described in W1D4 like this.

-------------------------------------------------------------------------------------------
|         Block Section         |          Meta Section         |          Extra          |
-------------------------------------------------------------------------------------------
| data block | ... | data block |            metadata           | meta block offset (u32) |
-------------------------------------------------------------------------------------------

I found "meta block offset (u32)" is included in meta data instead of Extra. Am I missing something? Or is there an error in this figure?

Wouldn't it be clearer if it were described like this?

-------------------------------------------------------------------------------------------
|         Block Section         |          Meta Section         |          Extra          |
-------------------------------------------------------------------------------------------
| data block | ... | data block |  data meta | ... | data meta  |       checksum (u32)    |
-------------------------------------------------------------------------------------------

WAL Atomicity

As pointed out by someone in Discord, we do not guarantee atomicity for WAL, in the case of txn commit.

Should function signatures be consistent between mini-lsm-starter and mini-lsm?

Hi, I noticed that some function signatures are different between mini-lsm-starter and mini-lsm.

For example, decode_block_meta in mini-lsm-starter

mini-lsm/mini-lsm-starter/src/table.rs

Lines 45 to 47 in bcaab6f

/// Decode block meta from a buffer.
pub fn decode_block_meta(buf: impl Buf) -> Vec<BlockMeta> {
    unimplemented!()

decode_block_meta in mini-lsm

mini-lsm/mini-lsm/src/table.rs

Lines 64 to 66 in bcaab6f

pub fn decode_block_meta(mut buf: &[u8]) -> Result<Vec<BlockMeta>> {
    let mut block_meta = Vec::new();
    let num = buf.get_u32() as usize;

According to the commit history, it seems that decode_block_meta in mini-lsm was later modified, causing the function signature to change. I found many similar situations while reading the code.

Does this need to be fixed? If so, let me know and I can help fix them.

no such command: `nextest`

I got an error when using cargo x scheck to check the style and run all test cases according to Tutorial:

cargo fmt
cargo check
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
cargo nextest run
error: no such command: `nextest`

        Did you mean `test`?

        View all installed commands with `cargo --list`
Error: command ["cargo", "nextest", "run"] exited with code 101

After I replaced nextest with test in test(), only the style check works; no test cases run.

Could you please offer a Chinese version tutorial?

Nextest required but not included

Just a heads up: I followed the environment setup steps and was getting:

cargo nextest run
error: no such command: `nextest`

	Did you mean `test`?

	View all installed commands with `cargo --list`
	Find a package to install `nextest` with `cargo search cargo-nextest`
Error: command ["cargo", "nextest", "run"] exited with code 101

It looks like I had to cargo install cargo-nextest to get things working. Seems like this should be included in the environment setup docs.

[Bug] compaction-simulator-ref leveled display bug

The SST IDs inserted by the compaction plan for L4 should be the result of merging 1 and 2, namely 3 and 4.
However, after compaction, the displayed SST IDs remain the original 1 and 2.

Leveled compaction crashes when recovering from manifest

The LeveledCompactionController::apply_compaction_result method tries to sort the merged SSTs by key inside the method. This causes a problem when the method is called during manifest recovery, because at that point no actual SSTs are loaded.

LsmStorage::scan does not include level 1-6 iterators

I noticed that the reference scan function in https://github.com/skyzh/mini-lsm/blob/main/mini-lsm/src/lsm_storage.rs doesn't include l1-l6 iterators. It stops at merging the memtables and the l0_sstables. Wondering if this is a mistake or there's a reason behind doing so. I would suppose scan would return an iterator over all the data in the storage?

Why does HeapWrapper use Box

mini-lsm/mini-lsm/src/iterators/merge_iterator.rs

Line 9 in 130b47b

struct HeapWrapper<I: StorageIterator>(pub usize, pub Box<I>);

I noticed that the Box here seems to be unnecessary.
struct HeapWrapper<I: StorageIterator>(pub usize, pub I) passes all tests. (I just deleted all the Box::new() calls.)

RLock or WLock when we do a put/deletion?

The docs say that we should take a write lock when we do a put/deletion:

extra-tasks

But the reference solution actually takes a read lock:

line109 and line119

It seems that a read lock is enough, because MemTable is safe for concurrent access?
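The sketch below models the situation with standard-library types to show why a shared lock can be enough for a put. It rests on assumptions: the state lock only protects which memtable is current, and the memtable handles its own synchronization (a Mutex here as a stand-in for a concurrent map). All type and method names are hypothetical.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex, RwLock};

/// Stand-in for a concurrent memtable: writes go through `&self`.
#[derive(Default)]
struct MemTable {
    map: Mutex<BTreeMap<String, String>>,
}

#[derive(Default)]
struct State {
    memtable: Arc<MemTable>,
    imm_memtables: Vec<Arc<MemTable>>,
}

#[derive(Default)]
struct Storage {
    state: RwLock<Arc<State>>,
}

impl Storage {
    /// A put does not replace the state; it only writes into the memtable
    /// the state points to, so the shared (read) lock suffices.
    fn put(&self, key: &str, value: &str) {
        let guard = self.state.read().unwrap();
        let mut map = guard.memtable.map.lock().unwrap();
        map.insert(key.to_string(), value.to_string());
    }

    /// Freezing swaps in a new state, so it takes the exclusive lock and
    /// waits for in-flight puts holding the read lock to finish.
    fn freeze(&self) {
        let mut guard = self.state.write().unwrap();
        let mut imm = guard.imm_memtables.clone();
        imm.insert(0, Arc::clone(&guard.memtable));
        *guard = Arc::new(State {
            memtable: Arc::new(MemTable::default()),
            imm_memtables: imm,
        });
    }

    fn memtable_len(&self) -> usize {
        self.state.read().unwrap().memtable.map.lock().unwrap().len()
    }
}
```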

Scheme fork of mini-lsm

I am working on mini-lsm in Scheme. It supports:

bigger than volatile memory;
in-range key count
in-range byte count;

Open for cooperation.

Hello, when is the next update plan for the tutorial?

question about `MergeIterator.next()`

mini-lsm/mini-lsm/src/iterators/merge_iterator.rs

Lines 129 to 134 in 9e1c0ca

// Otherwise, compare with heap top and swap if necessary.
if let Some(mut inner_iter) = self.iters.peek_mut() {
    if *current < *inner_iter {
        std::mem::swap(&mut *inner_iter, current);
    }
}

I'm not familiar with Rust, so I do not clearly understand the meaning of std::mem::swap(&mut *inner_iter, current) here in the MergeIterator.next implementation.

I guess it means heap.pop() first, and then heap.push(current)? And why do we have to use the swap function here?
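A small standalone sketch of the same pattern may help. It uses `BinaryHeap<Reverse<u32>>` so that the smallest number sits on top, as an assumed analogue of the iterator ordering, and `swap_with_top_if_needed` is a hypothetical helper. `peek_mut` returns a guard that gives mutable access to the top element; `std::mem::swap` exchanges the two values in place, and when the guard is dropped the heap restores its ordering.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// If the heap's top should come before `current`, exchange the two values.
/// The effect is the same as pop-then-push, but the top slot is reused and
/// the heap is re-ordered once, when the guard goes out of scope.
fn swap_with_top_if_needed(heap: &mut BinaryHeap<Reverse<u32>>, current: &mut Reverse<u32>) {
    if let Some(mut top) = heap.peek_mut() {
        if *current < *top {
            // `&mut *top` borrows the element behind the guard; `current`
            // is already a mutable reference.
            std::mem::swap(&mut *top, current);
        }
    } // `top` is dropped here and the heap restores its ordering
}
```

So the guess in the question is essentially right; swap is used because it moves both values without cloning and without a separate pop and push.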

[Doc] Any plans to support chinese version tutorial

Hi, I think this is a very good repository for understanding LSM and Rust. But the tutorial does not seem to have a Chinese version, which makes it difficult for developers who are not native English speakers to understand.

Are there any plans to support a Chinese version of the tutorial? If not, I think I can help with it.

[Bug] An error is reported when the source code is compiled

When I compiled the source code, it said that the Rust version was too low. This is probably because I don't know Rust and have just started learning.

This bug looks as if it is a dependency issue
