Giter Club home page Giter Club logo

mini-lsm's Introduction

banner

LSM in a Week

CI (main)

Build a simple key-value storage engine in a week! And extend your LSM engine on the second + third week.

The Mini-LSM book is available at https://skyzh.github.io/mini-lsm. You may follow this guide and implement the Mini-LSM storage engine. We have 3 weeks (parts) of the tutorial, each of them consists of 7 days (chapters).

Community

You may join skyzh's Discord server and study with the mini-lsm community.

Join skyzh's Discord Server

Add Your Solution

If you finished at least one full week of this tutorial, you can add your solution to the community solution list at SOLUTIONS.md. You can submit a pull request and we might do a quick review of your code in return of your hard work.

Development

For Students

You should modify code in mini-lsm-starter directory.

cargo x install-tools
cargo x copy-test --week 1 --day 1
cargo x scheck
cargo run --bin mini-lsm-cli
cargo run --bin compaction-simulator

For Course Developers

You should modify mini-lsm and mini-lsm-mvcc

cargo x install-tools
cargo x check
cargo x book

If you changed public API in the reference solution, you might also need to synchronize it to the starter crate. To do this, use cargo x sync.

Code Structure

  • mini-lsm: the final solution code for <= week 2
  • mini-lsm-mvcc: the final solution code for week 3 MVCC
  • mini-lsm-starter: the starter code
  • mini-lsm-book: the tutorial

We have another repo mini-lsm-solution-checkpoint at https://github.com/skyzh/mini-lsm-solution-checkpoint. In this repo, each commit corresponds to a chapter in the tutorial. We will not update the solution checkpoint very often.

Demo

You can run the reference solution by yourself to gain an overview of the system before you start.

cargo run --bin mini-lsm-cli-ref
cargo run --bin mini-lsm-cli-mvcc-ref

And we have a compaction simulator to experiment with your compaction algorithm implementation,

cargo run --bin compaction-simulator-ref
cargo run --bin compaction-simulator-mvcc-ref

Tutorial Structure

We have 3 weeks + 1 extra week (in progress) for this tutorial.

  • Week 1: Storage Format + Engine Skeleton
  • Week 2: Compaction and Persistence
  • Week 3: Multi-Version Concurrency Control
  • The Extra Week / Rest of Your Life: Optimizations (unlikely to be available in 2024...)

Tutorial Roadmap

Week + Chapter Topic
1.1 Memtable
1.2 Merge Iterator
1.3 Block
1.4 Sorted String Table (SST)
1.5 Read Path
1.6 Write Path
1.7 SST Optimizations: Prefix Key Encoding + Bloom Filters
2.1 Compaction Implementation
2.2 Simple Compaction Strategy (Traditional Leveled Compaction)
2.3 Tiered Compaction Strategy (RocksDB Universal Compaction)
2.4 Leveled Compaction Strategy (RocksDB Leveled Compaction)
2.5 Manifest
2.6 Write-Ahead Log (WAL)
2.7 Batch Write and Checksums
3.1 Timestamp Key Encoding
3.2 Snapshot Read - Memtables and Timestamps
3.3 Snapshot Read - Transaction API
3.4 Watermark and Garbage Collection
3.5 Transactions and Optimistic Concurrency Control
3.6 Serializable Snapshot Isolation
3.7 Compaction Filters

License

The Mini-LSM starter code and solution are under Apache 2.0 license. The author reserves the full copyright of the tutorial materials (markdown files and figures).

mini-lsm's People

Contributors

ben1009 avatar caicancai avatar cypppper avatar dimbtp avatar el-even-11 avatar felixonmars avatar husharp avatar lacop avatar leiysky avatar letterbeezps avatar mahinshaw avatar pinelliac avatar ppdogg avatar redixhumayun avatar skyzh avatar ss18 avatar stopire avatar xiaguan avatar xxchan avatar xzhseh avatar yangchenye323 avatar yongxin-hu avatar yyin-dev avatar zeng1998 avatar zxch3n avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mini-lsm's Issues

no such command: `nextest`

I got an error when using cargo x scheck to check the style and run all test cases according to Tutorial:

cargo fmt
cargo check
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
cargo nextest run
error: no such command: `nextest`

        Did you mean `test`?

        View all installed commands with `cargo --list`
Error: command ["cargo", "nextest", "run"] exited with code 101

after I replace nextest with test in test(), only style checking works, no test cases run.

Feedback after coding day 1

Hey,
Discovered this project recently and I find it super interesting, I had a very superficial understanding of LSM engines having worked a little with Cassandra, but I never had the time and need to dig in and understand how it works in details. This tutorial sparked the motivation to do so and so far I find it really interesting! Thanks for taking your time to build such interesting learning resources.

I just finished the day 1 part for now but I feel that I have a bit of feedback to give:

  • I took me some time to understand that it's normal and expected that the BlockBuilder::add method is not responsible for sorting the entries of the block, but that instead the caller is responsible to insert entries in sorted order. After thinking quite a bit about it it finally made sense because blocks (and SSTables) are crafted from the in-memory tables and we therefore have all the data available when building a block, but it was not clear in the text and I spent some time wondering why we were implementing a bisect search in the block iterator while we didn't enforce any sorting in the block building.
  • Some things doesn't feel really "rusty". I'm thinking about the iterator interface you designed for BlockIterator, which looks really different from iterators we are used to have in rust. In particular, the idea of "invalidating" the iterator by using empty vecs as key and value when we consume all the iterator, looks really different from the pattern of the next method returning an Option<T>, with None marking end of iteration. This is not bad per se, this custom iterator having different requirements, but it's a bit surprising.
  • I think your unit tests could have more coverage, and also be split into smaller tests that do test a single property. For instance, when in the tutorial you specify a property like "Key length and value length are 2B, which means their maximum length is 65536.", I'm expecting to see a unit test that will cover this property and make sure it is enforced. I did some of this in my day 1 work in my fork if you wanna check it out (LeBoucEtMistere#1)
  • I think the tutorial steps could provide clearer entry points in code, i.e. "to solve this task, you will need to implement function in XXX".
  • I noticed you liberally use as conversions in your solution, they tend to be risky especially when you convert to a more restrictive type, e.g. usize to u16. In that case it's best to use try_into to handle the case in which the conversion fails. That's not a big deal but since that's a tutorial it's always nice to leave a note about this to teach people about the risks of as conversions :)
  • I never used the bytes crate before so I had to spend some time reading the doc and understanding how it works by looking at the solution code. I would have loved it if there was a note in the tutorial that mentioned it would be wise to leverage this crate to build the solution (because it is not mentioned anywhere except in the solution so I suspect many people might not know it's available to help), and maybe also giving a very quick example in the tutorial of how it can help playing with bytes.
  • In the part about the encoding of blocks, you give great schemas of the format, but I would also have loved seeing a concrete example to make it clearer, with actual example data.
  • I would have loved seeing some pointers to crates/algorithms in the extra tasks. I'm not very familiar with algos for compression/checksums, especially in the context of LSM engines, and it felt quite hard to explore this topic without a few pointers.

These are just a bunch of suggestions from my personal experience but I hope it helps you make your tutorial even better. Thanks for the great work, looking forward to working on day 2 tasks :D 🚀 !

failed to load manifest for workspace member

When I run the command cargo x install-tools, I found the following error:

$ cargo x install-tools

error: failed to load manifest for workspace member `/Users/wjjiang/rustproject/mini-lsm/mini-lsm`

Caused by:
  failed to parse manifest at `/Users/wjjiang/rustproject/mini-lsm/mini-lsm/Cargo.toml`

Caused by:
  invalid type: map, expected a string for key `package.edition`

I don't know how to fix it, could someone help?

question about `MergeIterator.next()`

// Otherwise, compare with heap top and swap if necessary.
if let Some(mut inner_iter) = self.iters.peek_mut() {
if *current < *inner_iter {
std::mem::swap(&mut *inner_iter, current);
}
}

I'm not familiar with rust, so I do not clearly understand the meaning of std::mem::swap(&mut *inner_iter, current) here in MergeIterator.next impl.

I guess it maybe means that heap.pop() first, and then heap.push(current)? And why we have to use swap function here?

Why remove the `memtable` from `imm_memtables` in `LsmStorgae.sync()`?

// Add the flushed L0 table to the list.
{
let mut guard = self.inner.write();
let mut snapshot = guard.as_ref().clone();
// Remove the memtable from the immutable memtables.
snapshot.imm_memtables.pop();
// Add L0 table
snapshot.l0_sstables.push(sst);
// Update SST ID
snapshot.next_sst_id += 1;
// Update the snapshot.
*guard = Arc::new(snapshot);
}

In sync() we push the memtable to imm_memtables first, but remove it during flush to ssTable. What's the reason for removing memtable from imm_memtables? If we do not remove it, we can get key-value in imm_memtables which is located in memory.

And what is the role of imm_memtables in LsmStorage? It's unclear to me.

Scheme fork of mini-lsm

I am working on mini-lsm in Scheme. It support:

  • bigger than volatile memory;
  • in-range key count
  • in-range byte count;

Open for cooperation.

Leveled compaction crashes when recovering from manifest

The LeveledCompactionController::apply_compaction_result method tries to sort the merged SSTs by key inside the method, which will cause problem when this method is called in manifest recovery context, because at that point no actual SSTs are loaded.

[Doc] Any plans to support chinese version tutorial

Hi, I think this is a very good repository for understanding LSM and Rust. But I found that the tutorial does not seem to have a Chinese version, which seems to be difficult for non-English native developers to understand.

Do we have any plans to support chinese version tutorial? If not, I think I can help to support it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.