
parity-db's Introduction

A database for the blockchain.

ParityDb is an embedded persistent key-value store optimized for blockchain applications.

Design considerations

  • The database is intended to be used for efficiently storing blockchain state encoded into the Patricia-Merkle trie, which means most keys are fixed-size and uniformly distributed. Most values are small, and values over 16 kbytes are rare. Trie nodes may be shared by multiple tries and/or branches, therefore reference counting is required.
  • Read performance is more important than write performance for blockchain transaction throughput. Writes are typically performed in large batches, when the new block is imported. There's usually some time between subsequent writes when the blockchain client is idle or executes the block.

API

The database is a universal key-value storage that supports transactions. The API allows the data to be partitioned into columns. It is recommended that each column contain entries corresponding to a single data type, e.g. state trie nodes, block headers, blockchain transactions, etc. Two types of column indexes are supported: Hash and Btree.
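
As a minimal sketch of how columns are configured, based only on the API calls that appear in the issues further down this page (the path and column layout are made up for illustration):

use parity_db::{Db, Options};

fn main() {
    // Two columns: column 0 keeps the default hash index,
    // column 1 is switched to a Btree index for ordered access.
    let path = std::path::Path::new("/tmp/parity-db-example");
    let mut options = Options::with_columns(path, 2);
    options.columns[1].btree_index = true;

    let _db = Db::open_or_create(&options).unwrap();
}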

Transactions

The database supports multiple concurrent readers. All writes are serialized. Writes are performed in batches, also known as transactions. Transactions are applied atomically: either all of the transaction data is written, or none of it. Queries can't retrieve partially committed data.
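
A minimal sketch of a single atomic commit, again using only the calls shown elsewhere on this page (db.commit, db.get); the keys, values and path are made up:

use parity_db::{Db, Options};

fn main() {
    let path = std::path::Path::new("/tmp/parity-db-tx-example");
    let db = Db::open_or_create(&Options::with_columns(path, 1)).unwrap();

    // One transaction: (column, key, Some(value)) inserts or updates,
    // (column, key, None) deletes. All three operations apply atomically.
    db.commit(vec![
        (0, b"alpha".to_vec(), Some(b"1".to_vec())),
        (0, b"beta".to_vec(), Some(b"2".to_vec())),
        (0, b"gamma".to_vec(), None),
    ])
    .unwrap();

    // Readers never observe a partially committed transaction.
    assert_eq!(db.get(0, b"alpha").unwrap(), Some(b"1".to_vec()));
}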

No cache

The database does not implement any custom data caching. Instead, it relies on the OS page cache. The performance of a large database therefore depends on how much system memory is available to the OS page cache.

Durability

The database is restored to a consistent state if I/O is interrupted at any point. Note that this state might not be the latest state before the interruption: writes that were committed but not yet saved to disk by the background threads are lost.

Implementation details

Data structure

Each column stores data in a set of 256 value tables: 255 tables contain entries within a certain size range, up to a 32 kbyte limit, and the last (256th) table stores entries over 32 kbytes, split into multiple parts. Hash columns also include a hash index file.

Metadata

The metadata file contains the database definition. This includes the set of columns, with the configuration specified for each column.

Hash index

The hash index is an mmap-backed, dynamically sized probing hash table. For each key the index computes a uniformly distributed 256-bit hash k. For an index of size n, the first n bits of k map to one of the 512-byte index pages. Each page is an unordered list of 64 8-byte entries. Each 8-byte entry contains a value address and some additional bits of k. An empty entry is denoted with a zero value. An empty database starts with n = 16. A value address includes an 8-bit value table index and the index of an entry in that table. The first 16 kbytes of each index file are used to store statistics for the column.
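
A simplified, self-contained sketch of the page addressing described above (illustrative only; this is not the parity-db code, just the geometry implied by the numbers in this paragraph):

const PAGE_SIZE: usize = 512; // bytes per index page
const ENTRY_SIZE: usize = 8; // bytes per index entry
const ENTRIES_PER_PAGE: usize = PAGE_SIZE / ENTRY_SIZE; // 64

/// For an index of size n (n index bits), the first n bits of the
/// uniformly distributed 256-bit hash k select the page.
fn page_for_hash(k: &[u8; 32], index_bits: u32) -> u64 {
    // Interpret the first 8 bytes of k as a big-endian number and keep
    // only the top `index_bits` bits.
    let prefix = u64::from_be_bytes(k[..8].try_into().unwrap());
    prefix >> (64 - index_bits)
}

fn main() {
    let k = [0xABu8; 32];
    // An empty database starts with n = 16, i.e. 2^16 pages of 64 entries.
    println!("page = {}, entries per page = {}", page_for_hash(&k, 16), ENTRIES_PER_PAGE);
}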

Value tables

A value table is a linear array of fixed-size entries that can grow as necessary. Each entry may contain one of the following (a rough sketch follows the list):

  • A filled entry, containing 240 bits of k, a 15-bit data value size, a compression flag, an optional reference counter, and the actual value.
  • A tombstone entry, containing the index of the previous tombstone, forming a linked list of empty entries.
  • A multipart entry, much like a filled entry, but additionally holding the address of the next entry that holds the continuation of the data.
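
The sketch referenced above, purely illustrative: the real on-disk format is a packed fixed-size record, not a Rust enum, and the field widths follow the text above.

#[allow(dead_code)]
enum ValueTableEntry {
    /// Filled: partial key bits, value size, compression flag,
    /// optional reference counter, and the value itself.
    Filled {
        partial_key: [u8; 30], // 240 bits of k
        size: u16,             // 15-bit value size
        compressed: bool,
        ref_count: Option<u32>,
        value: Vec<u8>,
    },
    /// Tombstone: links to the previous tombstone, forming a free list.
    Tombstone { previous_tombstone: u64 },
    /// Multipart: like Filled, plus the address of the entry holding
    /// the continuation of the data.
    Multipart {
        partial_key: [u8; 30],
        size: u16,
        compressed: bool,
        ref_count: Option<u32>,
        value_part: Vec<u8>,
        next_entry: u64,
    },
}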

Hash index operations

Hash index lookup

Compute k and find the index page using its first n bits. Search the page for an entry with matching key bits. Use the address in the entry to query the partial k and the value from a value table. Confirm that k matches the expected value.
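
A toy in-memory model of these lookup steps (illustrative only; the bit widths and types are simplified and are not the parity-db internals):

use std::collections::HashMap;

#[derive(Clone)]
struct ToyEntry {
    key_bits: u16, // "some additional bits of k" stored in the index entry
    address: u64,  // value table address
}

struct ToyIndex {
    index_bits: u32,                                // n
    pages: Vec<Vec<ToyEntry>>,                      // unordered entries per page
    value_table: HashMap<u64, ([u8; 32], Vec<u8>)>, // address -> (stored k, value)
}

impl ToyIndex {
    fn lookup(&self, k: [u8; 32]) -> Option<&Vec<u8>> {
        let prefix = u64::from_be_bytes(k[..8].try_into().unwrap());
        // 1. The first n bits of k select the page.
        let page = &self.pages[(prefix >> (64 - self.index_bits)) as usize];
        // 2. Compare the next bits of k against the bits stored in each entry.
        let key_bits = (prefix >> (64 - self.index_bits - 16)) as u16;
        for entry in page {
            if entry.key_bits == key_bits {
                // 3. Follow the address into the value table...
                if let Some((stored_k, value)) = self.value_table.get(&entry.address) {
                    // 4. ...and confirm the full k to rule out a partial-bit collision.
                    if *stored_k == k {
                        return Some(value);
                    }
                }
            }
        }
        None
    }
}

fn main() {
    let k = [7u8; 32];
    let n = 16u32;
    let prefix = u64::from_be_bytes(k[..8].try_into().unwrap());
    let mut index =
        ToyIndex { index_bits: n, pages: vec![Vec::new(); 1usize << n], value_table: HashMap::new() };
    // Store one value and index it under the page/bits derived from k.
    index.value_table.insert(42, (k, b"hello".to_vec()));
    index.pages[(prefix >> (64 - n)) as usize]
        .push(ToyEntry { key_bits: (prefix >> (64 - n - 16)) as u16, address: 42 });
    assert_eq!(index.lookup(k), Some(&b"hello".to_vec()));
}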

Hash index insertion

If an insertion is attempted into a full index page, a reindex is triggered. With a page size of 64 index entries, a reindex is triggered once the load factor reaches about 0.52.

Reindex

When a collision can't be resolved, a new index table is created with twice the capacity. Insertion continues immediately into the new table. A background process is started that moves entries from the old table to the new one. All queries during that process check both tables.
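
A small illustrative model of the trigger and the capacity doubling (again, not the actual implementation; the constants follow the two sections above):

const ENTRIES_PER_PAGE: u64 = 64;
const REINDEX_LOAD_FACTOR: f64 = 0.52;

struct ToyIndexTable {
    index_bits: u32, // the table has 2^index_bits pages
    occupied: u64,   // total number of occupied entries
}

impl ToyIndexTable {
    fn capacity(&self) -> u64 {
        (1u64 << self.index_bits) * ENTRIES_PER_PAGE
    }

    /// A reindex is triggered by an insertion into a full page, which in
    /// practice happens once the overall load factor reaches about 0.52.
    fn needs_reindex(&self, target_page_is_full: bool) -> bool {
        target_page_is_full
            || self.occupied as f64 / self.capacity() as f64 >= REINDEX_LOAD_FACTOR
    }

    /// Reindexing doubles the capacity; a background process then migrates
    /// entries while queries check both the old and the new table.
    fn reindexed(&self) -> ToyIndexTable {
        ToyIndexTable { index_bits: self.index_bits + 1, occupied: self.occupied }
    }
}

fn main() {
    let table = ToyIndexTable { index_bits: 16, occupied: 2_200_000 };
    // 2^16 pages * 64 entries = 4_194_304 slots; 2.2M / 4.19M is about 0.52.
    assert!(table.needs_reindex(false));
    assert_eq!(table.reindexed().capacity(), 2 * table.capacity());
}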

BTree index operations

TODO

Transaction pipeline

On commit, all data is moved to an in-memory overlay, making it available for queries. That data is then added to the commit queue. This allows the commit function to return as early as possible. The commit queue is processed by a commit worker that collects data that would be modified in the index or value tables and writes it to a binary log file as a sequence of commands. All modified index and value table pages are placed in the in-memory overlay. The file is then handed to another background thread that flushes it to disk and adds it to the finalization queue. Finally, another thread handles the finalization queue: it reads the binary log file and applies all changes to the tables, clearing the page overlay.
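
The stages, summarized as an illustrative outline (the names here are descriptive only, except CommitOverlay, which also appears as a pipeline stage name in one of the tests further down this page):

#[allow(dead_code)]
enum CommitStage {
    /// commit() returns once the data is in the in-memory commit overlay,
    /// where it is already visible to readers.
    CommitOverlay,
    /// A commit worker turns the queued commit into index/value table
    /// changes, writes them to the binary log file as a sequence of
    /// commands, and keeps the modified pages in an in-memory page overlay.
    LogWrite,
    /// Another background thread flushes the log file to disk and puts it
    /// on the finalization queue.
    LogFlush,
    /// A final thread reads the log back, applies the changes to the index
    /// and value tables, and clears the page overlay.
    Enact,
}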

On startup if any log files exist, they are validated for corruption and enacted upon the tables.

License

ParityDb is primarily distributed under the terms of both the MIT license and the APACHE license (Version 2.0), at your choice.

See LICENSE-APACHE and LICENSE-MIT for details.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in ParityDb by you, as defined in the APACHE 2.0 license, shall be dual licensed as above, without any additional terms or conditions.

parity-db's People

Contributors

adoerr, arkpar, b0xtch, cheme, davxy, debris, dvc94ch, gilescope, gnunicorn, hanabi1224, heiher, hirschenberger, i1i1, kinosang, koushiro, lesnyrumcajs, matthalpinparity, nazar-pc, nbaztec, nuke-web3, ordian, sandreim, shamilsan, tpt


parity-db's Issues

DB corrupted on column config partial writes

When writing only a part of a column configuration, the DB gets corrupted:

To reproduce with #113:

	#[test]
	fn test_from_fuzzer() {
		crate::set_number_of_allowed_io_operations(4);
		let tmp = tempdir().unwrap();
		let mut options = Options::with_columns(tmp.path(), 1);
		options.columns[0].btree_index = true;
		assert!(Db::open_or_create(&options).is_err()); //It's failing because of I/O limit
		Db::open_or_create(&options).unwrap(); //It's going to fail again with 'Column config mismatch'
	}

Debug assertion failure during btree forward iteration

WIP issue (this bug is random and hard to reproduce consistently)

During fuzzing, the assertion *sep < ORDER here fails in an inconsistent manner during full forward iteration.

It is easy to reproduce when running the simple_model fuzzers on top of #123.

Here is an example of workload where it happens sometimes:
        (
            Config {
                btree_index: true,
                compression: NoCompression,
                number_of_allowed_io_operations: 0,
            },
            [
                Restart,
                Restart,
                Transaction([(59, Some(35)), (47, Some(63)), (255, None)]),
                Transaction([
                    (93, Some(127)),
                    (1, Some(47)),
                    (0, Some(5)),
                    (115, Some(115)),
                    (115, Some(178)),
                    (50, Some(31)),
                    (6, Some(53)),
                ]),
                Transaction([
                    (5, Some(255)),
                    (223, Some(47)),
                    (37, Some(51)),
                    (107, Some(47)),
                    (255, Some(255)),
                    (247, Some(35)),
                ]),
                Transaction([(0, None)]),
                Transaction([]),
                Transaction([]),
                Transaction([]),
                Transaction([]),
                Transaction([]),
            ],
        )

@cheme Does this ring a bell to you? Do you want me to investigate it further?

Unexpected write amplification

Hi,

We are merging polkadot 0.9.37 into Khala network and testing if we can ship paritydb as the default database.

We have found some unexpected write amplification that would kill the SSD.

During the test, I bought 2 brand new HP FX900 Pro 4T SSDs and used them to sync Kusama and Khala from the beginning. I set up a RAID1 ZFS pool, and the node VM used xfs as the FS. I noticed that the substrate node had written over 50T within 24 hours.


Heavy, intolerable fragmentation also happened: it took only 24 hours to bring ZFS fragmentation to 59% on a brand new SSD.


Should I provide any information to help debug this?

Question: how to retrieve the raw key during DB iteration

Hello, I have a scenario that requires iterating the database and retrieving the original key, but I could only find the hashed key in an IterState. My questions are:

  • Is it possible to retrieve the raw key during iteration? If so, how?
  • DB iteration seems to be slower than I would expect: ~5 hours on a 200GB db with ~100M entries. Is that expected? Is there any trick to improve it?

Looking forward to your reply, thanks in advance!

Make `find_entry_sse2` optional behind a feature

Hello, I am constantly getting lookup failures with keys that are supposed to exist in the DB with v0.4.4; after downgrading to v0.4.3 the issue is gone. Unfortunately I'm not able to make a minimal repro because the DB I use is ~200GB with 100M+ records, but I strongly suspect there's a bug in find_entry_sse2. Is there any chance to put it behind a cargo feature so that I can opt out without disabling the sse2 CPU feature? Thanks

Optional reference counting

Allow a column to be configured for reference counting. Re-insertion increments reference counter. Deletion decrements and removes if it reaches zero.
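
A sketch of the intended semantics, using only the commit/get calls and the ref_counted option that appear elsewhere on this page; whether a given release already behaves exactly like this is not guaranteed by this issue, and the path and keys are made up:

use parity_db::{Db, Options};

fn main() {
    let path = std::path::Path::new("/tmp/parity-db-rc-example");
    let mut options = Options::with_columns(path, 1);
    options.columns[0].ref_counted = true;

    let db = Db::open_or_create(&options).unwrap();

    // Insert the same key twice: the second insertion should only bump the
    // reference counter.
    db.commit(vec![(0, b"node".to_vec(), Some(b"data".to_vec()))]).unwrap();
    db.commit(vec![(0, b"node".to_vec(), Some(b"data".to_vec()))]).unwrap();

    // The first deletion should only decrement the counter...
    db.commit(vec![(0, b"node".to_vec(), None)]).unwrap();
    println!("after 1st delete: {:?}", db.get(0, b"node").unwrap());

    // ...and the second one should drop it to zero and remove the value.
    db.commit(vec![(0, b"node".to_vec(), None)]).unwrap();
    println!("after 2nd delete: {:?}", db.get(0, b"node").unwrap());
}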

Iteration behavior changes depending on the log worker thread work

In the following test, the presence or absence of the db.process_commits call changes the iteration behavior of the last iter.prev() call.

If db.process_commits is never called, the last iter.prev() returns the key [0] as expected; if db.process_commits is called before iterating, it returns None.

This test does both forward and backward iteration; I don't know if we want to support both on the same iterator...

	#[cfg(feature = "instrumentation")]
	#[test]
	fn test_process_commit_in_the_middle_of_iter() {
		let tmp = tempdir().unwrap();
		let mut options = Options::with_columns(tmp.path(), 1);
		options.columns[0].btree_index = true;
		options.always_flush = true;
		options.with_background_thread = false;

		let db = Db::open_or_create(&options).unwrap();
		db.commit::<_, Vec<u8>>(vec![(0, vec![0], Some(vec![0]))]).unwrap();
		db.process_commits().unwrap();
		let mut iter = db.iter(0).unwrap();
		iter.seek_to_last().unwrap();
		assert_eq!(iter.prev().unwrap(), Some((vec![0], vec![0]))); // State = 0
		assert_eq!(iter.next().unwrap(), None); // State = End
		assert_eq!(iter.prev().unwrap(), Some((vec![0], vec![0]))); // Fails because it is None; works properly if process_commits is not called
	}

Changelog 2022-10-24: updated the bug description; the bug happens even if process_commits is called before iterating.

Migrate function only works for `uniform` true.

I just realized that the migrate function reads all hash(key) and value pairs and writes them to the target column.
This means that it currently only works for uniform = true.
So when migrating, the destination column should be opened with uniform = true for the migration, and afterwards the right configuration should be restored.

This is something good to have, as migrating with parity-db-admin between two identical configurations allows reclaiming space in a simple way.

crash_DB

In the event of an emergency shutdown of the virtual disk on a home computer (when the power goes out), in most cases the database crashes (the error is attached). I understand that running a node on a home computer does not meet the requirements of a data center with uninterruptible power, but I am sending this report for statistics:
https://ibb.co/yFhF8gq

VMware 14.1.4
Ubuntu 18.04 LTS
2 CPU
3 GB RAM
80 GB HDD

Meta Question: How does design consideration translate to implementation details?

The README in this repo does a great job laying out some of the design considerations for this project, and why these decisions are good for the blockchain.

For Example:

  • Small fixed sized keys
  • Uniform distribution
  • Very few large keys
  • Reading more than writing

This repo also does a great job at listing some of the implementation details.

However, not being familiar with database development, how can I better understand which implementation details follow from which design considerations?

For example the section Data structure notes a unique construction of database tables. Does this help with the fact that most values are small? Or just that values are uniformly distributed?

Not asking for a full explanation of everything, but maybe an off-the-top-of-the-head mapping of considerations to implementation details.

ParityDB Questions

Hey folks, a few questions which I realize may be sensitive / too detailed, so I understand if there's no good answer:

  1. Is there a design document for ParityDB?
  2. Have you compared it with other DB choices, like this? https://github.com/ledgerwatch/erigon/wiki/Choice-of-storage-engine
  3. How should users think of stability/feature-completeness over time? Where is the project today / where do you expect it to be in 3/6/9/12 months?

Panic after partial commit processing

If I/O fails in the middle of commit processing, it seems possible to end up with a corrupted database.

Here is a test to reproduce the bug (with the instrumentation feature enabled):

	#[test]
	fn test_from_fuzzer() {
		crate::set_number_of_allowed_io_operations(20);
		let tmp = tempdir().unwrap();
		let mut options = Options::with_columns(tmp.path(), 1);
		options.columns[0].btree_index = true;
		let inner_options = InternalOptions {
			create: true,
			commit_stages: EnableCommitPipelineStages::CommitOverlay,
			..Default::default()
		};
		let db = Db::open_inner(&options, &inner_options).unwrap();
		db.commit::<_, Vec<u8>>(vec![(0, vec![250], Some(vec![0]))]).unwrap();
		db.inner.flush_logs(0).unwrap();
		db.commit::<_, Vec<u8>>(vec![(0, vec![250], None)]).unwrap();
		assert!(db.inner.process_commits().is_err()); // Fails because of instrumentation
		crate::set_number_of_allowed_io_operations(usize::MAX);
		db.inner.process_commits().unwrap();
		drop(db);
		Db::open_or_create(&options).unwrap(); // Will panic
	}

Log:

[2022-10-04T15:17:34Z DEBUG parity-db] Created value table 00-00
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Inserting into new slot 1
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 1: no_hash
[2022-10-04T15:17:34Z DEBUG parity-db] Opened db Options { path: "/tmp/.tmptW0mVS", columns: [ColumnOptions { preimage: false, uniform: false, ref_counted: false, compression: NoCompression, btree_index: true }], sync_wal: true, sync_data: true, stats: true, salt: None, compression_threshold: {} }, metadata=Metadata { salt: [188, 75, 55, 221, 113, 40, 51, 252, 114, 89, 82, 214, 70, 205, 146, 31, 104, 128, 115, 132, 202, 238, 70, 97, 138, 42, 9, 135, 110, 90, 106, 103], version: 7, columns: [ColumnOptions { preimage: false, uniform: false, ref_counted: false, compression: NoCompression, btree_index: true }] }
[2022-10-04T15:17:34Z DEBUG parity-db] Replay is complete.
[2022-10-04T15:17:34Z DEBUG parity-db] Queued commit 2, 2 bytes
[2022-10-04T15:17:34Z DEBUG parity-db] Queued commit 3, 1 bytes
[2022-10-04T15:17:34Z DEBUG parity-db] Removed 2. Still queued commits 1 bytes
[2022-10-04T15:17:34Z DEBUG parity-db] Processing commit 2, record 1, 2 bytes
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Query slot 1
[2022-10-04T15:17:34Z TRACE parity-db] 0: Inserting new no_hash, size = 1
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Inserting into new slot 2
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 2: no_hash
[2022-10-04T15:17:34Z TRACE parity-db] 0: Inserting new no_hash, size = 26
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Inserting into new slot 3
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 3: no_hash
[2022-10-04T15:17:34Z TRACE parity-db] 0: Replacing no_hash
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 1: no_hash
[2022-10-04T15:17:34Z DEBUG parity-db] Flush: Activated new writer 0
[2022-10-04T15:17:34Z DEBUG parity-db] Removed 1. Still queued commits 0 bytes
[2022-10-04T15:17:34Z DEBUG parity-db] Processing commit 3, record 2, 1 bytes
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Query slot 1
[2022-10-04T15:17:34Z TRACE parity-db] 0: Inserting new no_hash, size = 8
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Inserting into new slot 4
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 4: no_hash
[2022-10-04T15:17:34Z TRACE parity-db] 0: Replacing no_hash
[2022-10-04T15:17:34Z TRACE parity-db] 00-00: Writing slot 1: no_hash
[2022-10-04T15:17:34Z DEBUG parity-db] Finalizing log record 2 (0 index, 3 value)
[2022-10-04T15:17:34Z DEBUG parity-db] Processed commit 3 (record 2), 1 ops, 87 bytes written
[2022-10-04T15:17:34Z DEBUG parity-db] Opened existing log /tmp/.tmptW0mVS/log0, first record_id = 1
[2022-10-04T15:17:34Z DEBUG parity-db] Opened log 0, record 1
[2022-10-04T15:17:34Z DEBUG parity-db] Opened value table 00-00 with 2 entries, entry_size=32
[2022-10-04T15:17:34Z DEBUG parity-db] Opened db Options { path: "/tmp/.tmptW0mVS", columns: [ColumnOptions { preimage: false, uniform: false, ref_counted: false, compression: NoCompression, btree_index: true }], sync_wal: true, sync_data: true, stats: true, salt: None, compression_threshold: {} }, metadata=Metadata { salt: [188, 75, 55, 221, 113, 40, 51, 252, 114, 89, 82, 214, 70, 205, 146, 31, 104, 128, 115, 132, 202, 238, 70, 97, 138, 42, 9, 135, 110, 90, 106, 103], version: 7, columns: [ColumnOptions { preimage: false, uniform: false, ref_counted: false, compression: NoCompression, btree_index: true }] }
[2022-10-04T15:17:34Z DEBUG parity-db] Replay: Activated log reader 0
[2022-10-04T15:17:34Z DEBUG parity-db] Replaying database log 0
[2022-10-04T15:17:34Z DEBUG parity-db] Enacting log 1
thread 'db::tests::test_from_fuzzer' panicked at 'index out of bounds: the len is 1 but the index is 2', src/db.rs:566:49

It seems that during log replay the log reader reads a column number "2" even though there is only a single column. I am not sure what causes the error; the log replay should validate the log using the checksum and not read garbage data. I plan to investigate this further in the next few days.

iter_column_while doesn't return the correct number of elements

Simple test case:

#[test]
fn iter_column_while() {
    for n in 0..10000 {
        let dir = tempfile::TempDir::new().unwrap();
        let options = parity_db::Options::with_columns(dir.path(), 1);
        let db = parity_db::Db::open_or_create(&options).unwrap();

        let keys = vec![[0u8; 32], [1u8; 32], [2u8; 32], [3u8; 32], [4u8; 32]];

        let tx = keys.iter().map(|key| (0, key, Some(key.to_vec())));
        db.commit(tx).unwrap();

        for key in &keys {
            assert!(db.get(0, key).unwrap().is_some());
        }

        let mut keys_count = 0;
        db.iter_column_while(0, |_iter_state| {
            keys_count += 1;

            true
        })
            .unwrap();

        assert_eq!(keys_count, 5, "iteration {}", n);
    }
}

Fails with error like this:

---- object_mappings::tests::iter_column_while stdout ----
---- object_mappings::tests::iter_column_while stdout ----
thread 'object_mappings::tests::iter_column_while' panicked at 'assertion failed: `(left == right)`
  left: `4`,
 right: `5`: iteration 29', crates/subspace-farmer/src/object_mappings/tests.rs:152:9

The number of elements read can vary. For instance, with these 5 keys, hashing disabled, and the salt set to zero, the iterator returns no elements at all:

0061277d94811d9d65f82e9d267ae62a00219382f109cb4ca806ee0f9d6c2eb5
0061277d94811d9d65f82e9d267ae62a00219382f109cb4ca806ee0f9d6c2eb8
0061277d94811d9d65f82e9d267ae62a00219382f109cb4ca806ee0f9d6c2eba
0061277d94811d9d65f82e9d267ae62a00219382f109cb4ca806ee0f9d6c2ebb
0061277d94811d9d65f82e9d267ae62a00219382f109cb4ca806ee0f9d6c2ece

SIMD optimization

IndexTable::find_entry could benefit from SIMD optimizations for parallel entry search.

  • x86_64 (#176)
  • aarch64

Panic when doing duplicated references in the same transaction on a ref-counted btree

The following test:

#[test]
fn test_duplicated_reference_on_ref_counted_btree() {
	let tmp = tempdir().unwrap();
	let mut options = Options::with_columns(tmp.path(), 1);
	options.columns[0].ref_counted = true;
	options.columns[0].btree_index = true;
	let db = Db::open_or_create(&options).unwrap();
	db.commit_changes(vec![(0, Operation::Reference(vec![255])), (0, Operation::Reference(vec![255]))]).unwrap();
}

Fails with a panic:

thread 'db::tests::test_duplicated_reference_' panicked at 'assertion failed: changes[0].key() < changes[1].key()', src/btree/node.rs:76:17

(first test found with fuzzing!)

Warn about database reaching its limits

Print a warning if any of the internal hard limits is at 50% capacity.
This includes:

  • 32-bit value table entry address space
  • fields of the 64-bit index entry

Also detect and complain about index degradation due to a high collision rate.

questions about using parity-db for ipfs

  • Can reads happen from multiple threads? The db is not cloneable. Dumb question, put it in an Arc 🤦

  • Is the maximum value size limited to 16kb?

  • What is a good way to determine the size tiers?

  • Can the reference count be interacted with somehow (ex increment, decrement, query the reference count in a transaction)?

IPFS refcounting requirements are likely sufficiently different that we will disable ref counting and implement our own. A block can reference other blocks. The pin count determines if it needs to be kept around. Once the pin count and the reference count are both zero, the block can be removed, and the reference count of all its children is decremented.

CRC-32 verification of log records

For extra protection against log corruption, put a CRC-32 record checksum at the end of each log record.
This should be checked during log validation on startup.
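
A sketch of the idea using the crc32fast crate (already listed as a dependency in the Cargo.toml quoted near the end of this page); the record framing here is made up for illustration:

fn crc32(data: &[u8]) -> u32 {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(data);
    hasher.finalize()
}

/// Append a little-endian CRC-32 of the record payload at the end.
fn append_checksum(record: &mut Vec<u8>) {
    let crc = crc32(record);
    record.extend_from_slice(&crc.to_le_bytes());
}

/// Validate a record by recomputing the CRC over everything but the
/// trailing 4 bytes and comparing.
fn validate_record(record_with_crc: &[u8]) -> bool {
    if record_with_crc.len() < 4 {
        return false;
    }
    let (payload, stored) = record_with_crc.split_at(record_with_crc.len() - 4);
    let stored = u32::from_le_bytes(stored.try_into().unwrap());
    crc32(payload) == stored
}

fn main() {
    let mut record = b"log record bytes".to_vec();
    append_checksum(&mut record);
    assert!(validate_record(&record));
    record[0] ^= 0xff; // simulate corruption
    assert!(!validate_record(&record));
}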

Consider using direct I/O

Since caching individual trie nodes is inefficient, there may be advantages to bypassing the OS page cache and using our own caches exclusively.

DB corrupted after a partial write in LogChange::flush_to_file

Test from fuzzer:

#[test]
fn test_from_fuzzer() {
	crate::set_number_of_allowed_io_operations(18);
	let tmp = tempdir().unwrap();
	let mut options = Options::with_columns(tmp.path(), 1);
	options.columns[0].btree_index = true;
	let db = Db::open_or_create(&options).unwrap();
	db.commit(vec![(0, vec![249], None)]).unwrap();
	drop(db); // The instrumented failure happens here
	crate::set_number_of_allowed_io_operations(usize::MAX);
	Db::open_or_create(&options).unwrap(); // Fails with Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })
}

If I/O operations fail inside LogChange::flush_to_file, it seems this might lead to a corrupted log and, hence, a corrupted database.

Very slow on spinning disks

Syncing on HDDs is very slow (much slower than RocksDB). This might be a “won’t fix” but if so it should be documented.

Corruption: Missing btree entry at addr 45:61327

Here is one of our users reporting getting "Corruption: Missing btree entry at addr 45:61327" on a ParityDb database. Keys and values are both 8 bytes in one column, plus one 11-byte key with an 8-byte value in a second column.

I suspect the application was killed by systemd due to shutting down slowly, which is the type of interruption ParityDb claims to handle:

Database is restored to consistent state if IO is interrupted at any point.

I don't have much more details at the moment, but I hope this is helpful.

After recovery on a corrupted log, existing values might get overridden after unrelated commit processing

Scenario:

  1. The database is created with reference-counting.
  2. A first commit C1 is done, inserting a key K1; the commit is processed and the log is flushed.
  3. A second commit C2 is done, inserting a key K2, and the commit is processed.
  4. A third commit C3 is done (might be empty), but the I/Os fail during commit processing.
  5. The database is closed and open again. The second log file is invalid and removed.
  6. The database is closed and open a second time. K1 is still present in the DB.
  7. A fourth commit C4 is done and processed. While it is being written, a "Key mismatch at 2" log message is printed.
  8. K1 is not in the database anymore.

As a test:

	#[test]
	fn test_from_fuzzer() {
		let tmp = tempdir().unwrap();
		let mut options = Options::with_columns(tmp.path(), 1);
		options.columns[0].ref_counted = true;
		options.columns[0].preimage = true;
		options.always_flush = true;
		options.with_background_thread = false;

		{
			let db = Db::open_or_create(&options).unwrap();
			db.commit_changes(vec![(0, Operation::Set(vec![59], vec![59]))]).unwrap();
			db.process_commits().unwrap();
			db.flush_logs().unwrap();
			db.commit_changes(vec![(0, Operation::Set(vec![0], vec![0]))]).unwrap();
			db.process_commits().unwrap();
			db.commit_changes(vec![]).unwrap();
			crate::set_number_of_allowed_io_operations(3);
			assert!(db.process_commits().is_err()); // Partial write
			crate::set_number_of_allowed_io_operations(usize::MAX);
		}

		{
			let db = Db::open(&options).unwrap();
			assert!(db.get(0, &[0]).unwrap().is_some()) // Key 0 is here
		}

		{
			let db = Db::open(&options).unwrap();
			assert!(db.get(0, &[0]).unwrap().is_some()); // Key 0 is here
			db.commit_changes(vec![(0, Operation::Set(vec![196], vec![196]))])
			.unwrap();
			db.process_commits().unwrap();
			assert!(db.get(0, &[0]).unwrap().is_none()) // Key 0 has been removed
		}
	}

Before C4 the database content is:

Index key: [22, 1f, f2, 5f, 13, 93, be, df, 22, cc, fd, 8e, 6b, 6b, bf, e1, bf, d6, 3f, 3c, 27, 2a, 9c, 1f, db, 53, be, a3, 18, 15, 32, e0]
        Rc: 1
Value: 3b
Index key: [f3, 71, 9b, 9e, 9f, be, 4, e3, 5f, 25, 5c, d7, a9, 29, c5, 24, d9, 8d, b1, 21, 19, 2f, 64, 6e, 52, c3, 3b, 9b, 1b, d8, 6d, 6f]
        Rc: 1
Value: 00

And after:

for 65536 chunks of column 0
Index key: [22, 1f, f2, 5f, 13, 93, be, df, 22, cc, fd, 8e, 6b, 6b, bf, e1, bf, d6, 3f, 3c, 27, 2a, 9c, 1f, db, 53, be, a3, 18, 15, 32, e0]
        Rc: 1
parity_db::column] Value: 3b
Index key: [a5, fa, ce, bb, 91, 75, b7, d7, c5, 4b, de, f6, 65, e1, a5, 2d, 82, b2, 29, 33, 2c, 60, ab, 29, ab, 57, 8, 15, b2, dc, c6, ed]
        Rc: 1
Value: c4
Index key: [f3, 71, 9b, 9e, 9f, be, b7, d7, c5, 4b, de, f6, 65, e1, a5, 2d, 82, b2, 29, 33, 2c, 60, ab, 29, ab, 57, 8, 15, b2, dc, c6, ed]
        Rc: 1
Value: c4

It seems like the (0, 0) key-value pair somehow got partially overridden by the (196, 196) key-value pair.

`btree_index` in existing database

If I change the configuration for a column from the default (btree_index = false) to btree_index = true and the db already exists, will it automatically index that column when I open the db?

Compilation on Windows fails

   Compiling parity-db v0.1.0
error[E0433]: failed to resolve: could not find `unix` in `os`
  --> C:\Users\Swader\.cargo\registry\src\github.com-1ecc6299db9ec823\parity-db-0.1.0\src\table.rs:18:14
   |
18 | use std::os::unix::fs::FileExt;
   |              ^^^^ could not find `unix` in `os`

error[E0599]: no method named `read_exact_at` found for struct `std::fs::File` in the current scope
   --> C:\Users\Swader\.cargo\registry\src\github.com-1ecc6299db9ec823\parity-db-0.1.0\src\table.rs:124:8
    |
124 |         file.read_exact_at(&mut header, 0)?;
    |              ^^^^^^^^^^^^^ method not found in `std::fs::File`

error[E0599]: no method named `read_exact_at` found for struct `std::fs::File` in the current scope
   --> C:\Users\Swader\.cargo\registry\src\github.com-1ecc6299db9ec823\parity-db-0.1.0\src\table.rs:148:16
    |
148 |         Ok(self.file.read_exact_at(buf, offset)?)
    |                      ^^^^^^^^^^^^^ method not found in `std::fs::File`

error[E0599]: no method named `write_all_at` found for struct `std::fs::File` in the current scope
   --> C:\Users\Swader\.cargo\registry\src\github.com-1ecc6299db9ec823\parity-db-0.1.0\src\table.rs:152:16
    |
152 |         Ok(self.file.write_all_at(buf, offset)?)
    |                      ^^^^^^^^^^^^ method not found in `std::fs::File`

error: aborting due to 4 previous errors

Some errors have detailed explanations: E0433, E0599.
For more information about an error, try `rustc --explain E0433`.
error: could not compile `parity-db`.
warning: build failed, waiting for other jobs to finish...
error: build failed

Address space overflow

I just noticed this in logs:

2022-06-02 03:29:36 05-17: Address space overflow at 21913: addr ff:8389378    
2022-06-02 03:29:36 Started reindex for 05-17    
2022-06-02 03:29:37 Completed reindex 05-17 into 05-18    

This is a warning according to the codebase; is it something to worry about, or normal and expected behavior?

This is with a quickly growing blockchain database (really, like ~3 MiB/s), and it happened when the database grew to ~33.5 GiB.

2nd level index

Introduce a second level hash index. This will be used to store smart contract tries, child tries, etc. The main requirement here is that the sub-index can be dropped in amortized constant time.

API requirements:
The hash column can be configured to use a second-level index. All insert/get commands for such a column accept two keys, K1 and K2, where K1 identifies the index and K2 is the value key in that index. Additionally, there's a drop command that accepts a single key and deletes all values in the index identified by that key.

Implementation ideas are welcome. As far as I see it, the structure could be made similar to the existing index, with all values in a single index. Deletion could be done by a background process, similar to re-indexing. When an index is dropped, it is added to a deletion queue. A background process checks all index entries and deletes any values that belong to the indexes being dropped.
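
A hypothetical shape for the proposed API, purely for discussion; every name below is made up for illustration and nothing like this exists in parity-db today:

type ColId = u8;

/// Hypothetical two-level API: K1 selects the sub-index, K2 is the value
/// key inside that sub-index.
trait SecondLevelIndex {
    fn insert(&self, col: ColId, k1: &[u8], k2: &[u8], value: &[u8]);
    fn get(&self, col: ColId, k1: &[u8], k2: &[u8]) -> Option<Vec<u8>>;
    /// Drops the whole sub-index identified by K1 in amortized constant
    /// time; actual value removal happens in a background process.
    fn drop_index(&self, col: ColId, k1: &[u8]);
}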

Optimize memory usage

Under certain write workloads, when there are a lot of commits with a lot of small value insertions, the memory used by the index overlay can be blown up by a factor of 10 or more compared to the actual data inserted. There's a good example in #93. This is because the overlay keeps the whole index page in memory, even though only a single entry in the page may be modified.
I can think of two ways to fix that:

  1. Don't keep the whole page, but rather only modified entries.

  2. Investigate existing mechanisms in linux to control page flushing for memory mapped files. If it is possible to control when exactly individual pages are flushed to disk, we don't need the overlay in the first place.

In any case, this is only worth fixing if the solution does not introduce any performance hits.

Values do not seem to be persisted on a ref-counted BTree

Here is a failing test I believe should pass:

	#[test]
	fn fuzz_test() {
		let dir = tempdir().unwrap();
		let options = Options {
			path: dir.path().to_owned(),
			columns: vec![ColumnOptions {
				ref_counted: true,
				btree_index: true,
				..ColumnOptions::default()
			}],
			sync_wal: true,
			sync_data: true,
			stats: false,
			salt: None,
		};

		let db = Db::open_or_create(&options).unwrap();
		db.commit_changes(vec![(0, Operation::Set(vec![0], vec![0]))]).unwrap();

		// The value is there before closing and opening again
		assert_eq!(db.get(0, &[0]).unwrap(), Some(vec![0]));

		// Ensures we should properly join on shutdown
		assert!(db.join_on_shutdown);

		// The value is not there anymore
		drop(db);
		let db = Db::open_read_only(&options).unwrap();
		assert_eq!(db.get(0, &[0]).unwrap(), Some(vec![0]));
	}

High memory usage on default allocator

After switching from RocksDB to ParityDB we discovered high memory usage with the system allocator:

        system  jemalloc
btree   5.93G   4.34G
kv      7.27G   5.49G

Here is the code for reproducing the issue:

#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    let batch_size = 1024 * 1024;
    let handles = (0..4)
        .map(|i| {
            let p = format!("some-path/{i}");
            let _ = std::fs::remove_dir_all(&p);
            let opts = parity_db::Options {
                path: p.into(),
                columns: vec![parity_db::ColumnOptions {
                    preimage: false,
                    btree_index: true,
                    uniform: false,
                    ref_counted: false,
                    compression: parity_db::CompressionType::NoCompression,
                    compression_threshold: 4096,
                }],
                sync_wal: true,
                sync_data: true,
                stats: false,
                salt: None,
            };
            std::thread::spawn(move || {
                let db = parity_db::Db::open_or_create(&opts).unwrap();

                for range in (0..100u64).map(|i| i * batch_size..(i + 1) * batch_size) {
                    db.commit(range.map(|i| (0, i.to_be_bytes(), Some(i.to_le_bytes().to_vec()))))
                        .unwrap();
                }
            })
        })
        .collect::<Vec<_>>();
    for handle in handles {
        handle.join().unwrap();
    }
}

Possible unnoticed corruption

I would like help confirming something. I run a few archive nodes for Kusama and Polkadot. They are all fully synced, but when querying data on this snapshot of Kusama I am noticing missing blocks (hash is there, block is not) and the finalized head is at the genesis block.

I would like help confirming this:

Steps:

  • Download the parityDB snapshot from here either via direct link or IPFS
  • Unzip into your local Kusama DB location or set a new base path with this as the destination
  • Run kusama with --db paritydb --pruning archive, let it sync to idle
  • Check if you have a finalized head

My assumption is that one of the tables got corrupted when my VM crashed. I have big RAM problems with Polkadot and Kusama nodes on WSL2 so the VM can crash during sync, which may have borked a table, but this is just me guessing.

More worrying is that if the node can't tell it has a corrupt DB, it might be dangerous for a validator to rely on as a backup option.

Expose column stats

Current stats can be written out in text form, but it would be useful to access them programmatically too. For instance, I'm interested in the number of elements stored in a hash column and/or the amount of data.

The data is already there, but unfortunately there is no API to access it.

Properly fuzz concurrency behaviors

Currently the fuzzers only write and read using a single thread and do not control the background threads concurrency. It might be nice to use a tool like loom to properly cover possible concurrency issues.

It should also make the debugging of bugs like #124 easier (no more reproduction issues).

Add compressed info to MULTIHEAD

Multipart storage only stores the compression bit in the very last chunk.
It would be good to be able to know whether compression is applied from the very first chunk.
We could use a new const in this case:

const MULTIHEAD_COMPRESSED: &[u8] = &[0xfc, 0xff];

or &[0xfd, 0x7f].
This may require a new db_version for compatibility, or at least removing dead code when dropping compatibility.

Slice indexing error after BTree header write failure

First bug uncovered thanks to #113.

When a partial write happens after opening the database with a single BTree column, iteration fails on the following db opening with 'range end index 8 out of range for slice of length 0' inside Entry::read_slice, nested in BTreeTable::btree_header.

Reproduction test (requires #113 and enabling the instrumentation feature):

	#[test]
	fn test_from_fuzzer() {
		crate::set_number_of_allowed_io_operations(12);
		let tmp = tempdir().unwrap();
		let mut options = Options::with_columns(tmp.path(), 1);
		options.columns[0].btree_index = true;
		assert!(Db::open_or_create(&options).is_err()); //It's going to fail

		let db = Db::open_or_create(&options).unwrap();
		db.iter(0).unwrap(); //Read is panicking here
	}

Corrupted btree entries should not result in panics

When parsing value table entries, all expected reads should be validated against the actual buffer size. I believe the regular hash value tables already have checks in place. Btree entry parsing, however, does not, and should be fixed.

Example of such panic: #190

io_uring

This is a significant performance win.

seek_read/seek_write are not guaranteed to read/write exact amount of data on Windows

I was reading through this:

parity-db/src/file.rs

Lines 137 to 150 in 0a0491b

#[cfg(windows)]
pub fn read_at(&self, buf: &mut [u8], offset: u64) -> Result<()> {
use std::os::windows::fs::FileExt;
self.file.read().as_ref().unwrap().seek_read(buf, offset)?;
Ok(())
}
#[cfg(windows)]
pub fn write_at(&self, buf: &[u8], offset: u64) -> Result<()> {
use std::os::windows::fs::FileExt;
self.dirty.store(true, Ordering::Relaxed);
self.file.read().as_ref().unwrap().seek_write(buf, offset)?;
Ok(())
}

And noticed that you're not checking how much data was actually read/written, potentially causing serious issues on Windows and permanently corrupting the database.
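
A sketch of the kind of fix the report implies: loop until the whole buffer has been transferred, since seek_read (and analogously seek_write) may do partial transfers. This is only an illustration, not the actual patch:

#[cfg(windows)]
fn read_exact_at(file: &std::fs::File, buf: &mut [u8], offset: u64) -> std::io::Result<()> {
    use std::os::windows::fs::FileExt;
    let mut done = 0usize;
    while done < buf.len() {
        match file.seek_read(&mut buf[done..], offset + done as u64) {
            // A zero-length read before the buffer is full means EOF.
            Ok(0) => {
                return Err(std::io::Error::new(
                    std::io::ErrorKind::UnexpectedEof,
                    "failed to fill whole buffer",
                ))
            }
            Ok(n) => done += n,
            Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}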

Release 0.9.8 and paritydb

OS: Ubuntu 20.04.2 LTS
Db: Paritydb
Polkadot release version: 0.9.8
Chain: Kusama

This bug is occurring on all my nodes with paritydb and kusama after the update to polkadot 0.9.8.
Nodes with chain = polkadot are fine.

Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]: 2021-07-07 20:22:00.433  INFO tokio-runtime-worker substrate: ✨ Imported #8244832 (0x45e7…4532)
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]: ====================
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]: Version: 0.9.8-3a10ee63c-x86_64-linux-gnu
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    0: sp_panic_handler::set::{{closure}}
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    1: std::panicking::rust_panic_with_hook
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:595:17
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    2: std::panicking::begin_panic_handler::{{closure}}
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:495:13
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys_common/backtrace.rs:141:18
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    4: rust_begin_unwind
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:493:5
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    5: core::panicking::panic_fmt
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:92:14
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    6: core::panicking::panic
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:50:5
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    7: parity_db::db::DbInner::enact_logs
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    8: parity_db::db::Db::commit_worker
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:    9: std::sys_common::backtrace::__rust_begin_short_backtrace
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:   10: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:   11: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/alloc/src/boxed.rs:1546:9
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/alloc/src/boxed.rs:1546:9
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:       std::sys::unix::thread::Thread::new::thread_start
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:              at rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys/unix/thread.rs:71:17
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:   12: start_thread
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:   13: clone
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]: Thread '<unnamed>' panicked at 'attempt to subtract with overflow', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/parity-db-0.2.4/src/db.rs:523
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]: This is a bug. Please report it at:
Jul 07 20:22:00 Kusama-Cosmoon2 polkadot[1537]:         https://github.com/paritytech/polkadot/issues/new
Jul 07 20:22:00 Kusama-Cosmoon2 systemd[1]: polkadot-validator.service: Main process exited, code=exited, status=1/FAILURE
Jul 07 20:22:00 Kusama-Cosmoon2 systemd[1]: polkadot-validator.service: Failed with result 'exit-code'.
Jul 07 20:24:00 Kusama-Cosmoon2 systemd[1]: polkadot-validator.service: Scheduled restart job, restart counter is at 2.
Jul 07 20:24:00 Kusama-Cosmoon2 systemd[1]: Stopped Polkadot Validator.
Jul 07 20:24:00 Kusama-Cosmoon2 systemd[1]: Started Polkadot Validator.
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.979  INFO main polkadot_cli::command: ----------------------------
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main polkadot_cli::command: This chain is not in any way
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main polkadot_cli::command:       endorsed by the
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main polkadot_cli::command:      KUSAMA FOUNDATION
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main polkadot_cli::command: ----------------------------
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main sc_cli::runner: Parity Polkadot
Jul 07 20:24:00 Kusama-Cosmoon2 polkadot[2080]: 2021-07-07 20:24:00.980  INFO main sc_cli::runner: ✌️  version 0.9.8-3a10ee63c-x86_64-linux-gnu

v0.2.3 has a broken Cargo.toml

I'm trying to follow https://substrate.dev/docs/en/tutorials/create-your-first-substrate-chain/setup

but my build fails with:

$ cargo build --release
error: failed to download `parity-db v0.2.3`

Caused by:
  unable to get packages from source

Caused by:
  failed to parse manifest at `/home/bear/.cargo/registry/src/github.com-1ecc6299db9ec823/parity-db-0.2.3/Cargo.toml`

Caused by:
  failed to parse the version requirement `0.11	` for dependency `parking_lot`

Caused by:
  expected comma after minor version number, found '\t'

I downloaded parity-db==0.2.3 via cargo-download and found that Cargo.toml indeed has an extra \t on the parking_lot declaration:

# THIS FILE IS AUTOMATICALLY GENERATED BY CARGO
#
# When uploading crates to the registry Cargo will automatically
# "normalize" Cargo.toml files for maximal compatibility
# with all versions of Cargo and also rewrite `path` dependencies
# to registry (e.g., crates.io) dependencies
#
# If you believe there's an error in this file please file an
# issue against the rust-lang/cargo repository. If you're
# editing this file be aware that the upstream Cargo.toml
# will likely look very different (and much more reasonable)

[package]
edition = "2018"
name = "parity-db"
version = "0.2.3"
authors = ["Parity Technologies <[email protected]>"]
description = "Key-value database for the blockchain"
homepage = "https://substrate.dev"
license = "MIT OR Apache-2.0"
repository = "https://github.com/paritytech/parity-db/"
[profile.release]
lto = "fat"
codegen-units = 1
debug = true
panic = "abort"
[dependencies.blake2-rfc]
version = "0.2.18"

[dependencies.crc32fast]
version = "1.2.0"

[dependencies.fs2]
version = "0.4.3"

[dependencies.hex]
version = "0.4.2"

[dependencies.libc]
version = "0.2"

[dependencies.log]
version = "0.4.8"

[dependencies.memmap2]
version = "0.2"

[dependencies.parking_lot]
version = "0.11\t"

[dependencies.rand]
version = "0.8.2"
[dev-dependencies.env_logger]
version = "0.8.2"
