
Cosmos DB


Common database interface for various database backends. Primarily meant for applications built on Tendermint, such as the Cosmos SDK, but can be used independently of these as well.

Minimum Go Version

Go 1.19+

Supported Database Backends

  • MemDB [stable]: An in-memory database using Google's B-tree package. Offers very high performance for reads, writes, and range scans, but is not durable and will lose all data on process exit. Does not support transactions. Suitable for e.g. caches, working sets, and tests. Used for IAVL working sets when the pruning strategy allows it.

  • GoLevelDB: a pure Go implementation of LevelDB (see below). Currently the default on-disk database used in the Cosmos SDK.

  • LevelDB (cleveldb): LevelDB via the levigo Go wrapper. Uses LSM-trees for on-disk storage, which perform well for write-heavy workloads, particularly on spinning disks, but require periodic compaction to maintain decent read performance and reclaim disk space. Does not support transactions.

  • RocksDB: A Go wrapper around RocksDB. Similarly to LevelDB (above) it uses LSM-trees for on-disk storage, but is optimized for fast storage media such as SSDs and memory. Supports atomic transactions, but not full ACID transactions.

  • Pebble: a RocksDB/LevelDB-inspired key-value database in Go that uses the RocksDB file format and LSM-trees for on-disk storage. Supports snapshots.

Meta-databases

  • PrefixDB [stable]: A database which wraps another database and uses a static prefix for all keys. This allows multiple logical databases to be stored in a common underlying database by using different namespaces. Used by the Cosmos SDK to give different modules their own namespaced database in a single application database.

Tests

To test common databases, run make test. If all databases are available on the local machine, use make test-all to test them all.

Contributors

alatuszam, catshaark, dependabot-preview[bot], dependabot[bot], erikgrinaker, facundomedica, faddat, itsdevbear, jayt106, julienrbrt, luohaha, mark-rushakoff, melekes, mvdan, odeke-em, robert-zaremba, roysc, stumble, tac0turtle, tessr, williambanfield, yihuang


Issues

tune default rocksdb options

Current Default

  • target_file_size_multiplier = 1
  • block_size = 4096
  • OptimizeLevelStyleCompaction(512M) implies
    • target_file_size_base = 64M
    • snappy/lz4 compression types

Problem

  • sst file size is capped at 64M, producing a large number of files
  • small block_size leads to larger index/filter blocks and less efficient compression
  • could use stronger compression at the lower levels; zstd with a preset dictionary is pretty good according to our tests.

Tuning For DB Size

  • Increase sst file sizes at the lower levels to 300M+; set target_file_size_multiplier = 2?
  • Increase block_size to 32k.
  • Use stronger compression at the lower levels: zstd with a preset dictionary.

We managed to reduce a testnet node's application.db from 256G to 174G by doing a manual compaction with the new parameters.
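Expressed as a RocksDB OPTIONS-file fragment, the tuning above might look like this (a sketch, not the final configuration; the compression-per-level ladder and the 300M base size are illustrative values):

```ini
[CFOptions "default"]
  # larger files at the lower levels -> far fewer sst files
  target_file_size_base=314572800
  target_file_size_multiplier=2
  # lighter compression at the top levels, zstd at the lower (larger) levels
  compression_per_level=kLZ4Compression:kLZ4Compression:kZSTD:kZSTD:kZSTD:kZSTD:kZSTD
  optimize_filters_for_hits=true
  level_compaction_dynamic_level_bytes=true

[TableOptions/BlockBasedTable "default"]
  # bigger blocks -> smaller index/filter blocks, better compression ratio
  block_size=32768
  format_version=4
```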

Other Options

  • optimize_filters_for_hits = 1
  • level_compaction_dynamic_level_bytes = true
  • format_version = 4: more efficient index/filter blocks: http://rocksdb.org/blog/2019/03/08/format-version-4.html
  • format_version = 5 with optimize_filters_for_memory=true and jemalloc: more memory-efficient bloom filters.
  • the ribbon filter seems a good trade-off, saving memory and disk space.

MemTable Optimizations

  • memtable_whole_key_filtering
  • memtable_prefix_bloom_size_ratio

Pure Go: Do we want that?

The only reason this repository has ever needed a fairly complex build system is that leveldb and rocksdb both use cgo.

Since we're adding pebble, which outperforms rocksdb, and since leveldb isn't much of a standout while goleveldb is more ergonomic, there's a question of whether we'd like to make the databases pure Go, mainly for ergonomics.

support rocksdb column family

Column families fit well with the different IAVL stores: at the very least it becomes easier to inspect the DB state of each store, and DB options could potentially be tuned per store.

But it seems hard to do that transparently without changing a few APIs.

Add Size() method for Batch interface

This is a sub-task of this issue. As discussed there, we need to divide each module's commit into many smaller DB batches of a pre-configured size to improve the performance of the underlying LSM backend. Hence we need to add a Size() method to the Batch interface so that we can constrain each DB batch to the pre-configured size.

Add GetNearest method to the DB interface

Summary

There is an issue in cosmos-sdk: Make KVStore interface have methods to getNearest entry.

To address this issue, we can add a new method, GetNearest to the DB interface and implement it for all supported storage engines. This method should have a similar cost to the Get operation and provide a more efficient alternative to iterators for finding the nearest key.

// GetNearest retrieves the nearest key to the given key.
// If 'ascending' is true, the method returns the smallest key greater than the given key.
// If 'ascending' is false, the method returns the largest key smaller than the given key.
// In case there is no key in the desired direction, the method returns nil.
// CONTRACT: key, value read-only []byte
GetNearest(key []byte, ascending bool) ([]byte, error)

Once implemented, the method can be used in the cosmos-sdk repo to add a GetNearest function to the KVStore interface, as described in the original issue.

introduce DB options when creating new backend DB

When testing Cronos functionality, we found that the changes in tendermint/tm-db#218 made the DB consume too much RAM. Instead of hardcoding the maximum open file number, we would prefer it to be adjustable based on the instance's resources.
Therefore,

We may introduce DBOptions into dbCreator, i.e.

type (
	dbCreator func(name string, dir string, opts DBOptions) (DB, error)

	DBOptions interface {
		Get(string) interface{}
	}
)

and introduce a new method, then update the original NewDB implementation:

func NewDBwithOptions(name string, backend BackendType, dir string, opts DBOptions) (DB, error) {

func NewDB(name string, backend BackendType, dir string) (DB, error) {
	return NewDBwithOptions(name, backend, dir, nil)
}

Therefore, it will be easier to make DB adjustments to the Cosmos SDK when the DB supports certain options.

remove cleveldb

I'm not aware of anyone successfully or preferentially using cleveldb, so I think that we should remove it.

Make a second DB "IterKey" API that returns a key to the caller that can potentially mutate

(X-post of cometbft/cometbft-db#156 )

Currently the Iterator.Key() API makes a copy of the key it gets from the database, because the database's iterator returns a buffer it will mutate on the subsequent .Next() call for heap efficiency. This extra copy causes very large heap allocations (and time overheads) on query-serving nodes, and a 1% time overhead to the entire state machine time for Osmosis.

On a heap allocation profile of a query-serving Osmosis RPC node over an hour, 450 gigabytes were allocated from this API. On spot-check, none of the big callers need this copying behavior. (160GB was removed by a tendermint update, but the remaining 290GB is still from this API.)


In the state machine, we see 1% of state machine execution time is blocked on copying this key, again in situations where I don't think we need any of this either.


Proposal: Add a new method KeyMut() to the Iterator interface. The caller must not mutate the returned key, and should expect that it may be mutated by the next .Next() call.

I'm not stoked about the naming of this method, so I'm happy to hear better ideas.

Benchmarks

So, real-world performance comparisons have shown that PebbleDB is much faster, but in unexpected ways:

  • each node will consume 10x less write throughput
  • each node will consume 27x less read throughput
  • each node will deliver 95%-105% of the query traffic

meaning that users can serve dramatically more traffic from the same hardware.

Previously we benchmarked by cloning iavl and replacing its database dependency with our fork of tm-db.

I'd like to have benchmarks here so that we can observe changes commit by commit.

Thoughts on style?
