I am thinking of two approaches. The first is limited to a special case, while the second is more generic but also more expensive.
Setting
The hybrid log is organized as segments of a considerable size. Garbage collection happens at the beginning of the hybrid log. What are the records that can be garbage collected?
- Records belonging to keys that were deleted later in the hybrid log (deletions are denoted using tombstone records in FASTER).
- Records belonging to keys that have a newer value mapping later in the hybrid log.
We call the remaining keys in the segment *live*. A live key is one whose latest value mapping is in the garbage-collected segment, with no mapping in any later segment. In order to GC a segment, we need to identify all live keys and re-insert their value mappings from the GCed segment at the tail of the hybrid log.
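To make the liveness definition concrete, here is a small self-contained sketch (Python, with a simplified in-memory log; all names are illustrative, not FASTER's API). Segments are lists of `(key, value)` pairs in log order, and `None` stands in for a tombstone:

```python
# Illustrative only: segments are lists of (key, value) pairs, oldest record
# first; value None denotes a tombstone (delete).
def live_keys(segment, later_segments):
    # Any key mentioned in a later segment is superseded there.
    later = {k for seg in later_segments for (k, _) in seg}
    live = {}
    for key, value in segment:       # later records overwrite earlier ones
        if value is None:
            live.pop(key, None)      # deleted within the segment itself
        else:
            live[key] = value
    # Live keys: latest mapping is in this segment, none in later segments.
    return {k: v for k, v in live.items() if k not in later}

segment0 = [("a", 1), ("b", 2), ("a", 3)]
later = [[("b", 9), ("c", None)]]
# "a" keeps its latest mapping 3; "b" is superseded later, so only "a" is live.
print(live_keys(segment0, later))  # {'a': 3}
```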
Using Bitmaps
Key Assumption: There exists a collision-free hash function perfect_hash that can be used to design a bitmap representation for the set of keys in the database.
Each segment can then be mapped to a set of keys represented as a bitmap. To determine the non-live keys in a segment S, we simply intersect the bitmap of S with the union of the bitmaps of all segments that follow it.
```
List<Record> garbage_collect(Segment segment, List<Segment> laterSegments) {
    List<Record> liveRecords = new List<Record>();
    // Union of the key bitmaps of all later segments
    Bitmap U = new Bitmap();
    foreach (Segment other in laterSegments) {
        U.union(other.GetBitmap());
    }
    Bitmap B = segment.GetBitmap();
    // Keys that appear again later are not live
    Bitmap non_live_keys = B.intersect(U);
    // Scan records newest-first so we keep only the latest mapping per key
    foreach (Record record in segment.ReverseOrder()) {
        if (!non_live_keys.Exists(record.key)) {
            liveRecords.Add(record);
            // we need only the latest value mapping in the segment
            non_live_keys.Insert(record.key);
        }
    }
    return liveRecords; // to be re-inserted at the tail of the hybrid log
}
```
This algorithm eliminates the expensive step of reading the later segments into memory; only their bitmaps are needed.
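Under the perfect-hash assumption, a bitmap is just an integer bitmask, so the whole scheme can be sketched in a few lines of runnable Python. `perfect_hash` here is a stand-in (position in a fixed key universe) for the assumed collision-free hash; all names are illustrative:

```python
# Sketch under the perfect-hash assumption: perfect_hash maps each key to a
# unique bit position, so a key set becomes an integer bitmask.
def perfect_hash(key, universe):
    # Stand-in for the assumed collision-free hash function.
    return universe.index(key)

def bitmap(keys, universe):
    bm = 0
    for k in keys:
        bm |= 1 << perfect_hash(k, universe)
    return bm

universe = ["a", "b", "c", "d"]
seg_keys = ["a", "b", "c"]            # keys in the segment to GC
later_segment_keys = [["b"], ["c", "d"]]

# Union of the bitmaps of all later segments
union_later = 0
for ks in later_segment_keys:
    union_later |= bitmap(ks, universe)

# Intersection gives the non-live keys; the rest are live
non_live = bitmap(seg_keys, universe) & union_later
live = [k for k in seg_keys if not (non_live >> perfect_hash(k, universe)) & 1]
print(live)  # ['a']
```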
Using Bloom Filters
The collision-free perfect hash assumption is unreasonable in many use cases, so we need to resort to an approximate set-membership representation such as a bloom filter. False positives are costly here because each one forces an exact membership test. Here is a preliminary design.
```
Set<Key> garbage_collect(Segment segment, List<Segment> laterSegments) {
    // Build the set of keys in the page/segment to be GCed
    Set<Key> keys = new Set<Key>();
    foreach (Record record in segment) {
        if (!keys.Has(record.key)) {
            keys.Insert(record.key);
        }
    }
    BloomFilter current = segment.GetBloomFilter();
    foreach (Segment other in laterSegments) {
        BloomFilter otherBF = other.GetBloomFilter();
        BloomFilter common = otherBF.intersection(current);
        bool check = false;
        if (!common.Empty()) {
            foreach (Key key in keys) {
                if (common.Exists(key)) {
                    check = true;
                    break;
                }
            }
        }
        // the bloom filter check tells us that this page might contain
        // keys that invalidate some of our live keys
        if (check) {
            // exact membership test: scan the records of the other segment
            foreach (Record rec in other) {
                if (keys.Has(rec.key)) {
                    keys.Delete(rec.key);
                }
            }
            // rebuild the bloom filter for the shrunken key set
            // (inefficient - a counting filter that supports delete
            // would avoid the rebuild)
            current = new BloomFilter();
            foreach (Key key in keys) {
                current.Add(key);
            }
        }
    }
    return keys; // live keys whose records must be re-inserted at the tail
}
```
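As a runnable illustration of the same idea, here is a Python sketch with a toy bloom filter. It simplifies the design above in one respect: instead of intersecting two filters, it probes the later segment's filter with each candidate key, which serves the same purpose as the cheap pre-check. The `Bloom` class and all names are illustrative, not FASTER's API:

```python
import hashlib

class Bloom:
    """Toy bloom filter; illustrative only, not production quality."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all((self.bits >> p) & 1 for p in self._positions(key))

def garbage_collect(segment_keys, later_segments):
    # segment_keys: keys of the segment to GC; later_segments: lists of keys.
    keys = set(segment_keys)
    for other in later_segments:
        bf = Bloom()
        for k in other:
            bf.add(k)
        # Cheap filter check first; scan the segment only on a possible hit.
        if any(bf.might_contain(k) for k in keys):
            keys -= set(other)   # exact check by scanning the records
    return keys                  # live keys to re-insert at the tail

print(garbage_collect(["a", "b", "c"], [["b"], ["c", "d"]]))  # {'a'}
```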
It is impossible to avoid scanning through the records (to check the keys) when using an approximate summary of keys such as a bloom filter. One alternative is to store the keys in a separate file alongside the hybrid log. This file can also be computed as an offline post-processing step.
For every page or segment we have a keys file. We wish to garbage collect segments[0]; the subsequent segments are segments[1], ..., segments[n].
Hash-Table Based Solution
The only requirement on the keys file is that it must contain all keys that have a record in the corresponding hybrid log segment.
```
hashTable = create hash table mapping every key in segments[0].keys to 'valid'
for (int i = 1; i <= n; i++) {
    other = segments[i].keys;
    foreach (var key in other) {
        if (hashTable.Contains(key) && hashTable[key] == 'valid') {
            hashTable[key] = 'invalid';   // superseded by a later segment
        }
    }
}
foreach (var key in hashTable.keys) {
    if (hashTable[key] == 'valid') {      // still live: no later mapping
        create corresponding record at tail;
    }
}
```
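The hash-table pass can be sketched as runnable Python (all names illustrative; the input is the per-segment keys files, with the segment to GC first):

```python
def live_keys_hash_table(keys_files):
    """keys_files[0]: keys of the segment to GC; keys_files[1:]: later segments."""
    table = {key: "valid" for key in keys_files[0]}
    for other in keys_files[1:]:
        for key in other:
            if table.get(key) == "valid":
                table[key] = "invalid"   # superseded by a later record
    # Keys still valid are live; their records are re-inserted at the tail.
    return [k for k, state in table.items() if state == "valid"]

print(live_keys_hash_table([["a", "b", "c"], ["b"], ["c", "d"]]))  # ['a']
```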
Sort-Merge-Diff Solution
Another variant is to have a sorted keys file for each segment. For each segments[i]
with i > 0
, we scan through sorted keys and merge-diff with that of segments[0]
```
res = segments[0].sorted_keys;
for (int i = 1; i <= n; i++) {
    seg = res;                       // keys surviving segments[1..i-1]
    res = new empty sorted keys file;
    other = segments[i].sorted_keys;
    seg.begin();
    other.begin();
    while (seg.hasNext() && other.hasNext()) {
        if (seg.front() == other.front()) {
            // key reappears in a later segment: drop it
            seg.next();
            other.next();
        } else if (seg.front() < other.front()) {
            res.add(seg.front());
            seg.next();
        } else {
            other.next();
        }
    }
    // remaining keys in seg have no match in other
    while (seg.hasNext()) {
        res.add(seg.front());
        seg.next();
    }
    seg.close();
    other.close();
    res.close();
}
```
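The merge-diff step can be sketched as runnable Python (illustrative names; this assumes each sorted keys file is deduplicated within a segment):

```python
def merge_diff(sorted_a, sorted_b):
    """Elements of sorted_a that do not appear in sorted_b (two-pointer scan)."""
    out, i, j = [], 0, 0
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            i += 1          # key reappears later: drop it
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            out.append(sorted_a[i])
            i += 1
        else:
            j += 1
    out.extend(sorted_a[i:])  # no match possible for the rest
    return out

def live_keys_sort_merge(sorted_keys_files):
    """sorted_keys_files[0]: segment to GC; the rest: later segments."""
    res = sorted_keys_files[0]
    for other in sorted_keys_files[1:]:
        res = merge_diff(res, other)
    return res

print(live_keys_sort_merge([["a", "b", "c"], ["b"], ["c", "d"]]))  # ['a']
```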
Comparison
| | Hash-Table-Based | Sort-Merge-Diff |
|---|---|---|
| Memory | Entire key set of one segment (its hash table) plus one page | Two pages |
| GC time | Scan only the keys of each later segment | Scan both key files and rewrite the GC segment's surviving keys, n times |
| Pre-processing | Simply write out the keys | Need to sort the keys |
Compaction is now merged to master!