Currently we support garbage collection based on expiry of keys, via truncation from t

Compaction based garbage collection about faster HOT 3 CLOSED

microsoft commented on May 18, 2024

Compaction based garbage collection

from faster.

Comments (3)

gunaprsd commented on May 18, 2024

I am thinking of two approaches. The first is limited to a special case while the latter is more generic and more expensive.
Setting
The hybrid log is organized as segments of a considerable size. Garbage collection happens at the beginning of the hybrid log. What are the records that can be garbage collected?

Records belonging to keys that were deleted later in the hybrid log (which are denoted using tombstone records in FASTER.
Records belonging to keys that have a new value mapping later in the hybrid log.

We call the remaining keys in segment live. A live key is one which has the latest value mapping in the garbage collected segment and none in any later segments. In order to GC a segment, we need to identify all live keys and re-insert their value mapping from the GCed segment back to the tail of the hybrid log.

Using Bitmaps
Key Assumption: There exists a collision-free hash function perfect_hash that can be used to design a bitmap representation for set of keys in the database.

Each segment can then be mapped to a set of keys represented using a bitmap. In order to determine non-live keys in segment S, we simply find the intersection of bitmap of S with union of all bitmaps of segments that follow.

void garbage_collect(Segment segment, List<Segment> laterSegments) {
  List<Record> liveRecords;
  Bitmap U = new Bitmap();
  foreach(Segment other in laterSegments) {
    U.union(other.GetBitmap());
  }
  Bitmap B = segment.GetBitmap();
  Bitmap non_live_keys = B.intersect(U);
  for(Record record in segment reverse order) {
    if(!non_live_keys.Exists(record.key)) {
      liveRecords.Add(record);
      // we need only the latest value mapping in the segment
      non_live_keys.Insert(record.key); 
    }
  }
}

This algorithm eliminates expensive reading of segments into memory.

Using Bloomfilters
The collision-free perfect hash function assumption is unreasonable in many use-cases. So, we need to resort to an approximate set membership representation such as bloom filter. The false positives come with a lot of overhead because we need an exact membership test. Here is a preliminary design.

void garbage_collect(Segment segment, List<Segment> laterSegments) {
// Build a set of keys in the page/segment to be GCed
Set<Key> keys;
foreach(Record record in segment) {
  if(!keys.Has(record.key)) {
    keys.Insert(record.key);
  }
}

BloomFilter current = segment.GetBloomFilter();
for(auto other : laterSegments) {
  BloomFilter otherBF = other.GetBloomFilter();
  auto common = otherBF.intersection(current);
  auto check = false;
  if(!common.Empty()) {
    for(auto key: keys) {
      if(common.Exists(key)) {
        check = true;
        break;
      }
    }
    // our bloom filter check tells us that there might be some 
    // keys in this page that can make invalidate some live keys
    if(check) {
      for(Record rec in other) {
        if(keys.Has(rec.key)) {
          keys.Delete(rec.key);
        }
      }
      // update our bloom filter for the new set of keys
      // hmm this is inefficient - we could use a counting filter
      // that supports delete. 
      current = new BloomFilter();
      for(auto key: keys) {
        current.Add(key);
      }
    }
  }
}
}

from faster.

gunaprsd commented on May 18, 2024

It is impossible to avoid scanning through the records (to check the key) when you use an approximate summary of keys such as bloom filter. One alternative is to store the keys in a separate file alongside the hybrid log. This can also be computed as an offline post-processing.

For every page or segment we have a keys file. We wish to garbage collect segments[0] and the subsequent segments are segments[1], ..., segments[n].

Hash-Table Based Solution

The only requirement on the keys file is that it must contain all keys that have a record in the corresponding hybrid log segment.

hashTable = create hash table from segments[0].keys file to {valid, invalid}
for(int i = 1; i <= n; i++) {
  other = segments[i].keys;
  foreach(var key in other) {
    if(hashTable[key] == 'valid') {
      hashTable[key] = 'invalid';
    }
  }
}

foreach(var key in hashTable.keys) {
  if(hashTable[key] == 'invalid') {
    create corresponding record at tail;
  }
}

Sort-Merge-Diff Solution

Another variant is to have a sorted keys file for each segment. For each segments[i] with i > 0, we scan through sorted keys and merge-diff with that of segments[0]

res = segments[0].sorted_keys;
for(int i = 1; i <= n; i++) {
 seg = res;
 res.clear();
 other = segments[1].sorted_keys;
 seg.begin();
 other.begin();
 while(seg.hasNext() && other.hasNext()) {
   if(seg.front() == other.front()) {
    seg.next();
    other.next();
   } else if (seg.front() < other.front()) {
    res.add(seg.front());
    seg.next();
   } else {
    other.next();
   }
 }
 while(seg.hasNext()) {
  res.add(seg.next());
 }
 seg.close();
 other.close();
 res.close();
}

Comparison

	Hash-Table-Based	Sort-Merge-Diff
Memory	Entire key-set of one segment (its hash table) and a page	Two pages
GC-Time	Scan through only keys in other segment	Scan through both segments and rewrite keys for GC segment n times
Pre-processing	Simply write out keys	Need to sort the keys

from faster.

badrishc commented on May 18, 2024

Compaction is now merged to master!

from faster.

Compaction based garbage collection about faster HOT 3 CLOSED

Comments (3)

Hash-Table Based Solution

Sort-Merge-Diff Solution

Comparison

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent