Giter Club home page Giter Club logo

Comments (3)

gunaprsd avatar gunaprsd commented on May 18, 2024

I am thinking of two approaches. The first is limited to a special case while the latter is more generic and more expensive.
Setting
The hybrid log is organized as segments of a considerable size. Garbage collection happens at the beginning of the hybrid log. What are the records that can be garbage collected?

  1. Records belonging to keys that were deleted later in the hybrid log (which are denoted using tombstone records in FASTER.
  2. Records belonging to keys that have a new value mapping later in the hybrid log.

We call the remaining keys in segment live. A live key is one which has the latest value mapping in the garbage collected segment and none in any later segments. In order to GC a segment, we need to identify all live keys and re-insert their value mapping from the GCed segment back to the tail of the hybrid log.

Using Bitmaps
Key Assumption: There exists a collision-free hash function perfect_hash that can be used to design a bitmap representation for set of keys in the database.

Each segment can then be mapped to a set of keys represented using a bitmap. In order to determine non-live keys in segment S, we simply find the intersection of bitmap of S with union of all bitmaps of segments that follow.

void garbage_collect(Segment segment, List<Segment> laterSegments) {
  List<Record> liveRecords;
  Bitmap U = new Bitmap();
  foreach(Segment other in laterSegments) {
    U.union(other.GetBitmap());
  }
  Bitmap B = segment.GetBitmap();
  Bitmap non_live_keys = B.intersect(U);
  for(Record record in segment reverse order) {
    if(!non_live_keys.Exists(record.key)) {
      liveRecords.Add(record);
      // we need only the latest value mapping in the segment
      non_live_keys.Insert(record.key); 
    }
  }
}

This algorithm eliminates expensive reading of segments into memory.

Using Bloomfilters
The collision-free perfect hash function assumption is unreasonable in many use-cases. So, we need to resort to an approximate set membership representation such as bloom filter. The false positives come with a lot of overhead because we need an exact membership test. Here is a preliminary design.

void garbage_collect(Segment segment, List<Segment> laterSegments) {
// Build a set of keys in the page/segment to be GCed
Set<Key> keys;
foreach(Record record in segment) {
  if(!keys.Has(record.key)) {
    keys.Insert(record.key);
  }
}

BloomFilter current = segment.GetBloomFilter();
for(auto other : laterSegments) {
  BloomFilter otherBF = other.GetBloomFilter();
  auto common = otherBF.intersection(current);
  auto check = false;
  if(!common.Empty()) {
    for(auto key: keys) {
      if(common.Exists(key)) {
        check = true;
        break;
      }
    }
    // our bloom filter check tells us that there might be some 
    // keys in this page that can make invalidate some live keys
    if(check) {
      for(Record rec in other) {
        if(keys.Has(rec.key)) {
          keys.Delete(rec.key);
        }
      }
      // update our bloom filter for the new set of keys
      // hmm this is inefficient - we could use a counting filter
      // that supports delete. 
      current = new BloomFilter();
      for(auto key: keys) {
        current.Add(key);
      }
    }
  }
}
}

from faster.

gunaprsd avatar gunaprsd commented on May 18, 2024

It is impossible to avoid scanning through the records (to check the key) when you use an approximate summary of keys such as bloom filter. One alternative is to store the keys in a separate file alongside the hybrid log. This can also be computed as an offline post-processing.

For every page or segment we have a keys file. We wish to garbage collect segments[0] and the subsequent segments are segments[1], ..., segments[n].

Hash-Table Based Solution

The only requirement on the keys file is that it must contain all keys that have a record in the corresponding hybrid log segment.

hashTable = create hash table from segments[0].keys file to {valid, invalid}
for(int i = 1; i <= n; i++) {
  other = segments[i].keys;
  foreach(var key in other) {
    if(hashTable[key] == 'valid') {
      hashTable[key] = 'invalid';
    }
  }
}

foreach(var key in hashTable.keys) {
  if(hashTable[key] == 'invalid') {
    create corresponding record at tail;
  }
}

Sort-Merge-Diff Solution

Another variant is to have a sorted keys file for each segment. For each segments[i] with i > 0, we scan through sorted keys and merge-diff with that of segments[0]

res = segments[0].sorted_keys;
for(int i = 1; i <= n; i++) {
 seg = res;
 res.clear();
 other = segments[1].sorted_keys;
 seg.begin();
 other.begin();
 while(seg.hasNext() && other.hasNext()) {
   if(seg.front() == other.front()) {
    seg.next();
    other.next();
   } else if (seg.front() < other.front()) {
    res.add(seg.front());
    seg.next();
   } else {
    other.next();
   }
 }
 while(seg.hasNext()) {
  res.add(seg.next());
 }
 seg.close();
 other.close();
 res.close();
}

Comparison

Hash-Table-Based Sort-Merge-Diff
Memory Entire key-set of one segment (its hash table) and a page Two pages
GC-Time Scan through only keys in other segment Scan through both segments and rewrite keys for GC segment n times
Pre-processing Simply write out keys Need to sort the keys

from faster.

badrishc avatar badrishc commented on May 18, 2024

Compaction is now merged to master!

from faster.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.