Comments (6)

X-Ryl669 commented on September 26, 2024

Thank you for the detailed report.

Frost uses a rolling checksum based on Adler32 to split data into chunks, with the following restrictions:

  1. The minimum chunk size is fixed (to avoid chunks of 11 bytes, as in your example)
  2. The maximum chunk size is also fixed (to avoid huge chunks that never satisfy the division rule for chunking, see below)
  3. The chunker uses two divisors to decide whether it's time to split off a chunk or not.

I've implemented the algorithm described in this document: http://www.philippheckel.com/files/syncany-heckel-thesis.pdf
Basically, the chunker is called Two Thresholds, Two Divisors (TTTD).

It runs a rolling checksum over the input data with two hash algorithms, one being Adler32 and the other SHA1 (used once a chunk split position has been found). When it has received enough data (1856 bytes by default), it starts trying to divide the Adler32 rolling checksum by the low divisor (that is, a divisor that divides a lot of numbers). If the division succeeds, it remembers that there is a backup split position here. It also tries to divide by the high divisor (which has far less chance to succeed).
If that succeeds while still within the good chunk size range, a chunk is emitted.
Otherwise it falls back to the low divisor's position.
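
A rough sketch of that decision loop (the divisor and maximum-size values, parameter names and the rollOneByte callback are illustrative assumptions here, not the actual TTTDChunker.cpp code):

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative TTTD parameters; the divisor and maximum values are made up for the example.
struct TTTDParams
{
    size_t   minSize     = 1856;    // never split before this many bytes
    size_t   maxSize     = 32768;   // force a split at this size at the latest
    uint32_t lowDivisor  = 269;     // divides often => remembers backup split positions
    uint32_t highDivisor = 1021;    // divides rarely => preferred split positions
};

// Returns the split position for the next chunk in buf[0..len), or len if no split was found yet.
// rollOneByte folds one more byte into the rolling checksum state and returns the new state.
size_t findSplit(const uint8_t* buf, size_t len,
                 uint32_t (*rollOneByte)(uint32_t state, uint8_t byte),
                 const TTTDParams& p)
{
    uint32_t checksum = 1;   // rolling checksum state (Adler32-like seed)
    size_t   backup   = 0;   // last position where the low divisor matched

    for (size_t i = 0; i < len; ++i)
    {
        checksum = rollOneByte(checksum, buf[i]);
        if (i + 1 < p.minSize) continue;            // restriction 1: minimum chunk size

        if (checksum % p.lowDivisor == 0)
            backup = i + 1;                         // remember a fallback split position
        if (checksum % p.highDivisor == 0)
            return i + 1;                           // preferred split found while in range: emit
        if (i + 1 >= p.maxSize)                     // restriction 2: maximum chunk size
            return backup ? backup : i + 1;         // fall back to the low divisor's position
    }
    return len;
}
```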

You can have a look at the file TTTDChunker.cpp for the algorithm; it's very easy to follow.

Now, to answer why it did not find a "common pattern" in your example: I guess it's because the file is too small for the algorithm to find a "common" divisor. It only accepts a minimum chunk of 1856 bytes, so it cannot give you an 11-byte chunk and thus split at 11 bytes. Even after 1856 bytes, the algorithm must "forget" the 11 bytes of "insertion" pollution before it can recover to a viable divisor.
Remember that you are folding a huge input into a 32-bit number. Depending on the starting point, it might never find a divisor (in the worst case).

Try with a huge file (at least a few MB) and you'll see it does recover after some thousands of bytes. This is one of the tests in the application's test suite: it generates a big random file, backs it up, inserts and modifies some bytes (but none after the first half of the file), and backs it up again. It usually recovers and avoids backing up the complete file twice.

The reason I made it this way is the compromise between deduplication performance and the size of the index. Allowing small chunks really explodes the index's size, and then performance degrades quickly when backing up big folders.

If you change the minimum size (in the TTTDChunker options) to, let's say, 10 bytes, then it'll split after "hello world" and work as you expect, but it will not perform well on a large file base, since a huge number of chunks will be generated and searched for each new chunk found. Typically, dividing the minimum chunk size by 10 multiplies the number of chunks in the index by 10, and multiplies the processing time by ln(10) ≈ 2.3.

So in your example, with a TTTD minimum size of 10, you'd have ~200x more chunks in your index and a ~6x longer backup procedure. I'm not sure it makes sense.
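
(Back-of-the-envelope, under that same scaling assumption: 1856 / 10 ≈ 186, so roughly 200x more chunks, and ln(186) ≈ 5.2, so roughly 5-6x more lookup work per chunk.)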

bdb commented on September 26, 2024

But if you are rolling a checksum window of, say, 64 bytes across the whole input, then the chunks should eventually match even with a small change in the file, because that small window decides the break points, not the chunk size. (You would just not attempt to break until your minimum chunk size is reached.)

Plus, I don't see where your Adler window is. It appears to be a straight Adler32 implementation, which is not rolling.

This was from a quick look at the file mentioned, so again, maybe I'm missing something.

Thanks for your time.

X-Ryl669 commented on September 26, 2024

I'm not sure we are speaking about the same thing. A rolling checksum is an algorithm that adds one byte to the checksum, updating it as it goes, but it does not remove the previously summed bytes.

It's used here
It's well defined for Adler32 (check Adler32.cpp for the implementation)

It's being used because one does not need to recompute the complete checksum of a fixed window for each byte, as you seem to imply, so it's very fast.

Typically, the rolling checksum of a string at position p is the same as the plain checksum of midstring(0, p). It's not the checksum of midstring(p - 64, p) or whatever the window size would be.

Adding one byte to the (rolling) checksum is fast (unlike with a usual hash function) and you immediately get the checksum's value (also unlike a usual hash function). Remember that this must be O(n), since it's done over the whole backup data set.
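
To make it concrete, the incremental update is tiny; here is a generic sketch (not a copy of Frost's Adler32.cpp):

```cpp
#include <cstdint>

// Generic incremental Adler32: fold one more byte into the running state.
// O(1) per byte, hence O(n) over the whole backup data set.
struct Adler32State
{
    static constexpr uint32_t MOD = 65521;   // largest prime below 2^16
    uint32_t s1 = 1, s2 = 0;                 // standard Adler32 seed

    void roll(uint8_t byte)
    {
        s1 = (s1 + byte) % MOD;              // sum of all bytes seen so far (+1)
        s2 = (s2 + s1)   % MOD;              // sum of all intermediate s1 values
    }
    uint32_t value() const { return (s2 << 16) | s1; }
};
```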

If you had a 64-byte window, then every time any 64 bytes were similar it would produce the same checksum. However, that would trigger computing SHA1 for the past data, which would fail to match the chunk's hash against a previous one (most of the time), and you'd have lost a lot of time rejecting the checksum. Computing SHA1 is expensive; you can't do that for every 64 bytes of the backup data set.

That being said, your idea makes sense for a different algorithm that would use a bunch of rolling-checksum instances (each shifted by some fixed number of bytes) to find out whether any of them divides the threshold, and stop on the first one that does. That would trigger on a "one byte" insertion case (up to the size of the bunch), but it would increase memory consumption by a large margin (at least the number of instances required times Adler32's state size). It seems unrealistic to have 1856 instances, for example. I'll try to figure out whether it makes sense and how to test an implementation of your idea.
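
Purely to illustrate the idea (a hypothetical sketch, not something Frost does today; all names and the divisor test are assumptions):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// K running Adler32 states, the k-th one starting k*stride bytes later.
// Memory cost grows with the number of instances times the per-instance state.
struct ShiftedChecksumBank
{
    static constexpr uint32_t MOD = 65521;
    struct State { uint32_t s1 = 1, s2 = 0; };

    std::vector<State> states;
    size_t stride;
    size_t seen = 0;

    ShiftedChecksumBank(size_t instances, size_t strideBytes)
        : states(instances), stride(strideBytes) {}

    // Feed one byte; returns true if any active instance's checksum divides evenly.
    bool roll(uint8_t byte, uint32_t divisor)
    {
        bool hit = false;
        for (size_t k = 0; k < states.size(); ++k)
        {
            if (seen < k * stride) continue;          // this instance hasn't started yet
            State& st = states[k];
            st.s1 = (st.s1 + byte) % MOD;
            st.s2 = (st.s2 + st.s1) % MOD;
            if (((st.s2 << 16) | st.s1) % divisor == 0)
                hit = true;
        }
        ++seen;
        return hit;
    }
};
```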

X-Ryl669 commented on September 26, 2024

Ok, I can try changing this in a new branch with some (smaller) window. The sum being easy to compute, this would only add about four additions (actually subtractions) per byte.
I've found this: https://software.intel.com/en-us/articles/fast-computation-of-adler32-checksums; I might implement it if the straightforward version turns out to be too slow.
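
For reference, the per-byte window update I have in mind would look roughly like this (window size handling and names are assumptions; the "- 1" term compensates for Adler32's initial seed of 1):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Windowed ("truly rolling") Adler32 sketch: each new byte is added and the byte
// falling out of the window is removed, a handful of additions/subtractions per byte.
class RollingAdler32
{
    static constexpr uint32_t MOD = 65521;
    uint32_t s1 = 1, s2 = 0;
    std::vector<uint8_t> window;   // circular buffer of the last N bytes
    size_t pos = 0, count = 0;

public:
    explicit RollingAdler32(size_t windowSize) : window(windowSize, 0) {}

    void roll(uint8_t in)
    {
        const uint32_t n = (uint32_t)window.size();
        if (count < window.size())
        {
            // Window not full yet: plain running Adler32 update.
            s1 = (s1 + in) % MOD;
            s2 = (s2 + s1) % MOD;
            window[pos] = in; pos = (pos + 1) % window.size(); ++count;
            return;
        }
        const uint32_t out = window[pos];
        window[pos] = in; pos = (pos + 1) % window.size();

        // Steady state: remove 'out', append 'in' (all arithmetic kept non-negative mod 65521).
        s1 = (s1 + in + MOD - out) % MOD;
        s2 = (s2 + s1 + 2 * MOD - 1 - (n * out) % MOD) % MOD;
    }

    uint32_t value() const { return (s2 << 16) | s1; }
};
```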

That'll change the interface of TTTDChunker (to add an option: the rolling checksum's memory size) and the implementation of the Adler32 interface. This would break compatibility with a previous backup set that doesn't use the same option (requiring a new complete backup set).

I'll let you know when I'm done.

bdb commented on September 26, 2024

Ok, I think we are talking about two separate things with the same term. Plain Adler is a running hash, not a rolling one. To make it rolling you need to add the sliding window. Bup does this.

And this explains the poor dedupe I was seeing. With a rolling hash, only 1 (at most 2) chunks would have been new instead of all 17.

Also, the rolling hash doesn't change what you SHA; it just determines your chunk break points. So there's pretty much no speed penalty.

As an aside, SHA1 is pretty damn fast on modern machines, and SHA2 for that matter. Both are hardware accelerated.

Thanks for looking into this.

X-Ryl669 commented on September 26, 2024

Ok, I've fixed this locally, and I'm also integrating new features for a new release soon.
