fastcdc-rs's Introduction

FastCDC

This crate contains multiple implementations of the "FastCDC" content defined chunking algorithm originally described in 2016, and later enhanced in 2020, by Wen Xia, et al. A critical aspect of its behavior is that it is deterministic: given the same input, it always produces exactly the same chunk boundaries. To learn more about content defined chunking and its applications, see the reference material linked below.

Requirements

  • Rust stable (2018 edition)

Building and Testing

$ cargo clean
$ cargo build
$ cargo test

Example Usage

The examples directory of the source repository contains programs that demonstrate finding chunk boundaries in a given file. There are both streaming and non-streaming examples; the non-streaming examples use the memmap2 crate to read large files efficiently.

$ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/v2020 --size 16384 test/fixtures/SekienAkashita.jpg`
hash=17968276318003433923 offset=0 size=21325
hash=4098594969649699419 offset=21325 size=17140
hash=15733367461443853673 offset=38465 size=28084
hash=4509236223063678303 offset=66549 size=18217
hash=2504464741100432583 offset=84766 size=24700

Non-streaming

An example using FastCDC to find chunk boundaries in data loaded into memory:

let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for chunk in chunker {
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Streaming

Both the v2016 and v2020 modules have a streaming version of FastCDC named StreamCDC, which takes a Read and uses a byte vector with capacity equal to the specified maximum chunk size.

let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::StreamCDC::new(source, 4096, 16384, 65535);
for result in chunker {
    let chunk = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Async Streaming

The v2020 module has an async streaming version of FastCDC named AsyncStreamCDC, which takes an AsyncRead (both tokio and futures are supported via feature flags) and uses a byte vector with capacity equal to the specified maximum chunk size.

// This sketch assumes the tokio feature flag is enabled and runs inside an async context.
use futures::stream::StreamExt;

let source = tokio::fs::File::open("test/fixtures/SekienAkashita.jpg").await.unwrap();
let mut chunker = fastcdc::v2020::AsyncStreamCDC::new(source, 4096, 16384, 65535);
let stream = chunker.as_stream();
let chunks = stream.collect::<Vec<_>>().await;

for result in chunks {
    let chunk = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Migration from pre-3.0

If you were using a release of this crate from before the 3.0 release, you will need to make a small adjustment to continue using the same implementation as before.

Before the 3.0 release:

let chunker = fastcdc::FastCDC::new(&contents, 8192, 16384, 32768);

After the 3.0 release:

let chunker = fastcdc::ronomon::FastCDC::new(&contents, 8192, 16384, 32768);

The cut points produced will be identical to those of previous releases, since the ronomon implementation itself has not changed. Note, however, that the other implementations will produce different results.

Reference Material

The original algorithm from 2016 is described in FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication, while the improved "rolling two bytes each time" version from 2020 is detailed in The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems.

Other Implementations


fastcdc-rs's Issues

Provide hash in chunk struct

One limitation I'm running into is identifying chunks by a unique identifier without having to rehash the contents. I'm wondering if it would be useful to expose the hash value of the chunk in the Chunk struct? The cut function could potentially return (u32, usize), and that hash could then be included in the value returned by the iterator.
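A rough sketch of what that might look like; the field names and types are illustrative only, not the crate's confirmed API:

// Hypothetical shape of a Chunk that carries the hash computed at the cut
// point, so callers can identify chunks without rehashing the contents.
pub struct Chunk {
    pub hash: u32,     // hash value computed while finding the cut point (illustrative)
    pub offset: usize, // starting byte position within the source
    pub length: usize, // number of bytes in the chunk
}

// Iteration would then look like the existing examples, with the hash available:
// for chunk in chunker {
//     println!("hash={} offset={} length={}", chunk.hash, chunk.offset, chunk.length);
// }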

Thanks for the crate, it works well!

Test, example, API docs for AsyncStreamCDC

The newly introduced AsyncStreamCDC needs unit tests, API documentation, and a working example. The first question is how to supply an appropriate AsyncRead to the chunker, since tokio's AsyncRead trait is not compatible with the one from futures.
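One possible starting point for a test, sketched here on the assumption that a plain byte slice satisfies the AsyncRead bound (it implements the trait in both the futures and tokio ecosystems), so an in-memory buffer can stand in for a file:

// Sketch of a test body feeding an in-memory buffer to AsyncStreamCDC.
// Assumes one of the async feature flags is enabled; &[u8] implements
// AsyncRead for both tokio and futures, so no async file type is needed.
use futures::stream::StreamExt;

async fn chunk_in_memory() {
    let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
    let mut chunker = fastcdc::v2020::AsyncStreamCDC::new(contents.as_slice(), 4096, 16384, 65535);
    let stream = chunker.as_stream();
    let chunks = stream.collect::<Vec<_>>().await;
    for result in chunks {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}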

Hash Collisions

In testing, we've observed that some chunks experience hash collisions. Is there any way to avoid this?
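For context, the hash reported by the chunker is the rolling hash used to locate cut points, so collisions between chunks are expected. A common workaround, not something the crate itself provides, is to identify chunks by a cryptographic digest of their contents; a sketch assuming the sha2 crate as a dependency:

// Sketch: derive a collision-resistant chunk identifier by hashing the chunk
// contents with SHA-256 (sha2 crate assumed), instead of relying on the
// rolling hash used for cut-point detection.
use sha2::{Digest, Sha256};

let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for chunk in chunker {
    let start = chunk.offset as usize;
    let end = start + chunk.length as usize;
    let id = Sha256::digest(&contents[start..end]);
    println!("offset={} length={} id={:?}", chunk.offset, chunk.length, id);
}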

Purpose of Box in StreamCDC struct

Hey!

The StreamCDC struct currently contains a Box<dyn Read>, which requires an additional heap allocation and potentially prevents some compiler optimisations by obscuring the reader type.

Is there a particular reason for this Box that isn't immediately obvious, or would you accept a change to parameterise StreamCDC over an R: Read, such that monomorphisation can occur?
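A rough, purely illustrative sketch of what that parameterisation could look like (the type name is hypothetical and most fields are omitted):

// Illustrative only: storing the reader inline instead of behind Box<dyn Read>
// lets the compiler monomorphise the chunker for each concrete reader type.
use std::io::Read;

pub struct GenericStreamCDC<R: Read> {
    source: R,        // concrete reader, no extra heap allocation or dynamic dispatch
    buffer: Vec<u8>,  // reusable buffer sized to the maximum chunk size
}

impl<R: Read> GenericStreamCDC<R> {
    pub fn new(source: R, max_size: u32) -> Self {
        GenericStreamCDC {
            source,
            buffer: Vec::with_capacity(max_size as usize),
        }
    }
}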

Thanks for this crate! :)

Anyone have a link to the original QuickCDC paper?

I am pretty sure there exists research for a CDC algorithm named QuickCDC from at least 4 years ago (jrobhoward/quickcdc is seemingly based on it but references other research). However, I cannot find that paper anywhere, and the only results that turn up now are from 2021, which is too recent to be the same thing. If you have a link, please add a comment here. Thank you.

Version 3.1.0 breaks semver

I have the following error when building:

error[E0107]: missing generics for struct `fastcdc::v2020::StreamCDC`
   --> puzzlefs-lib/src/builder.rs:111:18
    |
111 |     mut chunker: StreamCDC,
    |                  ^^^^^^^^^ expected 1 generic argument

Semver link: https://semver.org/#summary
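For anyone hitting this, the fix on the caller's side is to supply the new generic argument; the function below is a hypothetical illustration of the pattern, not the actual puzzlefs-lib code:

// Either stay generic over the reader ...
fn handle_chunks<R: std::io::Read>(chunker: fastcdc::v2020::StreamCDC<R>) {
    for result in chunker {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
// ... or name a concrete reader type, for example:
//     fn handle_chunks(chunker: fastcdc::v2020::StreamCDC<std::fs::File>) { ... }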

Maybe size_hint() should return the estimated remaining chunk number

Hi there! 👋

I looked into the code of v2020::<FastCDC as Iterator>::size_hint(), which returns the estimated total number of chunks. But the documentation for Iterator::size_hint() says:

size_hint() returns the bounds on the remaining length of the iterator.

So it may be better to return the estimated number of remaining chunks:

    fn size_hint(&self) -> (usize, Option<usize>) {
        // NOTE: This intentionally returns the upper bound for both `size_hint`
        // values, as the upper bound doesn't actually seem to get used by `std`
        // and using the actual lower bound is practically guaranteed to require
        // a second capacity growth.
        let upper_bound = self.remaining / self.min_size;
        (upper_bound, Some(upper_bound))
    }

I'd be happy to make a PR if it is reasonable. ❤️

v2020 version is not 30-40% better than v2016 version?

In my benchmarks, I have not been able to reproduce the experimental result in the paper that the "rolling two bytes each time" optimization improves performance by 30-40%.

One benchmark is to use the example in this repo:

hyperfine \
  --warmup 2 \
  --parameter-list version v2016,v2020 \
  'cargo run --example {version} -- --size 16384 test/fixtures/SekienAkashita.jpg'

hyperfine reports ~50ms duration on my machine for both versions, with 1.00 +/- 0.02x performance difference.

I also ran a different benchmark on one of my own datasets: 2 GB of cache artifacts generated by a Bazel build. In this setup I used 8 threads and fastcdc-rs (comparing v2016 vs v2020) to chunk all 2 GB of files in parallel, just reading the files, chunking them, and throwing away the result, nothing else.

With 8 threads, v2020 was only 1.01x faster than v2016. With 1 thread, v2020 was 1.07x faster. Two observations about these results:

  • The best case (7% faster single-threaded performance) is much less than 30-40%
  • Multi-threaded performance (in which I saw only 1% improvement) is probably more relevant for a real-world data deduplication system

Maybe the README could mention this data point as a caveat, to help manage expectations? Something like "the authors claim a 30-40% performance improvement, but one benchmark using this rust implementation showed a more modest improvement (roughly a 1% improvement for multi-threaded chunking, or a 7% improvement for single-threaded chunking)."

Here is my benchmark as a gist in case anyone would like to test it with their own dataset. Copy it to src/main.rs in a clone of this repo, then run it like cargo add walkdir crossbeam-channel && cargo build --release && hyperfine --warmup 2 --parameter-list version 2016,2020 './target/release/fastcdc DATA_DIR' -- replacing DATA_DIR with your custom data directory, preferably one containing many files exceeding the minimum chunk size.

doc: make it clear the iterator returns Chunks

The connection between the FastCDC struct, the Iterator it implements, and the type of its Item is not clear at all; some explanation is needed to tie them together. Add a code example to the FastCDC struct to show the basic operation.
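Something along these lines could be added to the FastCDC documentation; a minimal sketch, assuming (as the examples above suggest) that the iterator's Item is the module's Chunk struct:

// FastCDC implements Iterator; every item it yields is a Chunk that
// describes one cut: the offset into the source slice and the chunk length.
let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for entry in chunker {
    let chunk: fastcdc::v2020::Chunk = entry; // the Item type is Chunk
    println!("offset={} length={}", chunk.offset, chunk.length);
}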

Req: supply a way to consume bytes as a stream to avoid a big input buffer

Background

We need to chunk files stored in OSS (cloud object storage). The files may be large, and the chunking server may not have enough memory to hold a whole file during the chunking process. We would therefore like to chunk the OSS input stream opened by the chunking process, using fastcdc-rs as the library that cuts the stream into chunks, keeping the chunking logic inside fastcdc-rs.

Request

  • If fastcdc-rs supplies a way to consume bytes as a stream, we can avoid the big input buffer (see the sketch below)
  • Expose this library through JNI or JNR so that Java programs can use it as well
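The streaming interface shown earlier in this README already covers this pattern: anything implementing Read can be handed to StreamCDC, which buffers only up to the maximum chunk size at a time. A sketch, with the remote source left abstract since the actual OSS client type is outside this crate:

// Sketch: chunk bytes arriving from any reader (a network stream, an OSS
// object body, and so on) without loading the whole object into memory.
use std::io::Read;

fn chunk_remote(reader: impl Read) {
    // StreamCDC reads from the supplied reader and buffers at most one
    // maximum-size chunk internally.
    let chunker = fastcdc::v2020::StreamCDC::new(reader, 4096, 16384, 65535);
    for result in chunker {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}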

Extremely low MIN,AVG,MAX on v2020

Hello! Thank you for your effort in maintaining this crate!

I'm currently using the ronomon version of the FastCDC algorithm and have been looking to switch over to v2020. I am having issues because the maximum allowed values for MIN, AVERAGE, and MAX have been lowered significantly, and the assertions checking them panic before I can recover: panicked at 'assertion failed: avg_size <= AVERAGE_MAX'

Here is the AVERAGE_MAX const per algorithm:

Ronomon: pub const AVERAGE_MAX: usize = 268_435_456;
v2020: pub const AVERAGE_MAX: u32 = 4_194_304;

Is there any reason the average max can only be set to 4 MB? Are these values specified in the whitepaper? It's quite limiting, and the number of chunks generated from one file would increase dramatically if I switched algorithms.

Cheers,
ciehanski

[feature] Support stream of bytes

It would be great if the crate could take a (possibly async) stream of bytes, do its own buffering, and return a stream of chunks. Or perhaps something more general, like accepting a struct implementing Read.

FastCDC Iterator size_hint

It seems like it would be a small optimization to supply Iterator::size_hint() to cut down on allocations when using collect(). Given the bounds on chunk sizes, this is straightforward to calculate: the number of chunks lies in [source.len() / max_size, source.len() / min_size].

One other consideration: from my bit of digging into Vec's FromIterator and reading https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187/2, it doesn't seem like the upper bound is ever used, so my preference would be to "incorrectly" supply the upper bound as both the lower and upper bound, avoiding the essentially guaranteed additional allocation that starting with the lower bound for capacity would incur.

If this change seems reasonable, I'd be happy to make a PR 🙂
