fastcdc-rs's Introduction

FastCDC

This crate contains multiple implementations of the "FastCDC" content defined chunking algorithm originally described in 2016, and later enhanced in 2020, by Wen Xia, et al. A critical aspect of its behavior is that it is deterministic: given the same input, it always produces exactly the same chunk boundaries. To learn more about content defined chunking and its applications, see the reference material linked below.

Requirements

  • Rust stable (2018 edition)

Building and Testing

$ cargo clean
$ cargo build
$ cargo test

Example Usage

The examples directory of the source repository contains programs that demonstrate finding chunk boundaries in a given file. There are both streaming and non-streaming examples; the non-streaming examples use the memmap2 crate to read large files efficiently.

$ cargo run --example v2020 -- --size 16384 test/fixtures/SekienAkashita.jpg
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/examples/v2020 --size 16384 test/fixtures/SekienAkashita.jpg`
hash=17968276318003433923 offset=0 size=21325
hash=4098594969649699419 offset=21325 size=17140
hash=15733367461443853673 offset=38465 size=28084
hash=4509236223063678303 offset=66549 size=18217
hash=2504464741100432583 offset=84766 size=24700

Non-streaming

An example using FastCDC to find chunk boundaries in data loaded into memory:

let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for chunk in chunker {
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Streaming

Both the v2016 and v2020 modules have a streaming version of FastCDC named StreamCDC, which takes a Read and uses a byte vector with capacity equal to the specified maximum chunk size.

let source = std::fs::File::open("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::StreamCDC::new(source, 4096, 16384, 65535);
for result in chunker {
    let chunk = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Async Streaming

The v2020 module has an async streaming version of FastCDC named AsyncStreamCDC, which takes an AsyncRead (both tokio and futures are supported via feature flags) and uses a byte vector with capacity equal to the specified maximum chunk size.

// This sketch assumes the tokio feature flag is enabled and runs inside an async context.
use futures::stream::StreamExt;

let source = tokio::fs::File::open("test/fixtures/SekienAkashita.jpg").await.unwrap();
let mut chunker = fastcdc::v2020::AsyncStreamCDC::new(source, 4096, 16384, 65535);
let stream = chunker.as_stream();
let chunks = stream.collect::<Vec<_>>().await;

for result in chunks {
    let chunk = result.unwrap();
    println!("offset={} length={}", chunk.offset, chunk.length);
}

Migration from pre-3.0

If you were using a release of this crate from before the 3.0 release, you will need to make a small adjustment to continue using the same implementation as before.

Before the 3.0 release:

let chunker = fastcdc::FastCDC::new(&contents, 8192, 16384, 32768);

After the 3.0 release:

let chunker = fastcdc::ronomon::FastCDC::new(&contents, 8192, 16384, 32768);

The cut points produced will be identical to those of previous releases, since the ronomon implementation itself has not changed. Note, however, that the other implementations will produce different results.

Reference Material

The original algorithm from 2016 is described in FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication, while the improved "rolling two bytes each time" version from 2020 is detailed in The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems.

Other Implementations


fastcdc-rs's Issues

Provide hash in chunk struct

One limitation I'm running into is identifying chunks by a unique identifier without having to rehash the contents. I'm wondering if it would be useful to expose the hash value of the chunk in the Chunk struct? The cut function could potentially return (u32, usize), and that hash could then be included in the value returned by the iterator.
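A rough sketch of what that might look like; the field names and types are illustrative only, not the crate's confirmed API:

// Hypothetical shape of a Chunk that carries the hash computed at the cut
// point, so callers can identify chunks without rehashing the contents.
pub struct Chunk {
    pub hash: u32,     // hash value computed while finding the cut point (illustrative)
    pub offset: usize, // starting byte position within the source
    pub length: usize, // number of bytes in the chunk
}

// Iteration would then look like the existing examples, with the hash available:
// for chunk in chunker {
//     println!("hash={} offset={} length={}", chunk.hash, chunk.offset, chunk.length);
// }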

Thanks for the crate, it works well!

Test, example, API docs for AsyncStreamCDC

The newly introduced AsyncStreamCDC needs unit tests, API documentation, and a working example. The first question is how to supply an appropriate AsyncRead to the chunker, since tokio's AsyncRead trait is not compatible with the one from futures.
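One possible starting point for a test, sketched here on the assumption that a plain byte slice satisfies the AsyncRead bound (it implements the trait in both the futures and tokio ecosystems), so an in-memory buffer can stand in for a file:

// Sketch of a test body feeding an in-memory buffer to AsyncStreamCDC.
// Assumes one of the async feature flags is enabled; &[u8] implements
// AsyncRead for both tokio and futures, so no async file type is needed.
use futures::stream::StreamExt;

async fn chunk_in_memory() {
    let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
    let mut chunker = fastcdc::v2020::AsyncStreamCDC::new(contents.as_slice(), 4096, 16384, 65535);
    let stream = chunker.as_stream();
    let chunks = stream.collect::<Vec<_>>().await;
    for result in chunks {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}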

Hash Collisions

In testing, we've observed that some chunks experience hash collisions. Is there any way to avoid this?
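For context, the hash reported by the chunker is the rolling hash used to locate cut points, so collisions between chunks are expected. A common workaround, not something the crate itself provides, is to identify chunks by a cryptographic digest of their contents; a sketch assuming the sha2 crate as a dependency:

// Sketch: derive a collision-resistant chunk identifier by hashing the chunk
// contents with SHA-256 (sha2 crate assumed), instead of relying on the
// rolling hash used for cut-point detection.
use sha2::{Digest, Sha256};

let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for chunk in chunker {
    let start = chunk.offset as usize;
    let end = start + chunk.length as usize;
    let id = Sha256::digest(&contents[start..end]);
    println!("offset={} length={} id={:?}", chunk.offset, chunk.length, id);
}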

Purpose of Box in StreamCDC struct

Hey!

The StreamCDC struct currently contains a Box<dyn Read>, which requires an additional heap allocation and potentially prevents some compiler optimisations by obscuring the reader type.

Is there a particular reason for this Box that isn't immediately obvious, or would you accept a change to parameterise StreamCDC over an R: Read, such that monomorphisation can occur?
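A rough, purely illustrative sketch of what that parameterisation could look like (the type name is hypothetical and most fields are omitted):

// Illustrative only: storing the reader inline instead of behind Box<dyn Read>
// lets the compiler monomorphise the chunker for each concrete reader type.
use std::io::Read;

pub struct GenericStreamCDC<R: Read> {
    source: R,        // concrete reader, no extra heap allocation or dynamic dispatch
    buffer: Vec<u8>,  // reusable buffer sized to the maximum chunk size
}

impl<R: Read> GenericStreamCDC<R> {
    pub fn new(source: R, max_size: u32) -> Self {
        GenericStreamCDC {
            source,
            buffer: Vec::with_capacity(max_size as usize),
        }
    }
}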

Thanks for this crate! :)

Anyone have a link to the original QuickCDC paper?

I am pretty sure there exists research for a CDC algorithm named QuickCDC from at least 4 years ago (jrobhoward/quickcdc is seemingly based on it but references other research). However, I cannot find that paper anywhere, and the only results that turn up now are from 2021, which is too recent to be the same thing. If you have a link, please add a comment here. Thank you.

Version 3.1.0 breaks semver

I have the following error when building:

error[E0107]: missing generics for struct `fastcdc::v2020::StreamCDC`
   --> puzzlefs-lib/src/builder.rs:111:18
    |
111 |     mut chunker: StreamCDC,
    |                  ^^^^^^^^^ expected 1 generic argument

Semver link: https://semver.org/#summary
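For anyone hitting this, the fix on the caller's side is to supply the new generic argument; the function below is a hypothetical illustration of the pattern, not the actual puzzlefs-lib code:

// Either stay generic over the reader ...
fn handle_chunks<R: std::io::Read>(chunker: fastcdc::v2020::StreamCDC<R>) {
    for result in chunker {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}
// ... or name a concrete reader type, for example:
//     fn handle_chunks(chunker: fastcdc::v2020::StreamCDC<std::fs::File>) { ... }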

Maybe size_hint() should return the estimated remaining chunk number

Hi there! 👋

I looked into the code of v2020::<FastCDC as Iterator>::size_hint(), which returns the estimated total number of chunks. But the documentation for Iterator::size_hint() says:

size_hint() returns the bounds on the remaining length of the iterator.

So it may be better to return the estimated number of remaining chunks:

    fn size_hint(&self) -> (usize, Option<usize>) {
        // NOTE: This intentionally returns the upper bound for both `size_hint`
        // values, as the upper bound doesn't actually seem to get used by `std`
        // and using the actual lower bound is practically guaranteed to require
        // a second capacity growth.
        let upper_bound = self.remaining / self.min_size;
        (upper_bound, Some(upper_bound))
    }

I'd be happy to make a PR if it is reasonable. ❤️

v2020 version is not 30-40% better than v2016 version?

In my benchmarks, I have not been able to reproduce the experimental result in the paper that the "rolling two bytes each time" optimization improves performance by 30-40%.

One benchmark is to use the example in this repo:

hyperfine \
  --warmup 2 \
  --parameter-list version v2016,v2020 \
  'cargo run --example {version} -- --size 16384 test/fixtures/SekienAkashita.jpg'

hyperfine reports ~50ms duration on my machine for both versions, with 1.00 +/- 0.02x performance difference.

I also ran a different benchmark on one of my own datasets: 2 GB of cache artifacts generated by a Bazel build. In this setup I used 8 threads and fastcdc-rs (comparing v2016 vs v2020) to chunk all 2 GB of files in parallel, just reading the files, chunking them, and throwing away the result, nothing else.

With 8 threads, v2020 was only 1.01x faster than v2016. With 1 thread, v2020 was 1.07x faster. Two observations about these results:

  • The best case (7% faster single-threaded performance) is much less than 30-40%
  • Multi-threaded performance (in which I saw only 1% improvement) is probably more relevant for a real-world data deduplication system

Maybe the README could mention this data point as a caveat, to help manage expectations? Something like "the authors claim a 30-40% performance improvement, but one benchmark using this rust implementation showed a more modest improvement (roughly a 1% improvement for multi-threaded chunking, or a 7% improvement for single-threaded chunking)."

Here is my benchmark as a gist in case anyone would like to test it with their own dataset. Copy it to src/main.rs in a clone of this repo, then run it like cargo add walkdir crossbeam-channel && cargo build --release && hyperfine --warmup 2 --parameter-list version 2016,2020 './target/release/fastcdc DATA_DIR' -- replacing DATA_DIR with your custom data directory, preferably one containing many files exceeding the minimum chunk size.

doc: make it clear the iterator returns Chunks

The connection between the FastCDC struct, the Iterator it implements, and the type of its Item is not clear at all; some explanation is needed to tie them together. Add a code example to the FastCDC struct to show the basic operation.
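Something along these lines could be added to the FastCDC documentation; a minimal sketch, assuming (as the examples above suggest) that the iterator's Item is the module's Chunk struct:

// FastCDC implements Iterator; every item it yields is a Chunk that
// describes one cut: the offset into the source slice and the chunk length.
let contents = std::fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
let chunker = fastcdc::v2020::FastCDC::new(&contents, 16384, 32768, 65536);
for entry in chunker {
    let chunk: fastcdc::v2020::Chunk = entry; // the Item type is Chunk
    println!("offset={} length={}", chunk.offset, chunk.length);
}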

Req: supply a way to consume bytes as a stream to avoid a big input buffer

Background

We need to chunk files stored in OSS (cloud object storage). The files may be large, and the chunking server may not have enough memory to hold a whole file during the chunking process. We would therefore like to chunk the OSS input stream opened by the chunking process, using fastcdc-rs as the library that cuts the stream into chunks, keeping the chunking logic inside fastcdc-rs.

Request

  • If fastcdc-rs supplies a way to consume bytes as a stream, we can avoid the big input buffer (see the sketch below)
  • Expose this library through JNI or JNR so that Java programs can use it as well
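The streaming interface shown earlier in this README already covers this pattern: anything implementing Read can be handed to StreamCDC, which buffers only up to the maximum chunk size at a time. A sketch, with the remote source left abstract since the actual OSS client type is outside this crate:

// Sketch: chunk bytes arriving from any reader (a network stream, an OSS
// object body, and so on) without loading the whole object into memory.
use std::io::Read;

fn chunk_remote(reader: impl Read) {
    // StreamCDC reads from the supplied reader and buffers at most one
    // maximum-size chunk internally.
    let chunker = fastcdc::v2020::StreamCDC::new(reader, 4096, 16384, 65535);
    for result in chunker {
        let chunk = result.unwrap();
        println!("offset={} length={}", chunk.offset, chunk.length);
    }
}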

Extremely low MIN,AVG,MAX on v2020

Hello! Thank you for your effort in maintaining this crate!

I'm currently using the ronomon version of the FastCDC algorithm and have been looking to switch over to v2020. I am having issues because the maximum allowed values for MIN, AVERAGE, and MAX have been lowered significantly, and the assertions checking them panic before I can recover: panicked at 'assertion failed: avg_size <= AVERAGE_MAX'

Here is the AVERAGE_MAX const per algorithm:

Ronomon: pub const AVERAGE_MAX: usize = 268_435_456;
v2020: pub const AVERAGE_MAX: u32 = 4_194_304;

Is there any reason the average max can only be set to 4 MB? Are these values specified in the whitepaper? It's quite limiting, and the number of chunks generated from one file would increase dramatically if I switched algorithms.

Cheers,
ciehanski

[feature] Support stream of bytes

It would be great if the crate could take a (possibly async) stream of bytes, do its own buffering, and return a stream of chunks. Or perhaps something more general, like accepting a struct implementing Read.

FastCDC Iterator size_hint

It seems like it would be a small optimization to supply Iterator::size_hint() to cut down on allocations when using collect(). Given the bounds on chunk sizes, this is straightforward to calculate: the number of chunks lies in [source.len() / max_size, source.len() / min_size].

One other consideration: from my bit of digging into Vec's FromIterator and reading https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187/2, it doesn't seem like the upper bound is ever used, so my preference would be to "incorrectly" supply the upper bound as both the lower and upper bound, avoiding the essentially guaranteed additional allocation that starting with the lower bound for capacity would incur.

If this change seems reasonable, I'd be happy to make a PR 🙂
