chunker's Introduction

Introduction

restic is a backup program that is fast, efficient and secure. It supports the three major operating systems (Linux, macOS, Windows) and a few smaller ones (FreeBSD, OpenBSD).

For detailed usage and installation instructions check out the documentation.

You can ask questions in our Discourse forum.

Quick start

Once you've installed restic, start off by creating a repository for your backups:

$ restic init --repo /tmp/backup
enter password for new backend:
enter password again:
created restic backend 085b3c76b9 at /tmp/backup
Please note that knowledge of your password is required to access the repository.
Losing your password means that your data is irrecoverably lost.

and add some data:

$ restic --repo /tmp/backup backup ~/work
enter password for repository:
scan [/home/user/work]
scanned 764 directories, 1816 files in 0:00
[0:29] 100.00%  54.732 MiB/s  1.582 GiB / 1.582 GiB  2580 / 2580 items  0 errors  ETA 0:00
duration: 0:29, 54.47MiB/s
snapshot 40dc1520 saved

Next you can either use restic restore to restore files, or use restic mount to mount the repository via FUSE and browse the files from previous snapshots.

For more options check out the online documentation.

Backends

Saving a backup on the same machine is nice but not a real backup strategy. Therefore, restic supports the following backends for storing backups natively:

  • Local directory

  • sftp server (via SSH)

  • HTTP REST server

  • AWS S3 (either from Amazon or using the Minio server)

  • OpenStack Swift

  • Backblaze B2

  • Microsoft Azure Blob Storage

  • Google Cloud Storage

Design Principles

Restic is a program that does backups right and was designed with the following principles in mind:

  • Easy: Doing backups should be a frictionless process, otherwise you might be tempted to skip it. Restic should be easy to configure and use, so that, in the event of a data loss, you can just restore your data. Likewise, restoring data should not be complicated.

  • Fast: Backing up your data with restic should only be limited by your network or hard disk bandwidth, so that you can back up your files every day. Nobody does backups if it takes too much time. Restoring backups should only transfer the data needed for the files being restored, so that this process is also fast.

  • Verifiable: Much more important than backup is restore, so restic enables you to easily verify that all data can be restored.

  • Secure: Restic uses cryptography to guarantee confidentiality and integrity of your data. The location where the backup data is stored is assumed not to be a trusted environment (e.g. a shared space where others, like system administrators, are able to access your backups). Restic is built to secure your data against such attackers.

  • Efficient: With the growth of data, additional snapshots should only take the storage space of the actual increment. Even more, duplicate data should be de-duplicated before it is actually written to the storage back end, to save precious backup space.

Reproducible Builds

The binaries released with each restic version starting at 0.6.1 are reproducible, which means that you can reproduce a byte-identical version from the source code for that release. Instructions on how to do that are contained in the builder repository.

News

You can follow the restic project on Mastodon @resticbackup or subscribe to the project blog.

License

Restic is licensed under the BSD 2-Clause License. You can find the complete text in LICENSE.

Sponsorship

Backend integration tests for Google Cloud Storage and Microsoft Azure Blob Storage are sponsored by AppsCode!


chunker's People

Contributors

aparcar, cathalgarvey, chmduquesne, fd0, fw42, golint-fixer, michaeleischer, muesli, nak3, savetherbtz, wjiec


chunker's Issues

chunker.Next with no allocations

Chunker.Next([]byte) consumes a slice on every iteration in order to copy the data for a particular chunk out of the internal buffer. In restic, sync.Pool is used to decrease GC pressure, but there are a lot of use cases where it's possible to read the initial io.Reader again and reuse the memory it occupies. The resulting Chunk.Data would then be just a reference into the initial data slice, not a copy of it.

I would like to propose an additional method (something like Chunker.NextNoCopy() or Chunker.NextMeta()) that produces a Chunk with empty Data. The client would then be responsible for filling in Data on its own (according to the Chunk metadata, Start and Length). Would you agree with such an interface extension? A sketch of the intended usage is shown below.
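
A minimal caller-side sketch of the proposal, assuming a hypothetical NextNoCopy method that returns metadata only (the polynomial is the example value from the package documentation):

    package main

    import (
        "bytes"
        "io"
        "os"

        "github.com/restic/chunker"
    )

    func main() {
        // The whole input is already in memory, so chunk payloads can be
        // sliced out of it instead of being copied out of the internal buffer.
        data, err := os.ReadFile("input.bin")
        if err != nil {
            panic(err)
        }

        c := chunker.New(bytes.NewReader(data), chunker.Pol(0x3DA3358B4DC173))
        for {
            chunk, err := c.NextNoCopy() // hypothetical: Data is left nil
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            payload := data[chunk.Start : chunk.Start+chunk.Length] // no copy
            _ = payload // hash/store payload here
        }
    }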

Kind regards,
Vitaly

Chunk size distribution problem.

I've tried to use restic's chunker instead of FastCDC and noticed that compression ratios dropped substantially. Looking deeper into the issue, I found that most of the chunks produced were right at the lower bound, MinSize (which is set pretty low for my use case: 16384).

I've narrowed down the issue to

if (digest&c.splitmask) == 0 || add >= maxSize {

Changing the code to compare against a different constant (e.g. == 1) fixes the problem, so the distribution of the digest values is likely to blame here. The math in the chunker is above my ability to review, but I would assume that for the chunker to work reasonably well, the digest should be uniformly distributed.
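
One way to check for this kind of skew is to count how many chunks come out at exactly the minimum size. A minimal sketch against the package's documented API (the polynomial is the example value from the chunker documentation; this uses the default boundaries rather than the 16384 MinSize mentioned above):

    package main

    import (
        "fmt"
        "io"
        "os"

        "github.com/restic/chunker"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()

        c := chunker.New(f, chunker.Pol(0x3DA3358B4DC173))
        buf := make([]byte, chunker.MaxSize)

        total, atMin := 0, 0
        for {
            chunk, err := c.Next(buf)
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            total++
            if chunk.Length == chunker.MinSize {
                atMin++ // chunks cut at the lower bound indicate the skew described above
            }
        }
        fmt.Printf("%d chunks, %d exactly at MinSize\n", total, atMin)
    }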

Configurable averageBits

@fd0 thanks for a wonderful library. Could you please clarify why you decided to make averageBits a private constant? The lower averageBits is, the larger the number of smaller chunks that will be identified. This may be very convenient for third-party clients like me :) Would you consider a PR providing this feature? A sketch of the idea is shown below.
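
For illustration, a sketch of what such a change inside the chunker package could look like (the method name and field access are assumptions, not the current API). The chunker splits when digest & splitmask == 0, so with splitmask = (1 << averageBits) - 1 a boundary fires with probability 2^-averageBits per position, giving an expected chunk size of roughly 2^averageBits bytes:

    package chunker

    // SetAverageBits is a hypothetical setter: exposing averageBits instead
    // of keeping it a private constant. Lowering it shrinks the split mask
    // and therefore the expected chunk size (e.g. 16 -> ~64 KiB average
    // instead of 20 -> ~1 MiB).
    func (c *Chunker) SetAverageBits(averageBits int) {
        c.splitmask = uint64(1)<<uint(averageBits) - 1
    }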

Let the chunker return the content of the chunks

At the moment, the chunker only returns the breakpoints found in a stream of data.

type Chunk struct {
    Start  uint
    Length uint
    Cut    uint64
    Digest []byte
}

This is fine for chunking files, which can be read a first time for finding chunks and a second time for persisting them. However, it is not appropriate when dealing with a stream coming from stdin, such as a mysql dump. In that case, the chunker should return chunks that contain the actual bytes of the chunk (a usage sketch follows the struct below).

type Chunk struct {
    Start   uint
    Length  uint
    Cut     uint64
    Digest  []byte
    Content []byte
}
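
For reference, a sketch of how a single-pass stream could be consumed once the chunk payload is returned along with the metadata; here via the Chunk.Data field that Next([]byte) fills from a caller-supplied buffer, as described in the first issue above (the polynomial is the example value from the package documentation):

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "os"

        "github.com/restic/chunker"
    )

    func main() {
        // stdin can only be read once, so the chunk bytes must come back
        // together with the boundary metadata.
        c := chunker.New(bufio.NewReader(os.Stdin), chunker.Pol(0x3DA3358B4DC173))
        buf := make([]byte, chunker.MaxSize)
        for {
            chunk, err := c.Next(buf)
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            // chunk.Data holds the chunk's bytes; persist them here.
            fmt.Printf("chunk at %d, %d bytes\n", chunk.Start, chunk.Length)
        }
    }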

Chunker on data containing series of zeroes

The purpose of the chunker is to find markers in the data. It is best if those markers do not appear with a probability that is above expectations.

Currently, there is a problem with how your chunker works, as it looks for markers whose hash ends in zeroes. I want to point out that there is a very common marker which has this property: a series of zeroes. It hashes exactly to uint64(0).

I suggest a change in your code:

if (digest&splitmask) == 0

Should be replaced by something like:

if (digest&splitmask) == splitValue

Where splitValue is a fixed random number (preferably not a special or common value). A toy demonstration is shown below.
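
A toy demonstration of the effect, assuming a simplified rolling update that (like the real fingerprint) stays at zero while consuming zero bytes; the hash and constants here are stand-ins, not the chunker's Rabin implementation:

    package main

    import "fmt"

    func main() {
        const averageBits = 4
        const splitmask = uint64(1)<<averageBits - 1
        const splitValue = uint64(0x5) // fixed random value, must fit in splitmask

        var digest uint64
        data := make([]byte, 8) // a run of zero bytes
        for i, b := range data {
            digest = digest<<1 | uint64(b) // toy update: stays 0 on zeroes
            if digest&splitmask == 0 {
                fmt.Printf("zero criterion: boundary at offset %d\n", i)
            }
            if digest&splitmask == splitValue {
                fmt.Println("splitValue criterion: never fires on zeroes")
            }
        }
    }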

chunker.Next: read gives input/output error

I don't know why, or if this is important, but on my system I always get the following errors.

error for /home/ogarcia/.config/chromium/Profile 1/IndexedDB/https_drive.google.com_0.indexeddb.leveldb/000005.ldb: chunker.Next: read /home/ogarcia/.config/chromium/Profile 1/IndexedDB/https_drive.google.com_0.indexeddb.leveldb/000005.ldb: input/output error
[260B blob data]
error for /home/ogarcia/.config/chromium/Profile 1/IndexedDB/https_telefonica.webex.com_0.indexeddb.leveldb/000003.log: chunker.Next: read /home/ogarcia/.config/chromium/Profile 1/IndexedDB/https_telefonica.webex.com_0.indexeddb.leveldb/000003.log: input/output error
[268B blob data]

Both files are binary; I can read and copy them without problems.

  File: https_drive.google.com_0.indexeddb.leveldb/000005.ldb
  Size: 4318689   	Blocks: 8440       IO Block: 4096   regular file
Device: 33h/51d	Inode: 8113        Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/ ogarcia)   Gid: ( 1000/ ogarcia)
Access: 2018-04-16 09:05:18.159071948 +0200
Modify: 2017-04-03 16:21:55.638453896 +0200
Change: 2018-03-23 19:25:40.499529114 +0100
 Birth: -
  File: https_telefonica.webex.com_0.indexeddb.leveldb/000003.log
  Size: 1688401   	Blocks: 3304       IO Block: 4096   regular file
Device: 33h/51d	Inode: 8209        Links: 1
Access: (0600/-rw-------)  Uid: ( 1000/ ogarcia)   Gid: ( 1000/ ogarcia)
Access: 2018-04-16 09:05:18.358075491 +0200
Modify: 2018-01-29 17:23:20.571128436 +0100
Change: 2018-03-23 19:25:41.487529181 +0100
 Birth: -

Should I use chunker in production?

I'm about to implement a file storage web application. We are planning to implement versioning of files soon. It will require diffing, and we will store data in chunks for deduplication (we store old blocks together with new blocks as a revision).

We will have both client-based (smartphone, PC) and server-based deduplication, as we'll have web uploads too. I had the idea of executing chunker from our program, which is not written in Go. We'll serve millions of users and will have petabyte-scale data stores for users.

Can I use chunker to accomplish the above tasks and ship it in production?

I also need help designing the versioning index (pointers to chunks/bins), as I couldn't come up with a better versioning design. I ran into pointer-to-block redundancy, as a new version will repeat the old blocks alongside the new blocks.

Thanks.

Separate the rolling hash function from the chunker

At the moment, the rolling hash function is integrated into the chunker. I think it would make sense to separate the rolling hash function from the chunker and declare interfaces for both. This would allow implementing and comparing different kinds of chunking algorithms (fixed size, thresholds, etc.) and different rolling hash functions (Rabin-Karp, buzhash, rolling Adler, etc.). A sketch of such interfaces is shown below.
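
An illustrative sketch of the proposed split; none of these names exist in the current package, they only show how the pieces could be decoupled:

    package chunking

    import "io"

    // RollingHash abstracts the fingerprint so Rabin-Karp, buzhash, or a
    // rolling Adler can be swapped in and compared.
    type RollingHash interface {
        Reset()
        Roll(b byte)   // slide the window forward by one byte
        Sum64() uint64 // current fingerprint over the window
    }

    // Chunk describes one boundary, whatever strategy produced it.
    type Chunk struct {
        Start  uint
        Length uint
    }

    // Chunker is implemented by content-defined, fixed-size, or
    // threshold-based strategies alike.
    type Chunker interface {
        Next(buf []byte) (Chunk, error)
    }

    // NewContentDefined would wire a RollingHash into the boundary search.
    func NewContentDefined(rd io.Reader, h RollingHash, min, max uint) Chunker {
        return nil // implementation elided in this sketch
    }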

some cut numbers are zeros

Hi, I have been doing some testing with this lib. The math is fascinating, though I'm still wrapping my head around that part.

So I have been running this on some large vmdk files and ISO images, and I have noticed that the cut value (third column below) is all zeros for a lot of chunks, but not for others. I am trying to work out, for my own knowledge, why this is happening, and whether this is a concern/bug or not. My math skills are a bit lacking here :(

start, length, cut, data sha256
0,524288,0000000000000000,28e4a81308369bd0aff1d2b22d892d04ce4a527f550377359459d3948d32d63d
524288,524288,0000000000000000,2a640f73aa78eb8763a5f018a087bd05d3736abd4986cc3e31aefb811cc94f37
1048576,973787,0000000000000000,edb77219e672dac7ce1672144d6a9ec53e7816b8eb4e5445a0774453850c49f4
2022363,524288,0000000000000000,782d745c64635fecdbb1135900d19291f509db504cc2da10e9a3da085be1f6df
2546651,524288,0000000000000000,ea3f4607a9d118329a9ffd559cb66028dc82b2ad29d56743f74231ef301d6583
3070939,524288,0000000000000000,79f64ccd636ba4cc30961138568580e30c39d16df72802fef651c721f1ae12e1
3595227,524288,0000000000000000,c31af1a65702f7c8c12b4dab3a89f67491a71e66ab072e1ff94cd2b2db2877c6
4119515,855180,0000000000000000,deee3fe183d9f6071aa2444a600d64083ff6877d1dd411ee016272a211d77bb0
4974695,1323860,0000000000000000,0e649631dad8491040858e846f9c89e3d00f8140cd196f2a0f501f5e94340789
6298555,524288,0000000000000000,c23d7b73fceba5bb05ce9940824a0e510b61a458a1a913bb27874232ffcf2aaf
6822843,849545,0000000000000000,dbf2962a6667e3d3f61bb8f192a29f2f9b1f2bc45fb82d1d919c527e5bbd15d5
7672388,1590272,0000000000000000,976f3baa39e9cc717c64820758ff1b9cfa902174671368c80cb77fcf3f5fdaa8
9262660,551276,0000000000000000,b8c560f3e8a0fc7540b7f993b5fdee2fddefc81969d410ac43025556a068e4c0
9813936,756302,001c2d1b34300000,ee095546549e7bac07e0a6bb66ab6e2e5298b4ec666c7d6b4edec86883a3e61e
10570238,573630,0000000000000000,fd3b486ab38e1099d3e99fa1911c8d6ffe5f8629bf8a69170104eeee33e5535c
11143868,591063,0000000000000000,e67d05dc0c87d1c93e252906816e734af542e226f9acfed920b2d95340d8b336
11734931,703008,0000000000000000,ef183eb9d66c97bfdb3fe0c447f19a8dbbce3b1d2907fe4d852a1a1815ebedf2
12437939,1735045,0000000000000000,a4e6953b704c6f135927637718027f2040112ae7c658ec336034d85c75af03e5
14172984,525899,0000000000000000,876e510585e0393cef81321e76e1a0f0c54734021bdaff7b20e3d02c7866d042
14698883,1160820,001c2bf278400000,fa594805c673c8383c5ad2abbf60f4a3bb82d0b48e2cc3eaea9611aca1f3e472
15859703,1379366,001db4ebf3900000,693a599241e0f1be475ccc3e38e5a9a7621a9b506d6e516baddb06f0af65232c
17239069,755131,001f8ba7d6600000,e856655192cde5781af5c56c62021b11657dff40f91083543f8c984130dc25c8
17994200,784832,0000000000000000,c53fac497eee05cd11553eb6abdfcd30f3d7cf183b5902ccf2d2f8f0939f46d8
18779032,524288,0000000000000000,8e813549020533c276d4e9be59355823c63284b1f278b64bb7a886e222e02378
19303320,602703,0000000000000000,c2140a64fa266b6356ba9bd6d301693eeebc5de7e909649c79eb8adf6b52d118
19906023,2779791,0000000000000000,377a1ad55e145fda2e69a19bc257af94f8f5168b127ed3be04aef915fd8faed1
22685814,2922869,001492e64e500000,8a8e2434db0d92faa88f529b26d6b0833934a93bc2109033ccc159ed7906ce62
25608683,693701,00113c4e0e000000,1c2ceb34bd22b29f9ff41be422a45d1fc154f4395652573179191849aad2c363
26302384,3420819,0010535a1d700000,29255b8d98898367443bf78e6ce66dedad7029a99f98b78458ef4caf32d7a577
29723203,1284896,0014b881b9700000,72cee412d6834390ff399f8764a5303b027ec560a874392d6cc0d51078c02564
31008099,1110826,0000000000000000,ac95d0772821cefcb3239057a53f359e04dc4793f5d08966c02aea31edfbd059
32118925,610052,001efe9c6ab00000,35b9061f17f6b1dd7382c93ba09dda25754afe8c3efe9f75e8b129bb580a504c
32728977,846558,00137eaa57f00000,5b002379337bd9c2595c93b4ac31ca42d77771aff85537f1943a209717b94a8b
33575535,6596661,0017c23658500000,93997224daeb73b02b2b8057ba586215fd6d3b180fbd723d88ea68ccae041e07
40172196,6895050,0011b6238d400000,6d36ebd9c1752354302f842b582b9b43e6c7eb8a9655b94315d3c260f3865bcc
47067246,2451102,0011c0316b400000,ff48b61821fe0d080fd5773b750046e38f26080743f32159b5cff289d825e464
49518348,1176209,000d3a5ef8700000,1b98059a6e92c211d81c06fea94c4cef47a6bfa37d340cd9b6e8ec66eebf3243
50694557,1234094,0008694076a00000,39989fc8ff08be64145d5e10a76f68f584474beca88093f59b01928a6ca3321a
51928651,998650,0019243585f00000,deb1ec69aa141e99c36305d952d4a5e381a5624577fd99dde29e292c3c48555f
52927301,2345517,001122cb64200000,4f6f365964fd3eb05cddf44af9bb21bb31c5008daae480515a3f85c9bacd445f
55272818,1468224,0008dabb19c00000,dd47b820ca2050a59de86a3061937e14b6921b37c90be1d2629c7f0d54bafd40
56741042,907787,000b95f594b00000,471c967a00a920485796fd59bd3fd16c964447f8459b99a7e82082f15d5b3919
57648829,903786,001a881f2b800000,6ec5226f65a75bbc116ef7dd57b9cc41ee8f9a829bf8caacb60784ffe8afb4c7
58552615,795319,000168f791400000,f877afc68cbce724e7cd64cfce510247c4955722b3890b48f2e77b6798f0a0d9
59347934,814157,0000ee44cd000000,edd8f812e62a5b158e12f764a3badca102d14fead26d7a7f2744d459c36f0a33
60162091,828363,000b295ed8b00000,2415db6814c57132d5876ed6b777f180f876567441353acfc2ff5c74dc180651
60990454,4107954,0006d096f0d00000,747716c0359a140118021c3b70c3e0e838c479ceaf5b90b5eeda64771438ba24
65098408,1290395,000e4f8da4a00000,57da0dbf272590726b9044bcc343662091ea452d4283cff489d166ffb1d38bdf
66388803,1021273,001d60c1b9d00000,1df47d9d5e8300deda94afc2612a90a4c98f2419d227b904965fc5da7fa01103
67410076,3309847,000053cff2d00000,424b5980be4fbb9ea8cca575ec770d12d49abbea1d96ce2dfa564ebd523b894d
70719923,661376,000ec23dc6d00000,cdd8d62d86788583ed574ab7f5f5cc0266de0ae44b88bedc69d2724af1e78cfe
71381299,2045453,0018315fd2d00000,ab6565f4b5c1bbcd584a46a161c7e56fbc56deed47a3211045acc89a9a8608b2
73426752,1410403,000a31abbed00000,992eb5bfe672aa456a16653d477065ef151501673502a13df09a928425bce748
74837155,867156,000aa5c0d4f00000,5a2edaf327288d74d52b78ec6799151512526892447b84ef5a16ce75ec6eba4f
75704311,819145,000b998b42500000,7457294c6f18b247c20126dbd0fd8a5969fb6c9640fd63d054d4ddd54cae7aa1
76523456,681149,000eaec51c000000,81932fa073682d3d7cc164624f51d788168d4205102ab1e8e7571b608cfcbc8c

Inserting 1 at the beginning of the hashed data

Hi,

I think that it is not useful to insert a 1 at the beginning of the hashed data when we are using the Rabin hash as a rolling hash.

Probably, the intent of this prefix is to make the number of leading zeroes in the hashed data count: because zeroes hash to zero, without the 1 prefix, [42] would hash to the same value as [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42].
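
A toy illustration of that collision, using a simple modular polynomial hash as a stand-in for the actual Rabin fingerprint:

    package main

    import "fmt"

    const p = 1000003 // toy prime modulus, not the chunker's polynomial

    func hash(data []byte, prefixOne bool) uint64 {
        var h uint64
        if prefixOne {
            h = 1 // the prefix under discussion
        }
        for _, b := range data {
            h = (h*256 + uint64(b)) % p
        }
        return h
    }

    func main() {
        a := []byte{42}
        b := []byte{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 42}

        // Without the prefix, leading zeroes are invisible: both print 42.
        fmt.Println(hash(a, false), hash(b, false))
        // With the 1 prefix, the zeroes shift the 1 through the state and
        // the two inputs hash differently.
        fmt.Println(hash(a, true), hash(b, true))
    }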

release

Would you consider adding a release of chunker? I would be happy to help with the Debian packaging.

Thanks!
