conserve's Introduction

Conserve: a robust backup program

https://github.com/sourcefrog/conserve/

Maturity: Beta

Conserve's guiding principles:

  • Safe: Conserve is written in Rust, a fast systems programming language with compile-time guarantees about types, memory safety, and concurrency. Conserve uses a conservative log-structured format.

  • Robust: If one file is corrupted in storage or due to a bug in Conserve, or if the backup is interrupted, you can still restore what was written. (Conserve doesn't need a large transaction to complete for data to be accessible.)

  • Careful: Backup data files are never touched or altered after they're written, unless you choose to purge them.

  • When you need help now: Restoring a subset of a large backup is fast, because it doesn't require reading the whole backup.

  • Always making progress: Even if the backup process or its network connection is repeatedly killed, Conserve can quickly pick up where it left off and make forward progress.

  • Ready for today: The storage format is fast and reliable on high-latency, limited-capability, unlimited-capacity, eventually-consistent cloud object storage.

  • Fast: Conserve exploits Rust's fearless concurrency to make full use of multiple cores and IO bandwidth. (In the current release there's still room to add more concurrency.)

  • Portable: Conserve is tested on Windows, Linux (x86 and ARM), and OS X.

Quick start guide

Conserve storage is within an archive directory created by conserve init:

conserve init /backup/home.cons

conserve backup copies a source directory into a new version within the archive. Conserve copies files, directories, and (on Unix) symlinks. If the conserve backup command completes successfully (copying the whole source tree), the backup is considered complete.

conserve backup /backup/home.cons ~ --exclude /.cache

conserve diff shows what's different between an archive and a source directory. It should typically be given the same --exclude options as were used to make the backup.

conserve diff /backup/home.cons ~ --exclude /.cache

conserve versions lists the versions in an archive, showing whether each backup is complete, the time at which it started, and the time taken to complete it. Each version is identified by a name starting with b.

$ conserve versions /backup/home.cons
b0000                      complete   2016-11-19T07:30:09+11:00     71s
b0001                      incomplete 2016-11-20T06:26:46+11:00
b0002                      incomplete 2016-11-20T06:30:45+11:00
b0003                      complete   2016-11-20T06:42:13+11:00    286s
b0004                      complete   2016-12-01T07:08:48+11:00     84s
b0005                      complete   2016-12-18T02:43:59+11:00      4s

conserve ls shows all the files in a particular version. Like all commands that read a band from an archive, it operates on the most recent by default, and you can specify a different version using -b. (You can also omit leading zeros from the backup version.)

conserve ls -b b0 /backup/home.cons | less

conserve restore copies a version back out of an archive:

conserve restore /backup/home.cons /tmp/trial-restore

conserve validate checks the integrity of an archive:

conserve validate /backup/home.cons

conserve delete deletes specific named backups from an archive:

conserve delete /backup/home.cons -b b1

Exclusions

The --exclude GLOB option can be given to commands that operate on files, including backup, restore, ls and list-source.

A / at the start of the exclusion pattern anchors it to the top of the backup tree (not the root of the filesystem). ** recursively matches any number of directories. *.o matches anywhere in the tree.

--exclude-from reads exclusion patterns from a file, one per line, ignoring leading and trailing whitespace, and skipping comment lines that start with a #.
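
As an illustration (the file name and patterns here are only examples, not defaults), an exclusions file might contain:

# object files anywhere in the tree
*.o
# caches, anchored to the top of the backup tree
/.cache
/**/tmp

and be passed to a backup with:

conserve backup /backup/home.cons ~ --exclude-from /backup/excludes.txt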

The syntax comes from the Rust globset crate.

Directories marked with CACHEDIR.TAG are automatically excluded from backups.

S3 support

Since release 23.9, Conserve supports storing backups in Amazon S3. AWS IAM credentials are read from the standard sources: the environment, the config file, or, on EC2, the instance metadata service.

S3 support can be turned off by passing --no-default-features to cargo install. (There's no runtime impact if it is not used, but it does add a lot of build-time dependencies.)

To use this, just specify an S3 URL for the archive location. The bucket must already exist.

conserve init s3://my-bucket/
conserve backup s3://my-bucket/ ~

Files are written in the INTELLIGENT_TIERING storage class.

(This should work on API-compatible services but has not been tested; experience reports are welcome.)

Install

To build Conserve you need Rust and a C compiler that can be used by Rust.

To install the most recent release from crates.io, run

cargo install conserve

To install from a git checkout, run

cargo install -f --path .

On nightly Rust only, and only on x86_64, you can enable a slight speed-up with

cargo +nightly install -f --path . --features blake2-rfc/simd_asm

Arch Linux

To install from the available AUR packages, use an AUR helper:

yay -S conserve

More documentation

Performance on Windows

Windows Defender and Windows Search Indexing can severely slow down any program that does intensive file IO, including Conserve. I recommend you exclude the backup directory from both systems.

Project status

Conserve is at a reasonable level of maturity; the format is stable and the basic features are complete. I have used it as a primary backup system for over a year. There is still room for several performance improvements and features.

The current data format (called "0.6") will be readable by future releases for at least two years.

Be aware that Conserve is developed as a part-time, non-commercial project and there's no guarantee of support or reliability. Bug reports are welcome, but I cannot promise they will be resolved within any particular time frame.

Licence and non-warranty

Copyright 2012-2023 Martin Pool.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

conserve's Issues

Review storage design for better deduplication support

Because I really like the deduplication properties of Borg/Attic, I've been thinking about the data storage pool of conserve. Specifically, how the ability to only reference data in the parent backup limits the extent of deduplication possible.

Because I've thought about this for long enough, I'm dumping a mostly-formed idea here rather than endlessly iterating in 5-minute increments locally 😄

I think this preserves Conserve's append-only approach (modulo the file renaming involved) and human-recoverableness. I also think it's safely lock-free (in the sense that multiple conserve operations can be safely performed in parallel on the same storage pool without relying on filesystem locking).

Assumptions:

  • Filesystem renames are cheap.
  • Filesystem renames cannot lose file data.
  • Filesystem will handle large numbers of empty files reasonably efficiently.
  • There's a rmdir() or equivalent that will only remove empty directories.
  • Some basic filesystem ordering properties:

  1. If a file a is created before a file b, then a process will never observe a directory listing containing b but not a.

Backup Sequence

Backup 0012 wants to store a data block with SHA256 $HASH. It:

  1. Creates an empty file pool/$HASH/0012.in-progress.
  2. Searches the directory listing for data-*:
    i. If data-$GEN_NUMBER is found, rename() data-$GEN_NUMBER to data-($GEN_NUMBER + 1).
    ii. If no data-* is found, write the block data to a temporary file, then rename() it to data-0001.
  3. rename()s 0012.in-progress to 0012.

(A rough Rust sketch of this sequence appears after the failure conditions below.)

Failure conditions:

  • If (1) fails, then a block with hash $HASH does not already exist;
    Backup 0012 creates the directory, then continues to (2 ii).
  • If (2 i) fails then either the block data has been deleted or another
    backup process has moved it. In either case, restart at (2).
  • If (2 ii) fails then another backup process has already started writing
    the data. Restart at (2).
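
As a concrete illustration, here is one possible reading of the sequence above in Rust, using only std::fs and glossing over most of the failure handling. This is a sketch of the proposal, not Conserve's implementation:

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Sketch of storing one block for backup `backup_id` (e.g. "0012"),
// where `hash` is the block's SHA-256 hex digest.
fn store_block(pool: &Path, hash: &str, backup_id: &str, data: &[u8]) -> io::Result<()> {
    let block_dir = pool.join(hash);
    let marker = block_dir.join(format!("{backup_id}.in-progress"));

    // (1) Create the empty in-progress marker; if that fails because the
    //     directory doesn't exist, no block with this hash exists yet.
    if File::create(&marker).is_err() {
        fs::create_dir_all(&block_dir)?;
        File::create(&marker)?;
    }

    // (2) Bump the generation of an existing data-* file, or write data-0001.
    loop {
        let existing = fs::read_dir(&block_dir)?
            .filter_map(|e| e.ok())
            .map(|e| e.file_name().to_string_lossy().into_owned())
            .find(|name| name.starts_with("data-"));
        match existing {
            Some(name) => {
                // (2 i) data-$GEN_NUMBER found: rename to data-($GEN_NUMBER + 1).
                let gen_number: u32 = name["data-".len()..].parse().unwrap_or(1);
                let next = block_dir.join(format!("data-{:04}", gen_number + 1));
                if fs::rename(block_dir.join(&name), next).is_ok() {
                    break;
                }
                // The data file was deleted or moved by another process: restart (2).
            }
            None => {
                // (2 ii) No data-* yet: write to a temporary file, then rename.
                let tmp = block_dir.join(format!("tmp-{backup_id}"));
                File::create(&tmp)?.write_all(data)?;
                if fs::rename(&tmp, block_dir.join("data-0001")).is_ok() {
                    break;
                }
                // Another process beat us to it: restart (2).
            }
        }
    }

    // (3) Rename the marker to record that this backup references the block.
    fs::rename(&marker, block_dir.join(backup_id))
}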

Deletion sequence

To delete backup 0070, Conserve starts with the list of blocks from the backup
description. For each block $HASH, it:

  1. Gets the directory listing for pool/$HASH.
  2. Deletes pool/$HASH/0070.
  3. If this was the last non-data-* file in pool/$HASH, according to the
    precomputed directory listing, then:
    i. Delete the data-* files present in the listing.
    ii. rmdir() pool/$HASH.

Failure conditions:

  • If (1) fails, then this block has already been deleted, possibly by a
    previously interrupted delete operation. We need do nothing.
  • If (2) fails, then this block has already been processed for this
    delete operation. We need do nothing.
  • If (3 i) fails, continue to (3 ii).
  • If (3 ii) fails because the directory is not empty, then we're racing
    a backup operation. Do nothing.
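
And a matching sketch of the deletion sequence, with the same caveats (an illustration of the proposal, not real Conserve code):

use std::fs;
use std::io;
use std::path::Path;

// Sketch of removing backup `backup_id`'s reference to one block.
fn delete_block_ref(pool: &Path, hash: &str, backup_id: &str) -> io::Result<()> {
    let block_dir = pool.join(hash);

    // (1) Take the directory listing up front; if the directory is already
    //     gone, the block was deleted by an earlier interrupted run.
    let listing = match fs::read_dir(&block_dir) {
        Ok(rd) => rd,
        Err(_) => return Ok(()),
    };
    let names: Vec<String> = listing
        .filter_map(|e| e.ok())
        .map(|e| e.file_name().to_string_lossy().into_owned())
        .collect();

    // (2) Delete this backup's reference file; if it's already gone, do nothing.
    let _ = fs::remove_file(block_dir.join(backup_id));

    // (3) If this was the last non-data-* file in the precomputed listing,
    //     delete the data files and then the (hopefully empty) directory.
    let others = names
        .iter()
        .any(|n| n.as_str() != backup_id && !n.starts_with("data-"));
    if !others {
        for n in names.iter().filter(|n| n.starts_with("data-")) {
            let _ = fs::remove_file(block_dir.join(n.as_str()));
        }
        // Fails harmlessly if a concurrent backup has recreated files here.
        let _ = fs::remove_dir(&block_dir);
    }
    Ok(())
}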

Core Dump

Hi Martin,

I installed conserve per the instructions on Ubuntu 13.10. I did autogen, configure, make, make check (everything passed), and sudo make install. I then did conserve init /media/joey/joey-work/conserve and it created the directory with the binary file.

I then ran this:

joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:19:39.968262 17188 bzdatawriter.cc:61] Check failed: bytes_read > 0 : Is a directory [21]
*** Check failure stack trace: ***
@ 0x7fd475d17daa (unknown)
@ 0x7fd475d17ce4 (unknown)
@ 0x7fd475d176e6 (unknown)
@ 0x7fd475d174fb (unknown)
@ 0x7fd475d18477 (unknown)
@ 0x41224e conserve::BzDataWriter::store_file()
@ 0x411aaa conserve::BlockWriter::add_file()
@ 0x407d49 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7fd475019de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)

In the backup directory I have a folder called b0000 now. If I call that command again:

joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:16:57.847147 17149 util.cc:46] Check failed: fd > 0 : File exists [17]
*** Check failure stack trace: ***
@ 0x7f867546fdaa (unknown)
@ 0x7f867546fce4 (unknown)
@ 0x7f867546f6e6 (unknown)
@ 0x7f867546f4fb (unknown)
@ 0x7f8675470477 (unknown)
@ 0x41544b conserve::write_proto_to_file()
@ 0x40911a conserve::BandWriter::start()
@ 0x406d75 conserve::Archive::start_band()
@ 0x407d06 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7f8674771de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)

Joey

break into blocks on whole files

Start a new file when:

  • current block is over a maximum compressed byte size, or
  • number of files in the block is over a particular size (correlated to an index block size limit)

Will require updating read code to handle multiple blocks.
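
In code, the cut decision might look something like this (the constant names and values are purely illustrative, not Conserve's):

// Hypothetical limits; real values would be tuned against the index block size.
const MAX_COMPRESSED_BYTES: usize = 1 << 20;
const MAX_FILES_PER_BLOCK: usize = 1_000;

fn should_start_new_block(compressed_bytes: usize, files_in_block: usize) -> bool {
    compressed_bytes > MAX_COMPRESSED_BYTES || files_in_block > MAX_FILES_PER_BLOCK
}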

Cope with race creating band

ERROR: File exists (os error 17)
stack backtrace:
   0:        0x100e351be - backtrace::backtrace::trace::h1b789d8e1542dad0
   1:        0x100e358fc - backtrace::capture::Backtrace::new::h3d062d22ca4dc5a6
   2:        0x100e34d36 - error_chain::make_backtrace::h48f950ecdd86ad84
   3:        0x100e34ded - _$LT$error_chain..State$u20$as$u20$core..default..Default$GT$::default::he14c64e0c2fc5de7
   4:        0x100e0a3bc - conserve::band::Band::create::h1d318051cfc1c04f
   5:        0x100e07ca4 - conserve::archive::Archive::create_band::h342b7c2956161944
   6:        0x100e088ba - conserve::backup::backup::hca5a5659f78774db
   7:        0x100d9f81a - conserve::backup::h055dd3b66439ec1d
   8:        0x100d9edd7 - conserve::main::h53c12c360c8dc5eb
   9:        0x100e8c38a - __rust_maybe_catch_panic
  10:        0x100e8b906 - std::rt::lang_start::ha9be7b379cf1665e

switch to cap'n proto?

http://kentonv.github.io/capnproto

pros:

  • has proto-to-text and text-to-proto which could be nice for debugging or testing
  • more compact encoding? (may not make a big difference after gzip)
  • less memory usage than protos? (probably also not a big deal because we want to be streaming through blocks, but it can't hurt)
  • actively-maintained open implementation - but, protobuf is still seeing new releases and is widely used

cons:

  • it does require gcc >= 4.7 which is pretty new: not in Ubuntu Precise but is in Debian Stable
  • capnproto packages themselves have limited availability; not in Debian Jessie
  • API is a little harder?
  • maybe more likely to have API churn?
  • protobuf C++ is fast enough for Google (maybe not quite the same code)

irrelevant:

  • zero serialization work: not a big deal, unlikely to be a dominant factor for Conserve
  • does have an RPC system

recurse through source

At the moment you can't give a source directory: you must name all the files in order.
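
A recursive walk over the source would be roughly this shape (a sketch only; real code would also handle symlinks, exclusions, and error reporting):

use std::fs;
use std::io;
use std::path::Path;

fn walk(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if entry.file_type()?.is_dir() {
            walk(&path)?; // recurse into subdirectories
        } else {
            println!("{}", path.display()); // would hand the file to the backup writer
        }
    }
    Ok(())
}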

`diff` command

Extract files that differ from the tree (or maybe between two versions) and run a command on them, or just say which ones differ.

delete named versions

conserve delete -b b1234 /backup/home.c6
  • allow specifying multiple bands
  • gc after deletion
  • but have an option to not gc

validate backup

conserve validate exists and checks some invariants, but there are others that could usefully be checked.

Split cli to a separate sub crate?

This would probably get clippy working and might work better with Cargo dependencies. A sub crate might break simple cargo install though.

glob select files to restore

This is somewhat handled by --exclude on restore but a specific option to select the things you do want would be more straightforward.

remove band head files?

Do we really need band head and tail files? We assume we can list the directory. Do they store any new information as distinct from just the presence of data blocks?

The tail does let us more quickly tell that the band is complete: this could be in the final data block but that would be slower to read.

Do we need to do anything to detect a concurrent attempt to write the same band or block? Maybe easiest to just detect the written data block already exists.

idea: loose storage of large files

A large fraction of data backed up probably will be fairly large and incompressible: jpgs, mpgs, previously-compressed archives, etc.

Trying to compress them will just waste CPU time on insertion and removal.

For files above our minimum granularity goal, combining them with others will only slow retrieval.

Storing them as objects with no compression conceivably makes recovery from a badly broken archive easier. However it also runs some risk that they might be directly seen and edited inside the archive, leading to corruption.

We probably don't want a separate band header for every such loose object, since the band heads will be very small and the headers too choppy.

Therefore perhaps we want multiple data files per band, which might be aggregates, or might be single objects. We could reorder the small/compressible files to be inside the aggregate.

Restore mtime

  • store mtimes
  • restore mtime on plain files
  • add tests for restoring mtimes
  • fix old-version archives to all use the same mtimes
  • check mtimes of directories restored from old-version archives
  • restore mtime of symlinks
  • restore mtime of directories, after creating all entries directly inside them
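
For reference, one way to set an mtime portably is the filetime crate; this is just an assumption for illustration, not necessarily what Conserve uses:

use filetime::{set_file_mtime, FileTime};
use std::io;
use std::path::Path;

// Apply a stored Unix mtime (seconds + nanoseconds) to a restored file.
fn restore_mtime(path: &Path, unix_seconds: i64, nanos: u32) -> io::Result<()> {
    set_file_mtime(path, FileTime::from_unix_time(unix_seconds, nanos))
}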

Parallelize compression

Do block compression on a thread pool, writing files as they complete.

I'm assuming here that compression is the expensive operation we want to parallelize, but depending on the observed balance of time possibly hashing could be parallelized too.

Since the index is sorted we need to accept completed blocks in order by the filename that includes them. So this is complementary to, and related to, storing content from multiple files in a single block. Although we know the hash in advance, we shouldn't write the index until the block data has been written.

Because blocks are of limited size we can keep them entirely in memory, and we can just transfer ownership of the block to the worker thread.

Run ~n_cpus compression worker threads.

One main thread reads files, hashes them, accumulates data into blocks. Push the blocks onto a queue for compression, along with a channel through which the worker can indicate that it's complete.

The compression worker thread returns a Result<()>.

pseudocode:

pending_files = Deque()
for file in source:
  # If the compression queue is full, wait on the oldest pending file's first block.
  while queue.length > N:
    pending_files[0].channels[0].wait()
  # Files whose blocks have all been compressed can be added to the index, in order.
  while pending_files and pending_files[0].channels.all_complete():
    index.add(pending_files.pop(0))
  file_completion_channels = []
  for block in pack_blocks_for_file(file):
    block_channel = channel()
    file_completion_channels.append(block_channel)
    queue.push(block, block_channel)
  pending_files.push(file, file_completion_channels)
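
A minimal Rust sketch of the worker-pool shape described above, using standard-library channels and a placeholder compress function (an assumption-laden illustration, not Conserve's code):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder: stands in for the real block compressor.
fn compress(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

fn main() {
    let n_workers = 4;
    let (block_tx, block_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let block_rx = Arc::new(Mutex::new(block_rx));
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<u8>)>();

    // ~n_cpus compression workers, each pulling blocks off the shared queue.
    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let block_rx = Arc::clone(&block_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // Take one block; the lock is released before compressing.
                let msg = block_rx.lock().unwrap().recv();
                match msg {
                    Ok((id, block)) => done_tx.send((id, compress(&block))).unwrap(),
                    Err(_) => break, // queue closed: no more blocks
                }
            })
        })
        .collect();
    drop(done_tx);

    // Main thread: read files, hash them, pack blocks (all elided here),
    // and push each block onto the compression queue.
    for id in 0..8usize {
        block_tx.send((id, vec![0u8; 1024])).unwrap();
    }
    drop(block_tx);

    // Collect completed blocks; ordering by block/file would happen before
    // the corresponding index entries are written.
    for (id, compressed) in done_rx {
        println!("block {}: {} compressed bytes", id, compressed.len());
    }
    for w in workers {
        w.join().unwrap();
    }
}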

Pre-count tree before backup

Just counting the number of files or maybe their total size will let us show percent completion for the whole backup. If the tree changes in between this preview and the actual backup, progress won't be absolutely accurate but that's fine.

This will take some time so should be optional.
