conserve's Introduction

Conserve: a robust backup program

https://github.com/sourcefrog/conserve/

Maturity: Beta

Conserve's guiding principles:

  • Safe: Conserve is written in Rust, a fast systems programming language with compile-time guarantees about types, memory safety, and concurrency. Conserve uses a conservative log-structured format.

  • Robust: If one file is corrupted in storage or due to a bug in Conserve, or if the backup is interrupted, you can still restore what was written. (Conserve doesn't need a large transaction to complete for data to be accessible.)

  • Careful: Backup data files are never touched or altered after they're written, unless you choose to purge them.

  • When you need help now: Restoring a subset of a large backup is fast, because it doesn't require reading the whole backup.

  • Always making progress: Even if the backup process or its network connection is repeatedly killed, Conserve can quickly pick up where it left off and make forward progress.

  • Ready for today: The storage format is fast and reliable on high-latency, limited-capability, unlimited-capacity, eventually-consistent cloud object storage.

  • Fast: Conserve exploits Rust's fearless concurrency to make full use of multiple cores and IO bandwidth. (In the current release there's still room to add more concurrency.)

  • Portable: Conserve is tested on Windows, Linux (x86 and ARM), and OS X.

Quick start guide

Conserve storage is within an archive directory created by conserve init:

conserve init /backup/home.cons

conserve backup copies a source directory into a new version within the archive. Conserve copies files, directories, and (on Unix) symlinks. If the conserve backup command completes successfully (copying the whole source tree), the backup is considered complete.

conserve backup /backup/home.cons ~ --exclude /.cache

conserve diff shows what's different between an archive and a source directory. It should typically be given the same --exclude options as were used to make the backup.

conserve diff /backup/home.cons ~ --exclude /.cache

conserve versions lists the versions in an archive, showing whether each backup is complete, the time at which it started, and the time taken to complete it. Each version is identified by a name starting with b.

$ conserve versions /backup/home.cons
b0000                      complete   2016-11-19T07:30:09+11:00     71s
b0001                      incomplete 2016-11-20T06:26:46+11:00
b0002                      incomplete 2016-11-20T06:30:45+11:00
b0003                      complete   2016-11-20T06:42:13+11:00    286s
b0004                      complete   2016-12-01T07:08:48+11:00     84s
b0005                      complete   2016-12-18T02:43:59+11:00      4s

conserve ls shows all the files in a particular version. Like all commands that read a band from an archive, it operates on the most recent by default, and you can specify a different version using -b. (You can also omit leading zeros from the backup version.)

conserve ls -b b0 /backup/home.cons | less

conserve restore copies a version back out of an archive:

conserve restore /backup/home.cons /tmp/trial-restore

conserve validate checks the integrity of an archive:

conserve validate /backup/home.cons

conserve delete deletes specific named backups from an archive:

conserve delete /backup/home.cons -b b1

Exclusions

The --exclude GLOB option can be given to commands that operate on files, including backup, restore, ls and list-source.

A / at the start of the exclusion pattern anchors it to the top of the backup tree (not the root of the filesystem). ** recursively matches any number of directories. *.o matches anywhere in the tree.

--exclude-from reads exclusion patterns from a file, one per line, ignoring leading and trailing whitespace, and skipping comment lines that start with a #.
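
As an illustration (the file name and patterns here are only examples, not defaults), an exclusions file might contain:

# object files anywhere in the tree
*.o
# caches, anchored to the top of the backup tree
/.cache
/**/tmp

and be passed to a backup with:

conserve backup /backup/home.cons ~ --exclude-from /backup/excludes.txt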

The syntax comes from the Rust globset crate.

Directories marked with CACHEDIR.TAG are automatically excluded from backups.

S3 support

Since release 23.9, Conserve supports storing backups in Amazon S3. AWS IAM credentials are read from the standard sources: the environment, the config file, or, on EC2, the instance metadata service.

S3 support can be turned off by passing --no-default-features to cargo install. (There's no runtime impact if it is not used, but it does add a lot of build-time dependencies.)

To use this, just specify an S3 URL for the archive location. The bucket must already exist.

conserve init s3://my-bucket/
conserve backup s3://my-bucket/ ~

Files are written in the INTELLIGENT_TIERING storage class.

(This should work on API-compatible services but has not been tested; experience reports are welcome.)

Install

To build Conserve you need Rust and a C compiler that can be used by Rust.

To install the most recent release from crates.io, run

cargo install conserve

To install from a git checkout, run

cargo install -f --path .

On nightly Rust only, and only on x86_64, you can enable a slight speed-up with

cargo +nightly install -f --path . --features blake2-rfc/simd_asm

Arch Linux

To install from the available AUR packages, use an AUR helper:

yay -S conserve

More documentation

Performance on Windows

Windows Defender and Windows Search Indexing can severely slow down any program that does intensive file IO, including Conserve. I recommend you exclude the backup directory from both systems.

Project status

Conserve is at a reasonable level of maturity; the format is stable and the basic features are complete. I have used it as a primary backup system for over a year. There is still room for several performance improvements and features.

The current data format (called "0.6") will be readable by future releases for at least two years.

Be aware that Conserve is developed as a part-time, non-commercial project and there's no guarantee of support or reliability. Bug reports are welcome, but I cannot promise they will be resolved within any particular time frame.

Licence and non-warranty

Copyright 2012-2023 Martin Pool.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

conserve's Issues

Review storage design for better deduplication support

Because I really like the deduplication properties of Borg/Attic, I've been thinking about the data storage pool of conserve. Specifically, how the ability to only reference data in the parent backup limits the extent of deduplication possible.

Because I've thought about this for long enough, I'm dumping a mostly-formed idea here rather than endlessly iterating in 5-minute increments locally 😄

I think this preserves Conserve's append-only approach (modulo the file renaming involved) and human-recoverableness. I also think it's safely lock-free (in the sense that multiple conserve operations can be safely performed in parallel on the same storage pool without relying on filesystem locking).

Assumptions:

  • Filesystem renames are cheap.
  • Filesystem renames cannot lose file data.
  • Filesystem will handle large numbers of empty files reasonably efficiently.
  • There's a rmdir() or equivalent that will only remove empty directories.
  • Some basic filesystem ordering properties:

  1. If a file a is created before a file b, then a process will never observe a directory listing containing b but not a.

Backup Sequence

Backup 0012 wants to store a data block with SHA256 $HASH. It:

  1. Creates an empty file pool/$HASH/0012.in-progress.
  2. Searches the directory listing for data-*:
    i. If data-$GEN_NUMBER is found, rename() data-$GEN_NUMBER to data-($GEN_NUMBER + 1).
    ii. If no data-* is found, write the block data to a temporary file, then rename() it to data-0001.
  3. rename()s 0012.in-progress to 0012.

(A rough Rust sketch of this sequence appears after the failure conditions below.)

Failure conditions:

  • If (1) fails, then a block with hash $HASH does not already exist;
    Backup 0012 creates the directory, then continues to (2 ii).
  • If (2 i) fails then either the block data has been deleted or another
    backup process has moved it. In either case, restart at (2).
  • If (2 ii) fails then another backup process has already started writing
    the data. Restart at (2).
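
As a concrete illustration, here is one possible reading of the sequence above in Rust, using only std::fs and glossing over most of the failure handling. This is a sketch of the proposal, not Conserve's implementation:

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Sketch of storing one block for backup `backup_id` (e.g. "0012"),
// where `hash` is the block's SHA-256 hex digest.
fn store_block(pool: &Path, hash: &str, backup_id: &str, data: &[u8]) -> io::Result<()> {
    let block_dir = pool.join(hash);
    let marker = block_dir.join(format!("{backup_id}.in-progress"));

    // (1) Create the empty in-progress marker; if that fails because the
    //     directory doesn't exist, no block with this hash exists yet.
    if File::create(&marker).is_err() {
        fs::create_dir_all(&block_dir)?;
        File::create(&marker)?;
    }

    // (2) Bump the generation of an existing data-* file, or write data-0001.
    loop {
        let existing = fs::read_dir(&block_dir)?
            .filter_map(|e| e.ok())
            .map(|e| e.file_name().to_string_lossy().into_owned())
            .find(|name| name.starts_with("data-"));
        match existing {
            Some(name) => {
                // (2 i) data-$GEN_NUMBER found: rename to data-($GEN_NUMBER + 1).
                let gen_number: u32 = name["data-".len()..].parse().unwrap_or(1);
                let next = block_dir.join(format!("data-{:04}", gen_number + 1));
                if fs::rename(block_dir.join(&name), next).is_ok() {
                    break;
                }
                // The data file was deleted or moved by another process: restart (2).
            }
            None => {
                // (2 ii) No data-* yet: write to a temporary file, then rename.
                let tmp = block_dir.join(format!("tmp-{backup_id}"));
                File::create(&tmp)?.write_all(data)?;
                if fs::rename(&tmp, block_dir.join("data-0001")).is_ok() {
                    break;
                }
                // Another process beat us to it: restart (2).
            }
        }
    }

    // (3) Rename the marker to record that this backup references the block.
    fs::rename(&marker, block_dir.join(backup_id))
}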

Deletion sequence

To delete backup 0070, Conserve starts with the list of blocks from the backup
description. For each block $HASH, it:

  1. Gets the directory listing for pool/$HASH.
  2. Deletes pool/$HASH/0070.
  3. If this was the last non-data-* file in pool/$HASH, according to the
    precomputed directory listing, then:
    i. Delete the data-* files present in the listing.
    ii. rmdir() pool/$HASH.

Failure conditions:

  • If (1) fails, then this block has already been deleted, possibly by a
    previously interrupted delete operation. We need do nothing.
  • If (2) fails, then this block has already been processed for this
    delete operation. We need do nothing.
  • If (3 i) fails, continue to (3 ii).
  • If (3 ii) fails because the directory is not empty, then we're racing
    a backup operation. Do nothing.
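
And a matching sketch of the deletion sequence, with the same caveats (an illustration of the proposal, not real Conserve code):

use std::fs;
use std::io;
use std::path::Path;

// Sketch of removing backup `backup_id`'s reference to one block.
fn delete_block_ref(pool: &Path, hash: &str, backup_id: &str) -> io::Result<()> {
    let block_dir = pool.join(hash);

    // (1) Take the directory listing up front; if the directory is already
    //     gone, the block was deleted by an earlier interrupted run.
    let listing = match fs::read_dir(&block_dir) {
        Ok(rd) => rd,
        Err(_) => return Ok(()),
    };
    let names: Vec<String> = listing
        .filter_map(|e| e.ok())
        .map(|e| e.file_name().to_string_lossy().into_owned())
        .collect();

    // (2) Delete this backup's reference file; if it's already gone, do nothing.
    let _ = fs::remove_file(block_dir.join(backup_id));

    // (3) If this was the last non-data-* file in the precomputed listing,
    //     delete the data files and then the (hopefully empty) directory.
    let others = names
        .iter()
        .any(|n| n.as_str() != backup_id && !n.starts_with("data-"));
    if !others {
        for n in names.iter().filter(|n| n.starts_with("data-")) {
            let _ = fs::remove_file(block_dir.join(n.as_str()));
        }
        // Fails harmlessly if a concurrent backup has recreated files here.
        let _ = fs::remove_dir(&block_dir);
    }
    Ok(())
}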

Core Dump

Hi Martin,

I installed conserve per the instructions on Ubuntu 13.10. I did autogen, configure, make, make check (everything passed), and sudo make install. I then did conserve init /media/joey/joey-work/conserve and it created the directory with the binary file.

I then ran this:

joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:19:39.968262 17188 bzdatawriter.cc:61] Check failed: bytes_read > 0 : Is a directory [21]
*** Check failure stack trace: ***
@ 0x7fd475d17daa (unknown)
@ 0x7fd475d17ce4 (unknown)
@ 0x7fd475d176e6 (unknown)
@ 0x7fd475d174fb (unknown)
@ 0x7fd475d18477 (unknown)
@ 0x41224e conserve::BzDataWriter::store_file()
@ 0x411aaa conserve::BlockWriter::add_file()
@ 0x407d49 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7fd475019de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)

In the backup directory I have a folder called b0000 now. If I call that command again:

joey@warthog:~$ conserve backup /home/joey/ /media/joey/joey-work/conserve/
F1121 12:16:57.847147 17149 util.cc:46] Check failed: fd > 0 : File exists [17]
*** Check failure stack trace: ***
@ 0x7f867546fdaa (unknown)
@ 0x7f867546fce4 (unknown)
@ 0x7f867546f6e6 (unknown)
@ 0x7f867546f4fb (unknown)
@ 0x7f8675470477 (unknown)
@ 0x41544b conserve::write_proto_to_file()
@ 0x40911a conserve::BandWriter::start()
@ 0x406d75 conserve::Archive::start_band()
@ 0x407d06 conserve::cmd_backup()
@ 0x4125d0 conserve::run_command_line()
@ 0x406629 conserve::main()
@ 0x7f8674771de5 (unknown)
@ 0x4068ee (unknown)
@ (nil) (unknown)
Aborted (core dumped)

Joey

break into blocks on whole files

Start a new file when:

  • current block is over a maximum compressed byte size, or
  • number of files in the block is over a particular size (correlated to an index block size limit)

Will require updating read code to handle multiple blocks.
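
In code, the cut decision might look something like this (the constant names and values are purely illustrative, not Conserve's):

// Hypothetical limits; real values would be tuned against the index block size.
const MAX_COMPRESSED_BYTES: usize = 1 << 20;
const MAX_FILES_PER_BLOCK: usize = 1_000;

fn should_start_new_block(compressed_bytes: usize, files_in_block: usize) -> bool {
    compressed_bytes > MAX_COMPRESSED_BYTES || files_in_block > MAX_FILES_PER_BLOCK
}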

Cope with race creating band

ERROR: File exists (os error 17)
stack backtrace:
   0:        0x100e351be - backtrace::backtrace::trace::h1b789d8e1542dad0
   1:        0x100e358fc - backtrace::capture::Backtrace::new::h3d062d22ca4dc5a6
   2:        0x100e34d36 - error_chain::make_backtrace::h48f950ecdd86ad84
   3:        0x100e34ded - _$LT$error_chain..State$u20$as$u20$core..default..Default$GT$::default::he14c64e0c2fc5de7
   4:        0x100e0a3bc - conserve::band::Band::create::h1d318051cfc1c04f
   5:        0x100e07ca4 - conserve::archive::Archive::create_band::h342b7c2956161944
   6:        0x100e088ba - conserve::backup::backup::hca5a5659f78774db
   7:        0x100d9f81a - conserve::backup::h055dd3b66439ec1d
   8:        0x100d9edd7 - conserve::main::h53c12c360c8dc5eb
   9:        0x100e8c38a - __rust_maybe_catch_panic
  10:        0x100e8b906 - std::rt::lang_start::ha9be7b379cf1665e

switch to cap'n proto?

http://kentonv.github.io/capnproto

pros:

  • has proto-to-text and text-to-proto which could be nice for debugging or testing
  • more compact encoding? (may not make a big difference after gzip)
  • less memory usage than protos? (probably also not a big deal because we want to be streaming through blocks, but it can't hurt)
  • actively-maintained open implementation - but, protobuf is still seeing new releases and is widely used

cons:

  • it does require gcc >= 4.7 which is pretty new: not in Ubuntu Precise but is in Debian Stable
  • capnproto packages themselves have limited availability; not in Debian Jessie
  • API is a little harder?
  • maybe more likely to have API churn?
  • protobuf C++ is fast enough for Google (maybe not quite the same code)

irrelevant:

  • zero serialization work: not a big deal, unlikely to be a dominant factor for Conserve
  • does have an RPC system

recurse through source

At the moment you can't give a source directory: you must name all the files in order.
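
A recursive walk over the source would be roughly this shape (a sketch only; real code would also handle symlinks, exclusions, and error reporting):

use std::fs;
use std::io;
use std::path::Path;

fn walk(dir: &Path) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        if entry.file_type()?.is_dir() {
            walk(&path)?; // recurse into subdirectories
        } else {
            println!("{}", path.display()); // would hand the file to the backup writer
        }
    }
    Ok(())
}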

`diff` command

Extract files that differ from the tree (or maybe between two versions) and run a command on them, or just say which ones differ.

delete named versions

conserve delete -b b1234 /backup/home.c6
  • allow specifying multiple bands
  • gc after deletion
  • but have an option to not gc

validate backup

conserve validate exists and checks some invariants, but there are others that could usefully be checked.

Split cli to a separate sub crate?

This would probably get clippy working and might work better with Cargo dependencies. A sub crate might break simple cargo install though.

glob select files to restore

This is somewhat handled by --exclude on restore but a specific option to select the things you do want would be more straightforward.

remove band head files?

Do we really need band head and tail files? We assume we can list the directory. Do they store any new information as distinct from just the presence of data blocks?

The tail does let us more quickly tell that the band is complete: this could be in the final data block but that would be slower to read.

Do we need to do anything to detect a concurrent attempt to write the same band or block? Maybe easiest to just detect the written data block already exists.

idea: loose storage of large files

A large fraction of data backed up probably will be fairly large and incompressible: jpgs, mpgs, previously-compressed archives, etc.

Trying to compress them will just waste CPU time on insertion and removal.

For files above our minimum granularity goal, combining them with others will only slow retrieval.

Storing them as objects with no compression conceivably makes recovery from a badly broken archive easier. However it also runs some risk that they might be directly seen and edited inside the archive, leading to corruption.

We probably don't want a separate band header for every such loose object, since the band heads will be very small and the headers too choppy.

Therefore perhaps we want multiple data files per band, which might be aggregates, or might be single objects. We could reorder the small/compressible files to be inside the aggregate.

Restore mtime

  • store mtimes
  • restore mtime on plain files
  • add tests for restoring mtimes
  • fix old-version archives to all use the same mtimes
  • check mtimes of directories restored from old-version archives
  • restore mtime of symlinks
  • restore mtime of directories, after creating all entries directly inside them
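
For reference, one way to set an mtime portably is the filetime crate; this is just an assumption for illustration, not necessarily what Conserve uses:

use filetime::{set_file_mtime, FileTime};
use std::io;
use std::path::Path;

// Apply a stored Unix mtime (seconds + nanoseconds) to a restored file.
fn restore_mtime(path: &Path, unix_seconds: i64, nanos: u32) -> io::Result<()> {
    set_file_mtime(path, FileTime::from_unix_time(unix_seconds, nanos))
}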

Parallelize compression

Do block compression on a thread pool, writing files as they complete.

I'm assuming here that compression is the expensive operation we want to parallelize, but depending on the observed balance of time possibly hashing could be parallelized too.

Since the index is sorted we need to accept completed blocks in order by the filename that includes them. So this is complementary to, and related to, storing content from multiple files in a single block. Although we know the hash in advance, we shouldn't write the index until the block data has been written.

Because blocks are of limited size we can keep them entirely in memory, and we can just transfer ownership of the block to the worker thread.

Run ~n_cpus compression worker threads.

One main thread reads files, hashes them, accumulates data into blocks. Push the blocks onto a queue for compression, along with a channel through which the worker can indicate that it's complete.

The compression worker thread returns a Result<()>.

pseudocode:

pending_files = Deque()
for file in source:
  # If the compression queue is full, wait on the oldest pending file's first block.
  while queue.length > N:
    pending_files[0].channels[0].wait()
  # Files whose blocks have all been compressed can be added to the index, in order.
  while pending_files and pending_files[0].channels.all_complete():
    index.add(pending_files.pop(0))
  file_completion_channels = []
  for block in pack_blocks_for_file(file):
    block_channel = channel()
    file_completion_channels.append(block_channel)
    queue.push(block, block_channel)
  pending_files.push(file, file_completion_channels)
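
A minimal Rust sketch of the worker-pool shape described above, using standard-library channels and a placeholder compress function (an assumption-laden illustration, not Conserve's code):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Placeholder: stands in for the real block compressor.
fn compress(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

fn main() {
    let n_workers = 4;
    let (block_tx, block_rx) = mpsc::channel::<(usize, Vec<u8>)>();
    let block_rx = Arc::new(Mutex::new(block_rx));
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<u8>)>();

    // ~n_cpus compression workers, each pulling blocks off the shared queue.
    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let block_rx = Arc::clone(&block_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // Take one block; the lock is released before compressing.
                let msg = block_rx.lock().unwrap().recv();
                match msg {
                    Ok((id, block)) => done_tx.send((id, compress(&block))).unwrap(),
                    Err(_) => break, // queue closed: no more blocks
                }
            })
        })
        .collect();
    drop(done_tx);

    // Main thread: read files, hash them, pack blocks (all elided here),
    // and push each block onto the compression queue.
    for id in 0..8usize {
        block_tx.send((id, vec![0u8; 1024])).unwrap();
    }
    drop(block_tx);

    // Collect completed blocks; ordering by block/file would happen before
    // the corresponding index entries are written.
    for (id, compressed) in done_rx {
        println!("block {}: {} compressed bytes", id, compressed.len());
    }
    for w in workers {
        w.join().unwrap();
    }
}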

Pre-count tree before backup

Just counting the number of files or maybe their total size will let us show percent completion for the whole backup. If the tree changes in between this preview and the actual backup, progress won't be absolutely accurate but that's fine.

This will take some time so should be optional.
