
backup-bench

Quick and dirty backup tool benchmark with reproducible results

**This is a one-page entry with benchmarks (see below); previous versions are available via git versioning.**

What

This repo aims to compare different backup solutions, namely borg (stable and beta), bupstash, duplicacy, kopia and restic.

The idea is to have a script that executes all backup programs on the same datasets.

We'll use a quite big (and popular) git repo as the first dataset so results can be reproduced by checking out branches (and ignoring the .git directory). I'll also use another (non-public) dataset consisting of qcow2 files that are in use.

Time spent by the backup program is measured by the script so we get results that are as accurate as possible (time is measured from process start to process end, with a 1-second granularity).
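
For illustration, a minimal sketch of this kind of measurement (the actual backup-bench script differs; the backup command below is just a placeholder):

```sh
# Sketch: wall-clock seconds measured from process start to process end.
start=$(date +%s)
borg create /path/to/repo::snap1 /opt/backup_test/linux   # placeholder backup command
end=$(date +%s)
echo "Backup took $((end - start)) seconds"
```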

While backups are running, cpu/memory/disk metrics are recorded so we know how "resource hungry" a backup program can be.

All backup programs are setup to use SSH in order to compare their performance regardless of the storage backend.

When available, we'll tune the encryption algorithm depending on the results of a benchmark. For instance, kopia has a kopia benchmark compression --data-file=/some/big/data/file option to find out which compression / crypto combination works best on the current architecture. This is REALLY NICE TO HAVE when choices need to be made with the current architecture in mind. As of the current tests, borg v2.0.0-b1 also has a borg benchmark cpu option.
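
For reference, the built-in benchmark commands mentioned above look like this (the data file path is a placeholder):

```sh
kopia benchmark compression --data-file=/some/big/data/file
borg benchmark cpu   # requires a borg 2.0.0 beta
```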

Why

I am currently using multiple backup programs to cover my needs. As of today, I use Graham Keeling's burp https://github.com/grke/burp to back up Windows machines, and borg backup to back up QEMU VM images. Graham decided to remove burp's deduplication (protocol 2) and stick with rsync-based backups (protocol 1), which isn't compatible with my backup strategy. I've also tried out bupstash, which I found to be quite quick, but which produces bigger backups remotely when dealing with small files (probably because of the chunk size?).

Anyway, I am searching for a good all-rounder, so I decided to give all the deduplicating backup solutions a try, and since I am configuring them all, I thought: why not make my results available to everyone, along with a script so everything can be reproduced easily.

As of today I use the script on my lab hypervisor, which runs AlmaLinux 8.6. The script should run on other distros, although I didn't test it.

I'll try to be as little biased as possible when doing my backup tests. If you feel that I didn't give a specific program enough attention, feel free to open an issue.

In depth comparison of backup solutions

Last update: 03 October 2022

| Backup software | Version |
|-----------------|---------|
| borg | 1.2.2 |
| borg beta | 2.0.0b2 |
| restic | 0.14.0 |
| kopia | 0.12.0 |
| bupstash | 0.11.1 |
| duplicacy | 2.7.2 |

The following list is my personal shopping list when it comes to backup solutions, and might not be complete; you're welcome to provide PRs to update it. ;)

| Goal | Functionality | borg | restic | kopia | bupstash | duplicacy |
|---|---|---|---|---|---|---|
| Reliability | Redundant index copies | ? | ? | Yes | Yes, redundant + sync | No indexes used |
| Reliability | Continue restore on bad blocks in repository | ? | ? | Yes (can ignore errors when restoring) | No | Yes, erasure coding |
| Reliability | Data checksumming | Yes (CRC & HMAC) | ? | No (Reed–Solomon in the works) | HMAC | Yes |
| Reliability | Backup coherency (detecting in-flight file changes while backing up) | Yes | Yes | ? | No | ? |
| Restoring Data | Backup mounting as filesystem | Yes | Yes | Yes | No | No |
| File management | File includes / excludes based on regexes | Yes | ? | ? | ? | Yes |
| File management | Supports backing up XATTRs | Yes | ? | No | Yes | ? |
| File management | Supports backing up ACLs | Yes | ? | No | Yes | ? |
| File management | Supports hardlink identification (hardlinked files are not stored multiple times) | No (borg2 will) | Yes | No | Yes | No |
| File management | Supports sparse files (thin-provisioned files on disk) | Yes | Yes | Yes | Yes | ? |
| File management | Can exclude CACHEDIR.TAG(3) directories | Yes | Yes | Yes | No | No |
| Dedup & compression efficiency | Is data compressed | Yes | Yes | Yes | Yes | Yes |
| Dedup & compression efficiency | Uses newer compression algorithms (e.g. zstd) | Yes | Yes | Yes | Yes | Yes |
| Dedup & compression efficiency | Can files be excluded from compression by extension | ? | No | Yes | No | No |
| Dedup & compression efficiency | Is data deduplicated | Yes | Yes | Yes | Yes | Yes |
| Platform support | Programming language | Python | Go | Go | Rust | Go |
| Platform support | Unix prebuilt binaries | Yes | Yes | Yes | No | Yes |
| Platform support | Windows support | Yes (WSL) | Yes | Yes | No | Yes |
| Platform support | Windows first class support (PE32 binary) | No | Yes | Yes | No | Yes |
| Platform support | Unix snapshot support where snapshot path prefix is removed | ? | ? | ? | ? | ? |
| Platform support | Windows VSS snapshot support where snapshot path prefix is removed | No | Yes | No, but pre-/post hook VSS script provided | No | Yes |
| WAN Support | Can backups be sent to a remote destination without keeping a local copy | Yes | Yes | Yes | Yes | Yes |
| WAN Support | What other remote backends are supported ? | rclone | (1) | (2) | None | (1) |
| Security | Are encryption protocols secure (AES-256-GCM / PolyChaCha / etc) ? | Yes, AES-256-GCM | Yes, AES-256 | Yes, AES-256-GCM or Chacha20Poly1305 | Yes, Chacha20Poly1305 | Yes, AES-256-GCM |
| Security | Is metadata encrypted too ? | ? | Yes | ? | Yes | Yes |
| Security | Can encrypted / compressed data be guessed (CRIME/BREACH style attacks) ? | No | No | ? | No (4) | ? |
| Security | Can a compromised client delete backups ? | No (append mode) | No (append mode) | Supports optional object locking | No (ssh restriction) | No (pubkey + immutable targets) |
| Security | Can a compromised client restore encrypted data ? | Yes | ? | ? | No | No (pubkey) |
| Security | Are pull backup scenarios possible ? | Yes | No | No | No, planned | ? |
| Misc | Does the backup software support pre/post execution hooks ? | ? | ? | Yes | No | Yes |
| Misc | Does the backup software provide an API for its client ? | Yes (JSON cmd) | No, but REST API on server | No, but REST API on server | No | No |
| Misc | Does the backup software provide an automatic GFS system ? | Yes | Yes | Yes | No | ? |
| Misc | Does the backup software provide a crypto benchmark ? | No, available in beta | No | Yes | Undocumented | No, generic benchmark |
| Misc | Can a repo be synchronized to another repo ? | ? | ? | Yes | Yes | Yes |
  • (1) SFTP/S3/Wasabi/B2/Aliyun/Swift/Azure/Google Cloud
  • (2) SFTP/Google Cloud/S3 and S3-compatible storage like Wasabi/B2/Azure/WebDav/rclone*
  • (3) see https://bford.info/cachedir/
  • (4) For bupstash, CRIME/BREACH style attacks are mitigated if you disable read access for backup clients, and keep decryption keys off server.

A quick word about backup coherence:

While some backup tools might detect filesystem changes in flight, it is usually the job of a snapshot system (zfs, bcachefs, lvm, btrfs, vss...) to provide the backup program with a reliable, static view of the filesystem. Still, in-flight change detection is really nice to have in order to detect problems with backups taken without such snapshot-aware tooling, e.g. on plain XFS/EXT4 partitions.
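
As an illustration of the snapshot approach (this is not part of backup-bench; volume names and the backup command are placeholders), an LVM-based run could look like this:

```sh
# Sketch: take an LVM snapshot so the backup tool sees a frozen, coherent filesystem.
lvcreate --snapshot --name data_snap --size 5G /dev/vg0/data
mount -o ro,nouuid /dev/vg0/data_snap /mnt/snap   # nouuid is needed when snapshotting XFS
borg create ssh://user@target/backup::data-$(date +%F) /mnt/snap   # placeholder backup command
umount /mnt/snap
lvremove -f /dev/vg0/data_snap
```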

Results

2022-10-02

Used system specs

  • Source system: Xeon E3-1275, 64GB RAM, 2x SSD 480GB (for git dataset and local target), 2x4TB disks 7.2krpm (for bigger dataset), using XFS, running AlmaLinux 8.6

  • Remote target system: AMD Turion(tm) II Neo N54L Dual-Core Processor (yes, this is old), 6GB RAM, 2x4TB WD RE disks 7.2krpm using ZFS 2.1.5, 1x 1TB WD Blue using XFS, running AlmaLinux 8.6

  • Target system has an XFS filesystem as target for the Linux kernel backup tests

  • Target system has a ZFS filesystem as target for the qemu backup tests. ZFS has been configured as follows:

    • zfs set xattr=off backup
    • zfs set compression=off backup # Since we already compress, we don't want to add another layer here
    • zfs set atime=off backup
    • zfs set recordsize=1M backup # This could be tuned as per backup program...

source data for local and remote multiple git repo versions backup benchmarks

Linux kernel sources, initial git checkout v5.19, then changed to v5.18, v4.18 and finally v3.10 for the last run. The initial git directory totals 4.1GB, for 5039 directories and 76951 files. Using env GZIP=-9 tar cvzf kernel.tar.gz /opt/backup_test/linux produced a 2.8GB file. Using "best" compression with tar cf - /opt/backup_test/linux | xz -9e -T4 -c - > kernel.tar.xz produces a 2.6GB file, so there is probably a lot of room for deduplication in the source files, even without running multiple consecutive backups of different points in time of the git repo.

backup multiple git repo versions to local repositories


Numbers:

Times are in seconds; repository sizes are in kilobytes (this also applies to the other result tables below).

| Operation | bupstash 0.11.1 | borg 1.2.2 | borg_beta 2.0.0b2 | kopia 0.12.0 | restic 0.14.0 | duplicacy 2.7.2 |
|---|---|---|---|---|---|---|
| backup 1st run | 9 | 41 | 55 | 10 | 23 | 32 |
| backup 2nd run | 11 | 22 | 25 | 4 | 8 | 13 |
| backup 3rd run | 7 | 28 | 39 | 7 | 17 | 23 |
| backup 4th run | 5 | 20 | 29 | 6 | 13 | 16 |
| restore | 4 | 16 | 17 | 5 | 9 | 11 |
| size 1st run | 213268 | 257300 | 265748 | 259780 | 260520 | 360200 |
| size 2nd run | 375776 | 338760 | 348248 | 341088 | 343060 | 480600 |
| size 3rd run | 538836 | 527732 | 543432 | 529812 | 531892 | 722176 |
| size 4th run | 655836 | 660812 | 680092 | 666408 | 668404 | 894984 |

Remarks:

  • kopia was the best all-round performer on local backups when it comes to speed, but is quite CPU intensive.
  • bupstash was the most space efficient tool and is not CPU hungry.
  • For the next instance, I'll need to post CPU / Memory / Disk IO usage graphs from my Prometheus instance.

backup multiple git repo versions to remote repositories

  • Remote repositories are SSH (+ remote binary) for bupstash and borg.
  • Remote repository is SFTP for duplicacy.
  • Remote repository is HTTPS for kopia (kopia server with 2048 bit RSA certificate)
  • Remote repository is HTTPS for restic (rest-server 0.11.0 with 2048 bit RSA certificate)


Numbers:

| Operation | bupstash 0.11.1 | borg 1.2.2 | borg_beta 2.0.0b2 | kopia 0.12.0 | restic 0.14.0 | duplicacy 2.7.2 |
|---|---|---|---|---|---|---|
| backup 1st run | 10 | 47 | 67 | 72 | 24 | 32 |
| backup 2nd run | 12 | 25 | 30 | 32 | 10 | 15 |
| backup 3rd run | 9 | 36 | 47 | 54 | 19 | 23 |
| backup 4th run | 7 | 31 | 50 | 46 | 21 | 23 |
| restore | 170 | 244 | 243 | 258 | 28 | 940 |
| size 1st run | 213240 | 257288 | 265716 | 255852 | 260608 | 360224 |
| size 2nd run | 375720 | 338720 | 348260 | 336440 | 342848 | 480856 |
| size 3rd run | 538780 | 527620 | 543204 | 522512 | 531820 | 722448 |
| size 4th run | 655780 | 660708 | 679868 | 657196 | 668436 | 895248 |

Remarks:

  • With restic's recent release 0.14.0, remote speeds using rest-server increased dramatically and are now on par with local backup results.
  • All other programs take about 5-10x more time to restore than the initial backup, except for duplicacy, which shows a 30x factor, which is really bad.
  • Since the last benchmark series, kopia 0.12.0 was released, which resolves the remote bottleneck.
  • I finally switched from ZFS to XFS on the remote filesystem so we have comparable file sizes between local and remote backups.
  • Noticing bad restore results, I've tried to tweak the SSH server (a minimal sshd_config sketch follows this list):
    • The best cipher algorithm on my repository server was chacha20-poly1305 (found with https://gist.github.com/joeharr4/c7599c52f9fad9e53f62e9c8ae690e6b)
    • Compression disabled
    • X11 forwarding disabled (was already disabled)
    • The above settings were applied to sshd, so even duplicacy gets to use them, since I didn't find a way to configure those settings for duplicacy
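
For reference, a minimal sketch of these sshd tweaks on the repository server (the heredoc is illustrative; chacha20-poly1305@openssh.com is the OpenSSH name for the cipher mentioned above):

```sh
# Sketch: apply the tweaks listed above to sshd on the repository server, then reload it.
cat >> /etc/ssh/sshd_config <<'EOF'
Ciphers chacha20-poly1305@openssh.com
Compression no
X11Forwarding no
EOF
systemctl reload sshd
```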

backup private qemu disk images to remote repositories

Source data are 8 qemu qcow2 files and 7 virtual machine description JSON files, for a total of 366GB. Remote repositories are configured as above, except that I used ZFS as the backing filesystem.


Numbers:

| Operation | bupstash 0.11.1 | borg 1.2.2 | borg_beta 2.0.0b2 | kopia 0.12.0 | restic 0.14.0 | duplicacy 2.7.2 |
|---|---|---|---|---|---|---|
| initial backup | 4699 | 7044 | 6692 | 12125 | 8848 | 5889 |
| initial size | 121167779 | 123111836 | 122673523 | 139953808 | 116151424 | 173809600 |

Remarks:

As I did the backup benchmarks, I computed the average size of the files in each repository using the following command:

find /path/to/repository -type f -printf '%s\n' | awk '{s+=$0}
  END {printf "Count: %u\nAverage size: %.2f\n", NR, s/NR}'

Results for the linux kernel sources backups:

| Software | Source | bupstash 0.11.1 | borg 1.2.2 | borg_beta 2.0.0b2 | kopia 0.12 | restic 0.14.0 | duplicacy 2.7.2 |
|---|---|---|---|---|---|---|---|
| File count | 61417 | 2727 | 12 | 11 | 23 | 14 | 89 |
| Avg file size (kB) | 62 | 42 | 12292 | 13839 | 6477 | 10629 | 2079 |

I also computed the average file sizes in each repository for my private qemu images which I backup with all the tools using backup-bench.

Results for the qemu images backups:

| Software | Source | bupstash 0.11.1 | borg 1.2.2 | borg_beta 2.0.0b2 | kopia 0.12 | restic 0.14.0 | duplicacy 2.7.2 |
|---|---|---|---|---|---|---|---|
| File count | 15 | 136654 | 239 | 267 | 6337 | 66000 | 41322 |
| Avg file size (kB) | 26177031 | 850 | 468088 | 469933 | 22030 | 17344875 | 3838 |

Interestingly enough, bupstash is the only software that produces sub-megabyte chunks. Of the above 136654 files, only 39443 weigh more than 1MB. The qemu disk images are backed up to a ZFS filesystem with recordsize=1M. In order to measure the size difference, I created a ZFS filesystem with a 128k recordsize and copied the bupstash repo to it. This made the bupstash repo roughly 12% smaller (from 137364728 kB down to 121167779 kB). Since bupstash uses smaller chunk file sizes, I will continue using the 128k recordsize for the ZFS bupstash repository.
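
For reference, the recordsize comparison can be reproduced roughly like this (dataset names are placeholders):

```sh
# Sketch: copy the bupstash repo onto a 128k-recordsize dataset and compare space usage.
zfs create -o recordsize=128k backup/bupstash_128k
rsync -a /backup/bupstash/ /backup/bupstash_128k/
zfs get used,recordsize backup/bupstash backup/bupstash_128k
```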

Footnotes

  • Getting restic SFTP to work with a different SSH port made me roam the restic forums and try various setups; I didn't succeed in getting the RESTIC_REPOSITORY variable to work with that configuration (a possible workaround is sketched after this list).
  • duplicacy wasn't as easy to script as the other tools, since it modifies the source directory (by adding a .duplicacy folder), so I had to exclude that folder from all the other backup tools.
  • The necessity for duplicacy to cd into the directory to backup/restore doesn't feel natural to me.
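
For what it's worth, one commonly suggested workaround for restic + SFTP on a non-standard SSH port is an SSH host alias; a sketch follows (alias, host, port and paths are placeholders, not the exact setup used here):

```sh
# Sketch: carry the custom port in an SSH alias, then point restic's SFTP backend at it.
cat >> ~/.ssh/config <<'EOF'
Host backup-target
    HostName target.example.org
    Port 2222
    User backupuser
EOF
export RESTIC_REPOSITORY=sftp:backup-target:/backup/restic
restic snapshots   # should now reach the repository over the custom port
```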

EARLIER RESULTS

Links

As of 6 September 2022, I've posted an issue in every backup program's Git repository asking if they could review this benchmark repo.


backup-bench's Issues

speed

... depends on a lot of things and might be hard to compare.

just a few insights (from borg development):

  • borg 1.x does some information gathering based primarily on the filename, assuming that same filename means same file.
  • borg 2.0 will play a bit more on the safe side considering race conditions due to changing file systems, so it opens the file to get a file descriptor (fd) and then does the information gathering using the fd. the fd will always refer to the same fs object.
  • borg >= 1.2 checks if a file has changed while it was backed up.
  • these are a few reasons why more recent borg versions got a bit slower than older ones, especially on NFS, because open() and stat() are slow there.

So, sometimes speed == quick & dirty and slower == better / safer.

The less you do, the faster you get. The question is then if you still do enough / all that is needed.

about features

it's hard to compare backup tools as some features and policies/styles will be different and hard to compare.

nevertheless, some related ideas (see feature table in the README):

Reliability | Redundant index copies

Not sure what kopia implements there, but in general, an index (or cache) is usually a data structure for quicker access to some primary/authoritative data. If the index (or cache) is lost, it can be rebuilt from the primary data.

So not sure why it needs "redundant index copies" considering the index (or cache) by definition is redundant to the primary data it is indexing (or caching).

Continue restore on bad blocks in repository

Usually there is some separate check / repair function to bring the repository into an as-good-as-it-gets state.
But this is not necessarily done "on the fly" within a restore operation.

File management: filesystem flags missing

Some flags like immutable need yet another api (besides stat / xattrs / acls).

File management | Automatically excludes CACHEDIR.TAG(3) directories

borg can do that, but (IIRC) not by default; only if you give the CLI option to enable this feature.

Remove the "automatically" part maybe?

Dedup & compression efficiency | Can files be excluded from compression by extension

borg actually had code for that for a while (not sure if ever released), but it was removed again because it creates quite a lot of configuration burden on the user to manage all these extensions and what should be compressed or not. borg now has an auto,X compression mode which first tries to predict compressibility using lz4 (super fast) and then runs compressor X if the data is likely well compressible. In the other cases it uses either no compression or the already computed lz4 result, if that is shorter.

Are encryption protocols sure (AES-256-GCM / PolyChaCha / etc ) ?

You mean "secure".

What I would recommend here is checking for modern AE / AEAD (authenticated encryption [w/ associated data]) crypto.

AES-GCM, AES-OCB, Chacha20-Poly1305 are well known "ready to use" AEAD combos (borg2 will use 2nd and 3rd).

borg1 uses AES-CTR + HMAC-SHA256 (or blake2b) self-arranged combo due to its roots in / compatibility with attic backup.

Security | Can encrypted / compressed data be guessed (CRIME/BREACH style attacks)?

IIRC there is something about this in our FAQ or on the issue tracker. Result was that it is not a problem in this scenario.

Misc | Does the backup software support pre/post execution hooks?

Is this a feature or anti-feature? :-)

Anyway, for borg this feature does not make sense because it is a CLI tool. So you use either a shell script (then you can put your pre/post commands just in that script) or a wrapper or GUI (like borgmatic, vorta - then it is not borg's problem).

Duplicacy

Hi @deajan,
Awesome work!

I took a look at your script and I have a few suggestions:

1-For backup and restore commands, please use the -threads option with 8 threads for your setup. It will significantly increase speed.

Increase -threads from 8 until you saturate the network link or see a decrease in speed (a combined example is sketched at the end of this comment).

2-During init please play with chunk size:

-chunk-size, -c the average size of chunks (default is 4M)
-max-chunk-size, -max the maximum size of chunks (default is chunk-size*4)
-min-chunk-size, -min the minimum size of chunks (default is chunk-size/4)

With homogeneous data, you should see smaller backups and better deduplication. see Chunk size details

3-Some clarifications for your shopping list on Duplicacy:

1-Redundant index copies : duplicacy doesn't use indexes. (or db)
2-Continue restore on bad blocks in repository: yes, and Erasure Coding
3-Data checksumming: yes
4-Backup mounting as filesystem: No (fuse implementation PR but not likely short term)
5-File includes / excludes bases on regexes: yes
6-Automatically excludes CACHEDIR.TAG(3) directories: No
7-Are metadatas encrypted too ?: yes
8-Can encrypted / compressed data be guessed (CRIME/BREACH style attacks)?: No
9-Can a compromised client delete backups?: No (with pub key and immutable target->requires target setup)
10-Can a compromised client restore encrypted data? No (with pub key)
11-Does the backup software support pre/post execution hooks?: yes, see Pre Command and Post Command Scripts
12-Does the backup software provide a crypto benchmark ?: there is a Benchmark command.

Important:

13-Duplicacy is serverless: less cost, less maintenance, less attack surface.
This also means that Duplicacy will always be a bit slower, since it has to do a listing before it uploads a particular chunk.
14-Duplicacy works with a ton of storage backends: infinitely scalable and more secure.
15-No indexes or databases.


16-You should test partial restore
17-Test data should be a little bit more diverse. But I guess this is difficult
Hope this helps a bit. Feel free to join the Forum.

Keep up the good work.
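
Putting suggestions 1 and 2 together, a minimal sketch (snapshot id, chunk size and storage URL are placeholders based on the options listed above):

```sh
# Sketch: init with a tuned average chunk size, then back up and restore with 8 threads.
# -max-chunk-size / -min-chunk-size could be tuned the same way.
duplicacy init -c 1M linux-kernel sftp://user@target/backup/duplicacy
duplicacy backup -threads 8
duplicacy restore -r 1 -threads 8
```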

What about rustic?

There is rustic, which is presented as the successor of restic, but written entirely in Rust instead of Go.

  • Rustic is compatible with restic backup.
  • Rustic has more features than restic, I think.
  • The Rustic developer is part of the Restic development team and maintains both Rustic and Restic. He said that Rustic generally uses fewer resources (CPU time and memory) than Restic.

Can you benchmark Rustic to see if it is faster and more efficient than Restic?

Thanks!

Bupstash sync

Bupstash supports syncing backups to other repositories - Don't know if any other tool supports this.

Test also on remote source (NFS)

Hi, I suggest also adding a config option for backing up a remote source over NFS (not just a remote repo, as is already done); the source may also be on the same local system. Thanks.

Default vs "best" settings

Just throwing an idea out there, not sure how viable this is. Is it worth running the benchmarks two ways: (1) using default settings and (2) using settings optimized for this workload?

I ask because almost all of these programs have tweaks that can improve performance in certain respects. For example, with Kopia you can use smaller chunks to enhance deduplication and thus reduce repository size. Kopia uses 4M chunks by default, because this is most efficient when the number of files gets large, but in a scenario with fewer files, changing it to 1M chunks may result in a smaller backup repository. I am sure the other programs have similar settings that can be tweaked.
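
As a rough illustration (and an assumption on my part: I believe kopia exposes its chunking strategy via an --object-splitter option at repository creation time, and compression via policies; both should be verified against kopia's --help output):

```sh
# Sketch, flags unverified: create a repo with ~1M dynamic chunks and enable zstd compression.
kopia repository create filesystem --path=/backup/kopia --object-splitter=DYNAMIC-1M-BUZHASH
kopia policy set --global --compression=zstd
```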

Incremental restore vs full restore

Bupstash supports incremental restores - it will only download files it detects have changed. This may be faster, I am not sure if other tools support this.

Backup to cloud storage?

Would it be possible to do a test when backing up to cloud storage? As long as your local internet does not charge you for bandwidth, you can use Oracle Cloud Infrastructure as the target to get 10GB free storage, so that you don't have to pay for the tests, although Oracle may charge you for API calls. Scaleway offers free 75GB storage, which can be used too. Scaleway also does not charge for API calls.

I am happy to put some money in the pot to help fund this test.

usage of `duplicacy init`

- duplicacy wasn't as easy to script as the other tools, since it modifies the source directory (by adding .duplicacy folder) so I had to exclude that one from all the other backup tools.

You can use duplicacy init -repository source-dir ...; the .duplicacy directory is then created in the current directory, so it won't end up inside source-dir.

I agree the CLI of duplicacy is weird 😄
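
Based on that suggestion, a sketch (paths, snapshot id and storage URL are placeholders):

```sh
# Sketch: run init from a dedicated preferences directory so .duplicacy never lands in the source tree.
mkdir -p /opt/duplicacy-prefs && cd /opt/duplicacy-prefs
duplicacy init -repository /opt/backup_test/linux linux-kernel sftp://user@target/backup/duplicacy
duplicacy backup -threads 8
```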
