
Comments (9)

m90 commented on August 23, 2024

When using GPG_PASSPHRASE the result of the tar command could (probably) directly be piped to gpg

Since this is not a bash script it's not quite that easy, but it could still be done without resorting to the intermediate file by chaining the tarWriter and the OpenPgpWriter instead. Maybe it would even be possible to upload in a streaming manner for some backends and remove the need for an intermediate file entirely.

This is a little complicated though (mostly because the code is written with that mechanism in mind), so it's much more than a one-line change.

If anyone wants to pick this up, I am happy to review and merge PRs. Otherwise I might be able to look into this at some point, but I cannot really make any estimates.
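
For illustration, a minimal sketch of what such a chain could look like in Go, assuming a helper for the existing archiving logic and a streaming-capable upload function (`writeFilesTo` and `uploadStream` are hypothetical names, and golang.org/x/crypto/openpgp is used for the symmetric encryption; this is not the project's actual code):

```go
package pipeline

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"

	"golang.org/x/crypto/openpgp"
)

// streamBackup chains tar -> gzip -> symmetric OpenPGP encryption -> io.Pipe,
// so the (optionally encrypted) archive never touches an intermediate file.
func streamBackup(writeFilesTo func(*tar.Writer) error, uploadStream func(io.Reader) error) error {
	pr, pw := io.Pipe()

	go func() {
		// Everything written to `plain` comes out encrypted on the pipe.
		plain, err := openpgp.SymmetricallyEncrypt(pw, []byte(os.Getenv("GPG_PASSPHRASE")), nil, nil)
		if err != nil {
			pw.CloseWithError(err)
			return
		}
		gz := gzip.NewWriter(plain)
		tw := tar.NewWriter(gz)

		err = writeFilesTo(tw)
		// Close in reverse order so every layer flushes its trailer.
		for _, c := range []io.Closer{tw, gz, plain} {
			if cerr := c.Close(); err == nil {
				err = cerr
			}
		}
		pw.CloseWithError(err) // a nil error closes the read side normally
	}()

	// A backend that accepts an io.Reader can consume the stream directly,
	// e.g. via a multipart upload.
	return uploadStream(pr)
}
```

The io.Pipe provides the glue here: the producing goroutine blocks whenever the uploader is not reading, so nothing ever needs to be buffered on disk.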


A side note question: the artifacts do get deleted properly after a backup run, right? So the storage footprint is only an issue while the backup is running, correct?


simboel commented on August 23, 2024

Thanks for the reply. Chaining the upload would've been my next question 👍

Maybe I can find some time to start developing this feature.


m90 commented on August 23, 2024

On a very high level this could work like this (off the top of my head, so it might be wrong in certain details):

  • (*script).file becomes an io.ReadWriter (?) instead of a string pointing to a file
  • methods on *script can now operate on this writer instead of reading / writing from the file
  • in case the storage backend requires working off a file, the ReadWriter can be flushed to an intermediate file; in case the backend supports streaming, it can be piped through (a rough sketch of this follows below)
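
A rough sketch of how those bullet points could translate into code, with hypothetical interface and field names (the real storage backends and the existing *script type will look different):

```go
package pipeline

import (
	"io"
	"os"
)

// Hypothetical backend interfaces: every backend can work off a file, some can
// additionally consume a stream directly.
type storage interface {
	UploadFile(path string) error
}

type streamingStorage interface {
	UploadStream(r io.Reader) error
}

type script struct {
	source io.Reader // was: a string pointing to the intermediate file
}

func (s *script) upload(b storage) error {
	// Backends that support streaming get the reader piped through directly.
	if sb, ok := b.(streamingStorage); ok {
		return sb.UploadStream(s.source)
	}
	// Otherwise the stream is flushed to an intermediate file as before.
	tmp, err := os.CreateTemp("", "backup-*.tar.gz")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	if _, err := io.Copy(tmp, s.source); err != nil {
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return b.UploadFile(tmp.Name())
}
```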

In case you decide to start working on this feel free to ask questions any time and also don't shy away from improving the pipeline if you find things that are rather odd right now.


m90 commented on August 23, 2024

I looked into how this could be implemented a little and found a tricky hidden detail: if a streaming pipeline is used that does create tar archive -> possibly encrypt -> upload, users who want to stop their containers during backup might see a considerable increase in downtime, because backpressure from the slower downstream stages could keep the archiving step busy for longer. I think it's mostly possible to work around this (emit an event once the tar writer has stopped writing, or similar), but it makes the required refactoring even bigger.
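
One hedged sketch of what such an event could look like (`writeArchive` and `restartContainers` are placeholders for existing logic, not actual functions of the project):

```go
package pipeline

import "log"

// archiveAndRestart restarts the stopped containers as soon as the tar writer
// has finished, even if encryption/upload are still draining the stream
// further down the pipeline.
func archiveAndRestart(writeArchive, restartContainers func() error) error {
	archiveDone := make(chan struct{})

	go func() {
		<-archiveDone
		// Containers only need to stay stopped while their volumes are read,
		// not while bytes are still being encrypted or uploaded.
		if err := restartContainers(); err != nil {
			log.Printf("error restarting containers: %v", err)
		}
	}()

	err := writeArchive() // blocks until all volume data has been pushed downstream
	close(archiveDone)
	return err
}
```

On its own this does not remove the backpressure problem (writeArchive still only returns once the downstream stages have accepted all bytes), which is where the buffering ideas discussed below come in.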

I'll try to think about how the entire script orchestration could be refactored so it can accommodate such requirements.


MaxJa4 commented on August 23, 2024

Maybe different modes, which the user can select, would make sense here:

  • Stream mode: Archive + Encrypt + Upload all at once as a pipeline (if the storage backend supports it)
  • Hybrid mode: Archive + Encrypt -> Upload (possible speedup from using more CPU resources to archive and encrypt simultaneously)
  • Sequential mode: Classic Archive -> Encrypt -> Upload (like now; default?)
  • Adaptive mode: Archive + Encrypt + Upload all at once, but also build up an adjustable buffer of archived + encrypted data (either a set amount or free storage space minus x%) so the container can be started again sooner, as the upload can drain the remaining buffer (a rough sketch follows below)
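
A very rough sketch of what such a bounded buffer between the archiving and upload stages could look like (purely illustrative; here the buffer is in memory and capped by a byte limit, whereas the adaptive mode described above would likely spool to disk up to the free-space limit):

```go
package pipeline

import "io"

const chunkSize = 1 << 20 // 1 MiB per buffered chunk

// boundedCopy copies src to dst through a buffer of at most maxBufferedBytes.
// The producer (archiver) can run ahead of the consumer (uploader) until the
// buffer is full, at which point it blocks, i.e. backpressure kicks in.
// Error handling is kept minimal for brevity.
func boundedCopy(dst io.Writer, src io.Reader, maxBufferedBytes int64) error {
	chunks := make(chan []byte, maxBufferedBytes/chunkSize)
	errc := make(chan error, 1)

	go func() {
		defer close(chunks)
		for {
			buf := make([]byte, chunkSize)
			n, err := src.Read(buf)
			if n > 0 {
				chunks <- buf[:n] // blocks once the buffer is full
			}
			if err != nil {
				if err == io.EOF {
					err = nil
				}
				errc <- err
				return
			}
		}
	}()

	for chunk := range chunks {
		if _, err := dst.Write(chunk); err != nil {
			return err
		}
	}
	return <-errc
}
```

A buffer size of zero degenerates into plain streaming, which is in line with the observation further down that stream mode is just adaptive mode with the buffer set to (nearly) zero.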

Adaptive mode would solve the backpressure issue. For example, with Backblaze B2 I usually have 250 Mbit/s upload (~30 MB/s), and with zstd as the compression method (magnitudes faster on my machines) plus multicore encryption... I'd assume that IO would often be faster than the network, but that's just a sample size of one (me). It makes sense to keep the backpressure issue in mind imo.

As an alternative to the suggested modes, having only adaptive and sequential could be fine too, as stream mode is just adaptive mode with the buffer set to zero or very low.


m90 commented on August 23, 2024

One thing that just occurred to me is that implementing archive-encrypt-upload in a streaming fashion would be a breaking change, as it would mean the command lifecycle would be changing, i.e. when streaming, there is no more possibility of running pre-archive commands or similar. It would just allow for pre and post commands. That's probably ok, but I wanted to leave it here as I just thought of it.


MaxJa4 commented on August 23, 2024

True, there would only be a start and a finish hook, no matter whether streaming or buffered streaming is used... unless the user-defined buffer is quite large; then it would make sense to restart the container early and trigger a post-archive hook.

The buffer size value could be (a rough parsing sketch follows the list):

  • 0 to essentially disable the buffer -> streaming
  • -1 for automatic size: amount of available space minus some offset -> buffered streaming with optimal downtime
  • Otherwise xM/xG or simply an integer in MB/GB -> buffered streaming with manual space usage limit
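
A hypothetical parser for such a setting could look like this (the option itself does not exist in the project yet; names and semantics are just the ones sketched in this thread, and plain integers are interpreted as bytes here):

```go
package pipeline

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBufferSize turns the configured value into a byte count:
// "0" disables the buffer, "-1" selects automatic sizing, and values such as
// "512M" or "4G" set an explicit limit.
func parseBufferSize(v string) (int64, error) {
	switch v {
	case "0":
		return 0, nil
	case "-1":
		return -1, nil // sentinel for "size automatically from free space"
	}
	mult := int64(1)
	switch {
	case strings.HasSuffix(v, "M"):
		mult, v = 1<<20, strings.TrimSuffix(v, "M")
	case strings.HasSuffix(v, "G"):
		mult, v = 1<<30, strings.TrimSuffix(v, "G")
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid buffer size %q", v)
	}
	return n * mult, nil
}
```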

I'd definitely keep the classic sequential processing around in the form of the automatically sized stream buffer (-1 buffer size) to optionally keep downtime of containers as low as possible without risking a storage issue.

Having all "modes" in basically one logic (but with different buffer sizes) would make the code base cleaner and configuration easier, since we wouldn't need two entirely different approaches (sequential vs streaming).

The default option should perhaps be the buffered streaming with a buffer size of -1 (automatic), since a long downtime is bad, but a full disk with potential crash/freeze/data-loss is worse IMO. That could be the best of both worlds: restart containers as soon as possible, but don't use more space than available.

I'd need to do some testing on whether df or similar Go functions inside the container report sensible values (for me, on an amd64 Win11 desktop and an arm64 Ubuntu server, df did)... only then is the automatic buffer size feasible.
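
For reference, this is roughly how the available space can be queried from Go on Linux without shelling out to df (standard syscall package; a hypothetical helper, not something the project currently does):

```go
package pipeline

import "syscall"

// availableBytes reports the free space at path as seen by unprivileged
// processes, similar to the "Avail" column of df. Linux-only as written.
func availableBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}
```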


m90 commented on August 23, 2024

I was thinking one could have some sort of "event" that signals the end of the archive stream, which would then trigger the restart while bytes are still being munged further down the stream, see #95 (comment)

Not sure how realistic that is though.

In any case I would argue optimizing for as little downtime as possible is more important than optimizing for disk space. Disk space is cheap, service downtime is not.


MaxJa4 commented on August 23, 2024

An event of some sort after the archiving stage is done definitely makes sense.

That's usually the case, yes. Maybe we can omit the buffer size entirely and just have buffer=true/false (working title), where true means automatic buffer size (use all available space minus some offset) for low downtime and false means no buffer for low space usage.

That being said, I don't know if there is any benefit in using buffer=false, since the space occupied by the automatic buffer is freed after the backup anyway... so why not just archive ASAP with the available buffer and get the containers running again quickly. It would mean less complexity in the implementation too.
If that's what you essentially meant, then I fully agree :)

