
zam's Introduction

ZFS Automatic Manager

ZAM is a Python-based command-line tool for maintaining a ZFS file system.

ZAM is very much a work in progress, and the name is certainly not final.

Currently, the only feature ZAM supports is periodically taking snapshots and replicating them to remote servers. ZAM cannot even delete old snapshots yet, although that feature is a top priority.

Development Process

Run these commands from the root of the ZAM repository to set up the development environment:

python3 -m venv ".venv" --prompt "ZAM"
source ".venv/bin/activate"
.venv/bin/python3 -m pip install --upgrade pip
pip install -e '.'
pip install -r requirements_dev.txt

Before each commit, run these commands and fix any issues:

black --line-length 80 src tests
mypy src/zam/task.py
flake8 src tests
pytest

Installation

To install ZAM itself, run these two commands:

python setup.py build
python setup.py install --optimize=1

To install Linux-specific files that may be useful (e.g. example config file, systemd service), run this command:

make install_utils

zam's People

Contributors

ivan-johnson

zam's Issues

Simplify permission management

  • Can we split ZAM into two users? One for creating snapshots, one for deleting them? Does that improve security at all? If not, we should at least have one for local and one for remote.
  • We should make it easy to give the ZAM user(s) the right permissions. We could do sudo zam --set-zfs-permissions, but I don't like giving ZAM root, even if only briefly. We could instead do zam --get-permission-commands, with the expectation that the user manually runs each command with an understanding of what it does.

I tentatively want this implemented before adding ZAM to the AUR. If my opinion changes, the priority will be reduced.
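The `--get-permission-commands` idea could be sketched with ZFS's delegated-administration mechanism (`zfs allow`). A minimal sketch follows; the user names (`zam-snapshot`, `zam-prune`) and the exact permission sets are my assumptions, not settled design:

```python
# Hypothetical sketch of `zam --get-permission-commands`: print the
# `zfs allow` delegations each ZAM user would need, so the admin can
# review and run each one manually instead of giving ZAM root.
# User names and permission sets below are assumptions for illustration.
SNAPSHOT_PERMS = ["snapshot", "send", "hold", "mount"]
PRUNE_PERMS = ["destroy", "mount"]

def permission_commands(dataset: str) -> list[str]:
    return [
        f"zfs allow -u zam-snapshot {','.join(SNAPSHOT_PERMS)} {dataset}",
        f"zfs allow -u zam-prune {','.join(PRUNE_PERMS)} {dataset}",
    ]

for cmd in permission_commands("s-root/data/home/i"):
    print(cmd)
```

Splitting snapshot-creation and snapshot-destruction rights across two users would at least ensure a bug in the pruning code cannot run with more destructive permissions than it needs.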

ZAM Design Spike

Thus far when creating ZAM, I've basically skipped the design phase. This was a mistake.

We need, for example:

  • abstraction layers (protocols)
  • a (UML?) document describing the relation between the different classes/protocols
  • ...

Remove old config code

Until the new config design is finalized, we should not waste time maintaining the current config code. As such, it should be deleted and instead we should just hard code values for testing.

The run function in the task protocol has too many responsibilities

The run function in the task protocol of scheduler.py is responsible for both running the task and computing the time at which it should next be run. These responsibilities should be split into separate getNextActionableDatetime (?) and run functions.

Besides just making for a cleaner interface, this would also allow the scheduler to fully initialize its prioritized heap without having to actually run each task.
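The proposed split could look roughly like this, assuming a typing.Protocol-based task interface and a heap keyed on next runtime (names are tentative, per the above):

```python
# Sketch of the split interface: the query method has no side effects, so the
# scheduler can order every task up front without running any of them.
import datetime
import heapq
from typing import Protocol

class Task(Protocol):
    def get_next_actionable_datetime(self) -> datetime.datetime:
        """Return when this task next needs to run; must not do any work."""
        ...

    def run(self) -> None:
        """Perform the task's work."""
        ...

def build_heap(tasks: list[Task]) -> list[tuple[datetime.datetime, int, Task]]:
    # The index breaks ties so heapq never compares Task objects directly.
    heap = [(t.get_next_actionable_datetime(), i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    return heap
```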

Support recursing on datasets

Currently ZAM config must specify each individual dataset. We should add a way to recurse on a dataset or zpool.

Recursive entries should have an optional blocklist (prefix match string literal, or regexp?).

When operating on a dataset, we should use the best fit from the config. e.g. if the config defines the foo and foo/bar datasets, then foo/bar/baz should use the settings from foo/bar. Equivalently (?), a config entry for foo/bar/baz defines an implicit blocklist entry on foo/bar (which in turn is blocked from foo).

It's unlikely the user will notice if snapshot replication fails

Nobody reads logs. How is the user expected to notice if ZAM repeatedly fails to update a replica? ZAM should somehow notify the user if it hasn't updated a replica for N days. Similarly, we should notify the user any time that ZAM itself crashes.

My current preference for the method of notification is a /etc/profile.d/*.sh script that basically cats some sort of ZAM log file. The user could clear the messages by either deleting the log file, or updating some sort of $HOME/.local/share/zam/ignore_message_before_this_date file.
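The login-time check that a /etc/profile.d/zam.sh script would delegate to could be sketched like this. The log format (ISO-8601 timestamp, space, message) and the ignore-file semantics are assumptions for illustration:

```python
# Sketch of the message filter behind the profile.d idea: show only the
# messages the user hasn't already acknowledged via the ignore-date file.
import datetime

def pending_messages(log_text: str, cutoff: datetime.datetime) -> list[str]:
    """Return messages at or after `cutoff`; older ones count as cleared."""
    messages = []
    for line in log_text.splitlines():
        # Assume each log line starts with an ISO-8601 timestamp.
        stamp, _, text = line.partition(" ")
        if datetime.datetime.fromisoformat(stamp) >= cutoff:
            messages.append(text)
    return messages

log = (
    "2024-05-01T00:00:00 replica 'olympus' is 9 days stale\n"
    "2024-06-01T00:00:00 replica 'olympus' is 40 days stale"
)
pending_messages(log, datetime.datetime(2024, 5, 15))
```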

ZAM needs a real installer

We need to simplify the process of setting up ZAM on a machine for the first time. In addition to the items described by #1, this may include:

  • Creating a daemon user for local use (i.e. for actually executing the ZAM script)
  • Creating a daemon user for remote instances of ZAM to use
  • Ensuring that the daemon users have sufficient ZFS permissions
  • Setting up SSH keys to access the remote replicas
  • Getting ZAM in the AUR

Refactor `task.get_next_runtime` to `task.get_next_possible_runtime`

Currently the task scheduler requires that each task be able to deduce exactly when it will next need to run. How should we handle cases where a task doesn't know when to run next? For example, a replicator_task might want to run only when the device is not in use. Currently it can schedule itself for midnight, but if the device is still in use when midnight arrives, its only recourse is to duplicate scheduling logic in the runner function.

In order to cleanly support this sort of use case the task.get_next_runtime function should be refactored to task.get_next_possible_runtime. Before the task is run, this function should be called again to see if the time has moved back.

De-duplicate config files

With the current config framework you might, for example, have to specify the exact same destination replica multiple times for many managed datasets. There should be some syntax for avoiding this. e.g. enable setting a default value from a broader context:

{
    "DEFAULT#managed-datasets[*].destinations": [
        {
            "remote-host": "olympus",
            "pool": "o-nas",
            "dataset": "zam/s-root/data/home/i",
            "windows": [
                {"max-age": {"weeks": 6}, "period": {"hours": 1}},
                {"max-age": {"months": 6}, "period": {"days": 1}},
                {"max-age": {"years": 6}, "period": {"weeks": 1}},
                {"period": {"months": 1}}
            ]
        }
    ],
    "managed-datasets": [
        {
            "source": {
                "pool": "s-root",
                "dataset": "data/home/i",
                "windows": [
                    {"max-age": {"days": 2}, "period": {"minutes": 10}},
                    {"max-age": {"weeks": 1}, "period": {"hours": 1}}
                ]
            },
            "snapshot-period": {"minutes": "10"},
            "replication-period": {"hours": "1"},
            "prune-period": {"hours": "1"}
        },
        {
            "source": {
                "pool": "s-root",
                "dataset": "root/default",
                "windows": [
                    {"max-age": {"weeks": 1}, "period": {"hours": 3}}
                ]
            },
            "snapshot-period": {"minutes": "10"},
            "replication-period": {"hours": "1"},
            "prune-period": {"hours": "1"}
        }
    ]
}

Or better yet, use some sort of super-fancy variables:

{
    "DEFINE#dest-default-dataset": "zam",
    "DEFINE#dest-default": {
        "remote-host": "olympus",
        "pool": "o-nas",
        "windows": [
            {"max-age": {"weeks": 6}, "period": {"hours": 1}},
            {"max-age": {"months": 6}, "period": {"days": 1}},
            {"max-age": {"years": 6}, "period": {"weeks": 1}},
            {"period": {"months": 1}}
        ]
    },
    "managed-datasets": [
        {
            "source": {
                "pool": "s-root",
                "dataset": "data/home/i",
                "windows": [
                    {"max-age": {"days": 2}, "period": {"minutes": 10}},
                    {"max-age": {"weeks": 1}, "period": {"hours": 1}}
                ]
            },
            "destinations": [
                {
		    "REF": "dest-default",
                    "dataset": "${dest-default-dataset}/s-root/data/home/i",
                }
            ],
            "snapshot-period": {"minutes": "10"},
            "replication-period": {"hours": "1"},
            "prune-period": {"hours": "1"}
        },
        {
            "source": {
                "pool": "s-root",
                "dataset": "root/default",
                "windows": [
                    {"max-age": {"weeks": 1}, "period": {"hours": 3}}
                ]
            },
            "destinations": [
                {
		    "REF": "dest-default",
                    "dataset": "${dest-default-dataset}/s-root/root/default",
                }
            ],
            "snapshot-period": {"minutes": "10"},
            "replication-period": {"hours": "1"},
            "prune-period": {"hours": "1"}
        }
    ]
}

Bear in mind that one of the primary objectives of this project is to use as few lines of code as possible. Is this feature worth the extra LOC?
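To gauge the LOC cost, here is a rough resolver sketch for the DEFINE#/REF scheme above. The semantics (REF merges the named definition under the current object with local keys winning; ${name} expands against string definitions) are my assumptions, not a finalized design:

```python
# Rough DEFINE#/REF resolver sketch. Note: pops DEFINE#/REF keys, so it
# mutates the dicts it is given; acceptable for a one-shot config load.
import re

def resolve(node, defines):
    if isinstance(node, dict):
        # Peel DEFINE#name keys off into the symbol table.
        for key in [k for k in node if k.startswith("DEFINE#")]:
            defines[key[len("DEFINE#"):]] = node.pop(key)
        # A REF key merges the named definition into the current object,
        # with locally specified keys taking precedence.
        if "REF" in node:
            base = dict(defines[node.pop("REF")])
            base.update(node)
            node = base
        return {k: resolve(v, defines) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, defines) for v in node]
    if isinstance(node, str):
        # Expand ${name} substitutions against the definitions.
        return re.sub(r"\$\{([^}]+)\}", lambda m: str(defines[m.group(1)]), node)
    return node
```

At roughly twenty lines, the variables approach may be compatible with the minimal-LOC objective, though error handling (undefined references, cycles) would add more.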
