
blackbox's Issues

Configuration: Allow databases to specify storage and notifiers to use.

Rationale

So, a user might have two databases that should not be backed up onto the same storage providers, where one of them should send notifications to Discord while the other should go to Slack. How can we accommodate this?

One solution would be a more complex config file, but I want us to try to avoid this because it introduces additional user experience complexity to all users, even though this feature will only appeal to a few users. This is a bad UX investment.

Instead, let's handle it with optional connstring parameters!

Implementation

Allow all database connstrings to take two extra parameters, storage_providers and notifiers.

For example, redis://host:port?storage_providers=s3,dropbox&notifiers=slack would specify that the redis database should be backed up onto S3 and Dropbox, and then notified to Slack.

These are strictly optional. By default it will use all notifiers and all storage providers.

Multiple values can be provided, and are comma-separated.
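
Rough sketch of how the parsing could work with just the standard library (the function name and return shape are hypothetical, not the real blackbox API):

from urllib.parse import parse_qs, urlsplit

def get_targets(connstring: str) -> tuple[list[str], list[str]]:
    """Extract the optional storage_providers and notifiers params from a connstring."""
    query = parse_qs(urlsplit(connstring).query)
    # Comma-separated values; an empty list means "use all handlers of that kind".
    storage = query["storage_providers"][0].split(",") if "storage_providers" in query else []
    notifiers = query["notifiers"][0].split(",") if "notifiers" in query else []
    return storage, notifiers

# get_targets("redis://host:6379?storage_providers=s3,dropbox&notifiers=slack")
# -> (["s3", "dropbox"], ["slack"])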

But what if s3 is ambiguous?

In #15, we propose implementing support for multiple handlers of the same kind. If the user has two S3 storage providers, and wants to only select one of them, we will need some way of identifying which one it is.

  • If we have multiple providers of type s3, use all of them if a database specifies storage_providers=s3.
  • Allow a user to specify an ID for notifiers and providers to disambiguate. For example, s3://my.bucket.com?id=main_bucket will associate this ID with the bucket.
  • The database can now use this ID to select its target, e.g. storage_providers=dropbox,main_bucket
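
As a sketch, the selection could check the optional id first and fall back to the handler type (the provider attributes here are hypothetical):

def select_providers(targets: list[str], providers: list) -> list:
    """Keep every provider whose id or handler type was named in storage_providers."""
    selected = []
    for provider in providers:
        # provider.id and provider.handler_type are hypothetical attributes.
        if provider.id in targets or provider.handler_type in targets:
            selected.append(provider)
    return selected

# With two s3 providers and targets=["s3"], both are kept; with targets=["dropbox", "main_bucket"],
# only the dropbox provider and the bucket registered with id=main_bucket are kept.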

Package this app for PyPI

Let's get this application on PyPI.

The end result should be that when this application is pip installed, you get a blackbox command that does essentially the same thing python main.py currently does.

That greatly simplifies the work needed to get this working locally:

  • Install blackbox with pip install blackbox
  • Set up a cron job that runs blackbox however often you want.

It also means we can add additional CLI utilities, such as one for setting up the cron job, maybe an interactive configuration tool, or whatever else.

Okay, don't get carried away. What's this issue?

Yeah, so this issue is just that we want some way of installing this so that running blackbox will run python main.py. That's it. Probably involves creating a setup.py file and maybe looking into doing this in a nice PEP 518 compliant and future-proof way, maybe with a pyproject.toml?
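
A minimal sketch of what the setup.py entry point could look like (the module path blackbox.cli:cli is an assumption about where main.py's logic ends up):

from setuptools import find_packages, setup

setup(
    name="blackbox",
    packages=find_packages(),
    install_requires=[],  # to be filled in from the Pipfile
    entry_points={
        # Makes `blackbox` on the command line call the same code as `python main.py`.
        "console_scripts": ["blackbox = blackbox.cli:cli"],
    },
)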

This issue is solved when you can use pip install -e . to install blackbox - we'll handle CI and stuff in a separate issue.

Basic unit testing

Right now I've got some extremely lazy testing inside the model classes themselves, and we'll want to move this into actual unit tests sooner or later.

I'd like to keep testing as simple as possible for this project, and I'm thinking this might be a good fit for pytest with no coverage requirements, where we only test the essentials:

  • Can all the classes be instantiated?
  • Does the config system work?
  • Can we use the application as intended?
  • Are the utils working?

Tests fail on S3() instantiation

Pytest results:

tests/test_storage.py::test_s3_handler_can_be_instantiated FAILED                                                [100%]

====================================================== FAILURES =======================================================
_________________________________________ test_s3_handler_can_be_instantiated _________________________________________

config_file = None

    def test_s3_handler_can_be_instantiated(config_file):
        """Test if the GoogleDrive storage handler can be instantiated."""
        from blackbox.handlers.storage import S3
>       S3()

tests\test_storage.py:11:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
blackbox\handlers\storage\s3.py:48: in __init__
    self.client = boto3.client(
.venv\lib\site-packages\boto3\__init__.py:93: in client
    return _get_default_session().client(*args, **kwargs)
.venv\lib\site-packages\boto3\session.py:258: in client
    return self._session.create_client(
.venv\lib\site-packages\botocore\session.py:827: in create_client
    endpoint_resolver = self._get_internal_component('endpoint_resolver')
.venv\lib\site-packages\botocore\session.py:700: in _get_internal_component
    return self._internal_components.get_component(name)
.venv\lib\site-packages\botocore\session.py:924: in get_component
    self._components[name] = factory()
.venv\lib\site-packages\botocore\session.py:163: in create_default_resolver
    endpoints = loader.load_data('endpoints')
.venv\lib\site-packages\botocore\loaders.py:132: in _wrapper
    data = func(self, *args, **kwargs)
.venv\lib\site-packages\botocore\loaders.py:420: in load_data
    found = self.file_loader.load_file(possible_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <botocore.loaders.JSONFileLoader object at 0x00000236ADF396D0>
file_path = .venv\\lib\\site-packages\\botocore\\data\\endpoints'

    def load_file(self, file_path):
        """Attempt to load the file path.

        :type file_path: str
        :param file_path: The full path to the file to load without
            the '.json' extension.

        :return: The loaded data if it exists, otherwise None.

        """
        full_path = file_path + '.json'
        if not os.path.isfile(full_path):
            return

        # By default the file will be opened with locale encoding on Python 3.
        # We specify "utf8" here to ensure the correct behavior.
        with open(full_path, 'rb') as fp:
>           payload = fp.read().decode('utf-8')
E           AttributeError: 'str' object has no attribute 'decode'

.venv\lib\site-packages\botocore\loaders.py:173: AttributeError

Support multiple handlers of the same kind

Right now, we can only support one of each handler type - but what if you have two postgres databases in completely different places?

Let's find some way to dynamically instantiate one handler per connstring. For example, we could move the connstring parser out of the mixin, and then instantiate the handler with a factory method where we pass in the connstring. This would probably be tidier and less magical, anyway.
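
Sketch of the factory idea (class and method names are hypothetical, not the existing handler API):

class BlackboxDatabase:
    """Hypothetical base handler; the real ones live under blackbox/handlers."""

    def __init__(self, connstring: str):
        # The connstring parser would move out of the mixin and run here, per instance.
        self.connstring = connstring

    @classmethod
    def from_connstrings(cls, connstrings: list[str]) -> list["BlackboxDatabase"]:
        """Instantiate one handler per configured connstring."""
        return [cls(connstring) for connstring in connstrings]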

Dynamic multi-stage notifiers

Currently, our notifiers only do a single thing - they notify when a job is done, and they do this at the end of the job.

For some notifiers, this is fine. For example, if we add an Email notifier, it makes sense to just send a single mail at the end of the job. However, for notifiers like Discord, we can do better than this. We can have the notifier dynamically update its message as the job progresses!

How would this look?

When blackbox starts up, the notifier sends its first message. It lists all the databases and all the storage methods, but shows all as pending, using 🟠 as the emoji.

(screenshot: the initial notifier message, with everything pending 🟠)

Next, some of these will start, and the emoji changes to ♻️ (for collecting backup). When the backup has been collected and it starts uploading, it changes to ⬆️. Finally, it changes to ✅ when the process is complete. If it fails at any point, it changes to ⛔.

(screenshots: the same message updating as the statuses change)

Yeah but, how do we do this?

I'm going to leave that up to you, but the basic idea will be something like this:

  • Instead of notifiers being called at the end of cli.py, they should probably be passed into the database and storage handler objects, and then called from inside the relevant methods.
  • We should support multiple simultaneous notifiers.
  • We need some way of tracking more granular states, and more emojis to correspond with each state.

I'm intentionally leaving the implementation details a bit vague, because it'll probably be easier if you have a bit of leeway on how to implement, and because it's an interesting challenge. Do chat with me on the lemonsaurus Discord if you'd like to discuss ideas for implementation, though.
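
One possible shape for the granular states, just as a sketch (the emoji come from the description above; the update call at the bottom is a hypothetical method):

from enum import Enum


class BackupState(Enum):
    """Granular progress states, mapped to the emoji shown in the notifier message."""

    PENDING = "🟠"
    COLLECTING = "♻️"
    UPLOADING = "⬆️"
    DONE = "✅"
    FAILED = "⛔"


# A notifier could then expose something like:
# notifier.update(database="postgres", state=BackupState.UPLOADING)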

Add support for integer port values (Postgres env)

Trace:
blackbox | Traceback (most recent call last):
blackbox | File "/usr/local/bin/blackbox", line 33, in
blackbox | sys.exit(load_entry_point('blackbox-cli', 'console_scripts', 'blackbox')())
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in call
blackbox | return self.main(*args, **kwargs)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
blackbox | rv = self.invoke(ctx)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
blackbox | return ctx.invoke(self.callback, **ctx.params)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
blackbox | return callback(*args, **kwargs)
blackbox | File "/blackbox/blackbox/cli.py", line 136, in cli
blackbox | success = run()
blackbox | File "/blackbox/blackbox/cli.py", line 37, in run
blackbox | backup_file = database.backup()
blackbox | File "/blackbox/blackbox/handlers/databases/postgres.py", line 20, in backup
blackbox | self.success, self.output = run_command(
blackbox | File "/blackbox/blackbox/utils/commands.py", line 27, in run_command
blackbox | result = subprocess.run(
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 505, in run
blackbox | with Popen(*popenargs, **kwargs) as process:
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 951, in init
blackbox | self._execute_child(args, executable, preexec_fn, close_fds,
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 1743, in _execute_child
blackbox | env_list.append(k + b'=' + os.fsencode(v))
blackbox | File "/usr/local/lib/python3.9/os.py", line 810, in fsencode
blackbox | filename = fspath(filename) # Does type-checking of filename.
blackbox | TypeError: expected str, bytes or os.PathLike object, not int
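
The crash happens because an integer port ends up as a value in the subprocess environment, and os.fsencode only accepts str, bytes or path-like values. A minimal sketch of the fix (the dict below just stands in for whatever the Postgres handler actually builds):

connection_env = {"PGHOST": "localhost", "PGPORT": 5432, "PGUSER": "blackbox"}  # hypothetical values

# Coerce every value to str before handing the mapping to subprocess.run(env=...),
# so a port configured as an integer no longer trips os.fsencode.
safe_env = {key: str(value) for key, value in connection_env.items()}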

Rotation: Don't delete irrelevant files

Currently when rotating, we'll just delete everything that's older than retention_days. This is problematic, because we may delete something that isn't a backup file.

Let's implement a solution where each handler has a regex that matches its output. For example, if the Postgres handler outputs a file like postgres-backup-2020-01-01.sql, then we should have a regex to match it that looks something like r"postgres-backup-\d{4}-\d{2}-\d{2}".

Now when we're doing the rotation, the rotate method can ensure that it only deletes files that match these expressions.
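
Sketch of what the rotate step could look like (the pattern registry and function signature are hypothetical):

import re
from datetime import datetime, timedelta
from pathlib import Path

# Hypothetical registry; in practice each handler would contribute its own pattern.
BACKUP_PATTERNS = [re.compile(r"postgres-backup-\d{4}-\d{2}-\d{2}")]


def rotate(folder: Path, retention_days: int) -> None:
    """Delete files older than retention_days, but only if a handler pattern matches them."""
    cutoff = datetime.now() - timedelta(days=retention_days)
    for file in folder.iterdir():
        is_backup = any(pattern.search(file.name) for pattern in BACKUP_PATTERNS)
        is_old = datetime.fromtimestamp(file.stat().st_mtime) < cutoff
        if is_backup and is_old:
            file.unlink()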

Lint the docstrings!

Let's lint docstrings.

We can use this plugin to do this.

Here are the rule ignores we need to add to tox.ini for this plugin:

# Missing Docstrings
D100,D104,D105,D107,
# Docstring Whitespace
D203,D212,D214,D215,
# Docstring Quotes
D301,D302,
# Docstring Content
D400,D401,D402,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414,D416,D417

We can include those comments in the tox.ini as well, and we still want to ignore whatever we're already ignoring.

Once this is done, you'll need to run pipenv run lint and fix all linting errors this creates.

commandline: path to config

Add commandline functionality to specify a path to a config

blackbox --config=path/to/config.yml

and of course have a sensible default
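
The CLI already appears to use click (see the traceback in the Postgres port issue above), so this could just be an option with a default; a rough sketch:

import click


@click.command()
@click.option("--config", default="config.yml", show_default=True,
              help="Path to the blackbox config file.")
def cli(config):
    """Hypothetical entry point; the real cli() lives in blackbox/cli.py."""
    click.echo(f"Loading config from {config}")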

Configuration: Allow configuration via environment variables

Right now, we require a config.yml file to contain all the secrets in order to configure the application. This is not always convenient, especially in container orchestration environments where secrets are managed through some external secrets manager.

Let's allow environment variable interpolation in the config.yml file in order to make this more flexible.

databases:
  - mongodb://{{ MONGO_USERNAME }}:{{ MONGO_PASSWORD }}@host:port
  - {{ POSTGRES_CONNECTION_STRING }}

Implementation

Let's use Jinja to parse the config file as a template, and inject the entire environment into the renderer.

Pseudocode:

config_text = Path("config.yml").read_text()
parsed_config = jinja2.Template(config_text).render(**os.environ)

Add encryption to the saved dumps

Abstract

Due to GDPR and security issues, support should be added for password and/or PGP encryption. This can be done through the pgpy library.

Rationale

While databases are usually encrypted, dumps aren’t, leaving the data at risk. To prevent that, it can be encrypted using a password.

This has its own flaws, as the password will have to be stored somewhere in cleartext. Asymmetric encryption using PGP avoids this: a public key is stored in the configuration file, and the developer keeps the private decryption key at home.

Specifications

A new configuration option can be added at the root level, or per individual storage provider, to select the encryption method to use and the password or ASCII-armored public key.

Data can be encrypted before uploading and the file (that will be made temporary by #88) will have to be securely erased.

Symmetric encryption can also be done through GPG, so there is a simple way to decrypt the file.
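
A rough sketch of the asymmetric path with pgpy (the key and dump values are placeholders, and the handler wiring is not shown):

import pgpy

armored_public_key = "..."  # placeholder: ASCII-armored public key from the config file
dump_bytes = b"..."         # placeholder: raw database dump contents

public_key, _ = pgpy.PGPKey.from_blob(armored_public_key)
message = pgpy.PGPMessage.new(dump_bytes)
encrypted = public_key.encrypt(message)  # only the holder of the private key can decrypt this
ciphertext = str(encrypted)              # ASCII-armored ciphertext to upload instead of the raw dump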

Publish Docker images with version numbers

Right now there is no way to fetch a specific semver from Docker Hub, as the only versions pushed are :latest and :{sha}.

Images should be pushed with the semvers as they are for PyPI:

(screenshot, 2021-03-26: the published PyPI release versions)

It's not ideal to pin to SHAs and pinning to :latest isn't a great idea for production, so tags along the lines of :2.0.0 or :2-latest (to fetch anything 2.X.X) would be nice.

Database: Zip folder archiving

Sometimes local folders outside of a database need to be archived. It would be quite useful to be able to provide a local folder (which could eventually be bind-mounted inside the Blackbox container) that gets zipped using the zipfile stdlib and backed up like any database dump. Compression level could also be added as an option.

I’d be interested in implementing this.
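
Sketch of the zipping itself with the zipfile stdlib (the function is hypothetical, not an existing handler):

import zipfile
from pathlib import Path


def archive_folder(folder: Path, archive_path: Path, compresslevel: int = 6) -> Path:
    """Zip every file under `folder` into `archive_path`, preserving relative paths."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=compresslevel) as zf:
        for file in folder.rglob("*"):
            if file.is_file():
                zf.write(file, arcname=file.relative_to(folder))
    return archive_path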

Notifier: Discord

We should support webhooks to various services whenever a backup completes.

For our purposes, the most obvious webhook is to Discord. When the entire backup process completes, we want to send a webhook with a status report for the whole job.

For example, a report might look like this:

Postgres: [ok]
Redis: [ok]
Mongo: [failed]
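
A Discord webhook only needs a plain HTTP POST, so the notifier could be as small as this sketch (the URL and report text are placeholders):

import requests

webhook_url = "https://discord.com/api/webhooks/..."  # placeholder: taken from the config

report = "Postgres: [ok]\nRedis: [ok]\nMongo: [failed]"

# Discord webhooks accept a JSON payload with a `content` field (embeds are also possible).
requests.post(webhook_url, json={"content": report})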

Database: MariaDB support

We'll need to add support for MariaDB, as this database is now going to be part of our stack at PythonDiscord.

Add pytest-testdox

Right now, when pytest shows test output, it just prints all the function names. This is ugly, and requires us to write long, verbose and descriptive function names, which is a silly way to document what a test does.

Instead, it would be better for us to put that information into a docstring, where we don't need to worry about linelength and other constraints.

Let's add the https://pypi.org/project/pytest-testdox/ tool to our toolchain, which outputs docstrings instead of function names when they exist, and otherwise strips the underscores from the function names. It makes the test report far more readable for humans.

We should add the --testdox option in the tox.ini file as well, so that we will use this option by default in all test runs.

S3 configuration XOR condition is incorrect

The S3 code says the following:

elif bool(key_id) ^ bool(secret_key):
    raise ImproperlyConfigured("You must configure either both or none of the AWS credential params.")

However if both the key_id and the secret_key are missing, this XOR evaluates to False.

Proof:

# Good
>>> True ^ True 
False  

# Bad: We want True here because both configs are missing
>>> False ^ False
False  

# Good
>>> True ^ False
True 

Instead I suggest either:

elif None in (key_id, secret_key):

or

elif not all([key_id, secret_key]):

Pass token errors to notifiers

Add a try/except for BadInputError

blackbox    | dropbox.exceptions.BadInputError: BadInputError('e9660aeb9d014230800f843faafde25a', 'Error in call to API function "files/list_folder": The given OAuth 2 access token is malformed.')
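
Rough sketch of catching the error in the Dropbox handler so it can be reported through the notifiers instead of crashing the run (the client setup and result fields are simplified assumptions):

import dropbox

try:
    client = dropbox.Dropbox("oauth2-access-token")  # placeholder token from the config
    client.files_list_folder("")                     # the call that currently raises BadInputError
except dropbox.exceptions.BadInputError as error:
    # Record the failure so the notifiers can include it in their report.
    success, output = False, str(error)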

Notifier: Slack

We've got Discord support, let's do Slack as well!

We should try to make it look the same as the Discord webhook:

(screenshot: the Discord webhook notification)

Use temporary files to save dumps

Rationale

Currently, dumps are saved under the home folder. If Blackbox exits unexpectedly during archiving, the file will be left there and not deleted. This can cause issues for both security and disk space.

Specification

tempfile.NamedTemporaryFile can be used to pass a temporary file handle directly to the database handler and have it retrieved by the main function. This will also allow us to have uniform file names in a DRY fashion.

Note: I’d be interested in implementing that
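
A minimal sketch of the tempfile usage (the suffix and contents are placeholders):

import tempfile

# With the default delete=True, the file is removed as soon as the handle is closed,
# so any exit that unwinds the `with` block (including exceptions) cleans up after itself.
with tempfile.NamedTemporaryFile(suffix=".sql") as dump_file:
    dump_file.write(b"-- placeholder dump contents --\n")
    dump_file.flush()
    # dump_file.name can be passed to the storage handlers for upload here.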

Logging handlers

We'll support just a single connstring for logging, and it will allow retrieving logs via arbitrary command execution.

For example:
logs://[user:password@host:port]?command="docker logs api"

  • If some combination of host, port, user and password is provided, we will use that information to SSH into a remote machine.
  • Then, we execute the command provided by the user.
  • The user must provide a command, otherwise we raise ImproperlyConfigured
  • If the host is localhost, we just execute the command locally.

Once we have our output, we store it as a text file, and return it. Then the storage handlers can sync them, just like the backups.
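
As a sketch, the handler could branch on the host; paramiko is one assumption for the SSH path (it isn't currently a dependency):

import shlex
import subprocess
from typing import Optional

import paramiko  # assumption: some SSH library is needed for the remote case


def fetch_logs(command: str, host: str = "localhost", port: int = 22,
               user: Optional[str] = None, password: Optional[str] = None) -> str:
    """Run the configured command locally, or over SSH when a remote host is given."""
    if host == "localhost":
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        return result.stdout
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, port=port, username=user, password=password)
    _, stdout, _ = client.exec_command(command)
    return stdout.read().decode()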

Notifiers: add support for Telegram bots

Telegram has half a billion users around the world, and it is free and simple to use. Creating a bot is fast and easy, and the bot can then be added to groups so that several people can monitor the backup process. All we need is the bot API token and a chat/user id; the user is responsible for creating the bot, and blackbox uses it as a proxy to send them a message.
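
Sketch of the actual send, which is just an HTTPS call to the Bot API (the token and chat id are placeholders):

import requests

bot_token = "123456:replace-me"  # placeholder: token issued by @BotFather
chat_id = "123456789"            # placeholder: the user or group chat to notify

requests.post(
    f"https://api.telegram.org/bot{bot_token}/sendMessage",
    json={"chat_id": chat_id, "text": "Backup complete: Postgres [ok], Redis [ok]"},
)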

Configuration: Notifier frequency

Currently, the notifiers trigger every time blackbox is run. This is not necessarily convenient:

  • If someone wants backups every 10 minutes, it'll get really spammy. In this case, we only really care about the notifiers if they fail.
  • Kubernetes CronJobs are not guaranteed to run only once, so we may get duplicate notifications. The actual storage provider upload is idempotent (because it'll just overwrite the file if you run it twice the same day), but the notifiers are not.

So, it would be convenient to be able to configure this to be a bit less noisy.

Implementation

Let's implement a new config option, notifier_frequency, that tells us how often we should show success notifications. If notifier_frequency is set to 1 day, we only show a success notification once per day. Failure notifications should always be shown.

I'm intentionally not specifying what format the frequency duration should be specified in. A fun solution would be something that supported a timestring like 1d12H or something similar to that. We have examples of how to do this in https://github.com/python-discord/bot/blob/master/bot/utils/time.py. However, honestly, I'd be fine if this was just in minutes or something, and we could set it to 1440 for a day. Maybe that's simpler? Up to you.
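
If we do go with a timestring, a small parser is enough; this is only a sketch of the 1d12H-style format mentioned above, plus minutes:

import re
from datetime import timedelta


def parse_frequency(value: str) -> timedelta:
    """Parse a simple timestring like '1d12H' or '90m' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)H)?(?:(\d+)m)?", value)
    days, hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


# parse_frequency("1d12H") -> timedelta(days=1, hours=12)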

Support for multiple Postgres versions

So, here's the thing. I think we already support every major Postgres version after 8.0, because the Postgres documentation for pg_dump states that it should work just fine for older versions back to 8.0.

But, we should make sure, and if we do, we should document this.

How do we test it, then?

  • Make a docker-compose file with postgres containers going all the way back to version 8. We want a file with versions 13, 12, 11, 10, 9.6, and 8. This compose file will be used only for this test.
  • Write a test that tests a dump from each of these.
  • This test should be run in a new workflow (test_postgres) using the matrix strategy to test all versions concurrently.

How should it be documented?

The readme currently states that we specifically support Postgres 13. It should instead state that we support all major versions since 8.0, or whatever ends up being true.

Improve connstring parsers

Our connstring parsers are currently too simple. We should try to adhere to the connstring specs used by Postgres and Mongo - for example, every part of the connstring is optional, so postgres:// is a valid connstring, as is postgresql://user@host.

Let's make some improvements so that these specs are at least more or less followed. Let Postgres and Mongo provide the defaults if these are not provided.

Simplify config.yaml - connstrings everywhere!

We could just base the entire system on connstrings. For example, here's how our config might look:

databases:
- mongodb://username:password@host:port
- postgres://username:password@host:port

logging:
- ssh://username:password@host:port

storage:
- gdrive://username:token

rotation_days: 7

And based on this, it'll figure out what's enabled, what's disabled, and how to log into all these services. There's no need for enabled bools and environment variables and whatever. We'll just pass custom connstrings for every service and every type, and repurpose our connstring parser to work for any connstring.

When writing a generic connstring parser, we should try to adhere to the connstring specs used by Postgres and Mongo - for example, every part of the connstring is optional, so postgres:// is a valid connstring, as is postgresql://user@host. We should also support optional params in the parser.

Sanitize logging output

We should not - under any circumstances - be allowing logging output to include the interpolated config values, since this can include passwords and other high-security secrets that we don't want to send over webhooks and emails or whatever.

This is pretty easy to solve, though. Just go through the logging output and replace all config values with asterisks or something.
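
A sketch of the replacement step (the list of secret values would come from the parsed config):

def sanitize(log_output: str, secret_values: list) -> str:
    """Replace every known config value in the log output with asterisks."""
    for secret in secret_values:
        if secret:
            log_output = log_output.replace(str(secret), "*" * len(str(secret)))
    return log_output


# sanitize("connecting with password hunter2", ["hunter2"])
# -> "connecting with password *******"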

Move to Poetry

Right now we're using both Pipenv and setup.py, which feels bad. We're managing the dependencies in both Pipfile and setup.py, and the Dockerfile is relying on pip install -e . to install dependencies.

Basically this is a mess. We should be using PEP 517 compatible dependency tracking instead, since that would greatly simplify this.

Let's migrate from Pipenv to Poetry and get rid of the setup.py file entirely.

Change max-line-length to 100

Currently the max-line-length in our tox.ini is set to 150. Let's reduce it to something sane, like 100.

Some code may need to be updated for this to pass linting.

Set up docker-compose for local testing

The docker-compose should contain the images for stuff like Redis, Mongo and Postgres so that we can test this application locally. Even better would be if we could automatically set up these databases with some data, so that we can test getting some actual data.
