
blackbox's Issues

Configuration: Allow databases to specify storage and notifiers to use.

Rationale

So, a user might have two databases that should not be backed up onto the same storage providers, where one of them should send notifications to Discord while the other should go to Slack. How can we accommodate this?

One solution would be a more complex config file, but I want us to try to avoid this because it introduces additional user experience complexity to all users, even though this feature will only appeal to a few users. This is a bad UX investment.

Instead, let's handle it with optional connstring parameters!

Implementation

Allow all database connstrings to take two extra parameters, storage_providers and notifiers.

For example, redis://host:port?storage_providers=s3,dropbox&notifiers=slack would specify that the redis database should be backed up onto S3 and Dropbox, and then notified to Slack.

These are strictly optional. By default it will use all notifiers and all storage providers.

Multiple values can be provided, and are comma-separated.
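
Rough sketch of how the parsing could work with just the standard library (the function name and return shape are hypothetical, not the real blackbox API):

from urllib.parse import parse_qs, urlsplit

def get_targets(connstring: str) -> tuple[list[str], list[str]]:
    """Extract the optional storage_providers and notifiers params from a connstring."""
    query = parse_qs(urlsplit(connstring).query)
    # Comma-separated values; an empty list means "use all handlers of that kind".
    storage = query["storage_providers"][0].split(",") if "storage_providers" in query else []
    notifiers = query["notifiers"][0].split(",") if "notifiers" in query else []
    return storage, notifiers

# get_targets("redis://host:6379?storage_providers=s3,dropbox&notifiers=slack")
# -> (["s3", "dropbox"], ["slack"])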

But what if s3 is ambiguous?

In #15, we propose implementing support for multiple handlers of the same kind. If the user has two S3 storage providers, and wants to only select one of them, we will need some way of identifying which one it is.

  • If we have multiple providers of type s3, use all of them if a database specifies storage_providers=s3.
  • Allow a user to specify an ID for notifiers and providers to disambiguate. For example, s3://my.bucket.com?id=main_bucket will associate this ID with the bucket.
  • The database can now use this ID to select its target, e.g. storage_providers=dropbox,main_bucket
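
As a sketch, the selection could check the optional id first and fall back to the handler type (the provider attributes here are hypothetical):

def select_providers(targets: list[str], providers: list) -> list:
    """Keep every provider whose id or handler type was named in storage_providers."""
    selected = []
    for provider in providers:
        # provider.id and provider.handler_type are hypothetical attributes.
        if provider.id in targets or provider.handler_type in targets:
            selected.append(provider)
    return selected

# With two s3 providers and targets=["s3"], both are kept; with targets=["dropbox", "main_bucket"],
# only the dropbox provider and the bucket registered with id=main_bucket are kept.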

Package this app for PyPI

Let's get this application on PyPI.

The end result should be that when this application is pip installed, you get a blackbox command that does essentially the same thing python main.py currently does.

That greatly simplifies the work needed to get this working locally:

  • Install blackbox with pip install blackbox
  • Set up a cron job that runs blackbox however often you want.

It also means we can add additional CLI utilities, such as one for setting up the cron job, maybe an interactive configuration tool, or whatever else.

Okay, don't get carried away. What's this issue?

Yeah, so this issue is just that we want some way of installing this so that running blackbox will run python main.py. That's it. Probably involves creating a setup.py file and maybe looking into doing this in a nice PEP 518 compliant and future-proof way, maybe with a pyproject.toml?
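
A minimal sketch of what the setup.py entry point could look like (the module path blackbox.cli:cli is an assumption about where main.py's logic ends up):

from setuptools import find_packages, setup

setup(
    name="blackbox",
    packages=find_packages(),
    install_requires=[],  # to be filled in from the Pipfile
    entry_points={
        # Makes `blackbox` on the command line call the same code as `python main.py`.
        "console_scripts": ["blackbox = blackbox.cli:cli"],
    },
)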

This issue is solved when you can use pip install -e . to install blackbox - we'll handle CI and stuff in a separate issue.

Basic unit testing

Right now I've got some extremely lazy testing inside the model classes themselves, and we'll want to move this into actual unit tests sooner or later.

I'd like to keep testing as simple as possible for this project, and I'm thinking this might be a good fit for pytest with no coverage requirements, where we only test the essentials:

  • Can all the classes be instantiated?
  • Does the config system work?
  • Can we use the application as intended?
  • Are the utils working?

Tests fail on S3() instantiation

Pytest results:

tests/test_storage.py::test_s3_handler_can_be_instantiated FAILED                                                [100%]

====================================================== FAILURES =======================================================
_________________________________________ test_s3_handler_can_be_instantiated _________________________________________

config_file = None

    def test_s3_handler_can_be_instantiated(config_file):
        """Test if the GoogleDrive storage handler can be instantiated."""
        from blackbox.handlers.storage import S3
>       S3()

tests\test_storage.py:11:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
blackbox\handlers\storage\s3.py:48: in __init__
    self.client = boto3.client(
.venv\lib\site-packages\boto3\__init__.py:93: in client
    return _get_default_session().client(*args, **kwargs)
.venv\lib\site-packages\boto3\session.py:258: in client
    return self._session.create_client(
.venv\lib\site-packages\botocore\session.py:827: in create_client
    endpoint_resolver = self._get_internal_component('endpoint_resolver')
.venv\lib\site-packages\botocore\session.py:700: in _get_internal_component
    return self._internal_components.get_component(name)
.venv\lib\site-packages\botocore\session.py:924: in get_component
    self._components[name] = factory()
.venv\lib\site-packages\botocore\session.py:163: in create_default_resolver
    endpoints = loader.load_data('endpoints')
.venv\lib\site-packages\botocore\loaders.py:132: in _wrapper
    data = func(self, *args, **kwargs)
.venv\lib\site-packages\botocore\loaders.py:420: in load_data
    found = self.file_loader.load_file(possible_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <botocore.loaders.JSONFileLoader object at 0x00000236ADF396D0>
file_path = .venv\\lib\\site-packages\\botocore\\data\\endpoints'

    def load_file(self, file_path):
        """Attempt to load the file path.

        :type file_path: str
        :param file_path: The full path to the file to load without
            the '.json' extension.

        :return: The loaded data if it exists, otherwise None.

        """
        full_path = file_path + '.json'
        if not os.path.isfile(full_path):
            return

        # By default the file will be opened with locale encoding on Python 3.
        # We specify "utf8" here to ensure the correct behavior.
        with open(full_path, 'rb') as fp:
>           payload = fp.read().decode('utf-8')
E           AttributeError: 'str' object has no attribute 'decode'

.venv\lib\site-packages\botocore\loaders.py:173: AttributeError

Support multiple handlers of the same kind

Right now, we can only support one of each handler type - but what if you have two postgres databases in completely different places?

Let's find some way to dynamically instantiate one handler per connstring. For example, we could move the connstring parser out of the mixin, and then instantiate the handler with a factory method where we pass in the connstring. This would probably be tidier and less magical, anyway.
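
Sketch of the factory idea (class and method names are hypothetical, not the existing handler API):

class BlackboxDatabase:
    """Hypothetical base handler; the real ones live under blackbox/handlers."""

    def __init__(self, connstring: str):
        # The connstring parser would move out of the mixin and run here, per instance.
        self.connstring = connstring

    @classmethod
    def from_connstrings(cls, connstrings: list[str]) -> list["BlackboxDatabase"]:
        """Instantiate one handler per configured connstring."""
        return [cls(connstring) for connstring in connstrings]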

Dynamic multi-stage notifiers

Currently, our notifiers only do a single thing - they notify when a job is done, and they do this at the end of the job.

For some notifiers, this is fine. For example, if we add an Email notifier, it makes sense to just send a single mail at the end of the job. However, for notifiers like Discord, we can do better than this. We can have the notifier dynamically update its message as the job progresses!

How would this look?

When blackbox starts up, the notifier sends its first message. It lists all the databases and all the storage methods, but shows all as pending, using 🟠 as the emoji.

(screenshot: the initial notifier message, with everything pending 🟠)

Next, some of these will start, and the emoji changes to ♻️ (for collecting backup). When the backup has been collected and it starts uploading, it changes to ⬆️. Finally, it changes to ✅ when the process is complete. If it fails at any point, it changes to ⛔.

(screenshots: the same message updating as the statuses change)

Yeah but, how do we do this?

I'm going to leave that up to you, but the basic idea will be something like this:

  • Instead of notifiers being called at the end of cli.py, they should probably be passed into the database and storage handler objects, and then called from inside the relevant methods.
  • We should support multiple simultaneous notifiers.
  • We need some way of tracking more granular states, and more emojis to correspond with each state.

I'm intentionally leaving the implementation details a bit vague, because it'll probably be easier if you have a bit of leeway on how to implement, and because it's an interesting challenge. Do chat with me on the lemonsaurus Discord if you'd like to discuss ideas for implementation, though.
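
One possible shape for the granular states, just as a sketch (the emoji come from the description above; the update call at the bottom is a hypothetical method):

from enum import Enum


class BackupState(Enum):
    """Granular progress states, mapped to the emoji shown in the notifier message."""

    PENDING = "🟠"
    COLLECTING = "♻️"
    UPLOADING = "⬆️"
    DONE = "✅"
    FAILED = "⛔"


# A notifier could then expose something like:
# notifier.update(database="postgres", state=BackupState.UPLOADING)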

Add support for integer port values (Postgres env)

Trace:
blackbox | Traceback (most recent call last):
blackbox | File "/usr/local/bin/blackbox", line 33, in
blackbox | sys.exit(load_entry_point('blackbox-cli', 'console_scripts', 'blackbox')())
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 829, in call
blackbox | return self.main(*args, **kwargs)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 782, in main
blackbox | rv = self.invoke(ctx)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
blackbox | return ctx.invoke(self.callback, **ctx.params)
blackbox | File "/usr/local/lib/python3.9/site-packages/click/core.py", line 610, in invoke
blackbox | return callback(*args, **kwargs)
blackbox | File "/blackbox/blackbox/cli.py", line 136, in cli
blackbox | success = run()
blackbox | File "/blackbox/blackbox/cli.py", line 37, in run
blackbox | backup_file = database.backup()
blackbox | File "/blackbox/blackbox/handlers/databases/postgres.py", line 20, in backup
blackbox | self.success, self.output = run_command(
blackbox | File "/blackbox/blackbox/utils/commands.py", line 27, in run_command
blackbox | result = subprocess.run(
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 505, in run
blackbox | with Popen(*popenargs, **kwargs) as process:
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 951, in init
blackbox | self._execute_child(args, executable, preexec_fn, close_fds,
blackbox | File "/usr/local/lib/python3.9/subprocess.py", line 1743, in _execute_child
blackbox | env_list.append(k + b'=' + os.fsencode(v))
blackbox | File "/usr/local/lib/python3.9/os.py", line 810, in fsencode
blackbox | filename = fspath(filename) # Does type-checking of filename.
blackbox | TypeError: expected str, bytes or os.PathLike object, not int
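
The crash happens because an integer port ends up as a value in the subprocess environment, and os.fsencode only accepts str, bytes or path-like values. A minimal sketch of the fix (the dict below just stands in for whatever the Postgres handler actually builds):

connection_env = {"PGHOST": "localhost", "PGPORT": 5432, "PGUSER": "blackbox"}  # hypothetical values

# Coerce every value to str before handing the mapping to subprocess.run(env=...),
# so a port configured as an integer no longer trips os.fsencode.
safe_env = {key: str(value) for key, value in connection_env.items()}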

Rotation: Don't delete irrelevant files

Currently when rotating, we'll just delete everything that's older than retention_days. This is problematic, because we may delete something that isn't a backup file.

Let's implement a solution where each handler has a regex that matches its output. For example, if the Postgres handler outputs a file like postgres-backup-2020-01-01.sql, then we should have a regex to match it that looks something like r"postgres-backup-\d{4}-\d{2}-\d{2}".

Now when we're doing the rotation, the rotate method can ensure that it only deletes files that match these expressions.
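
Sketch of what the rotate step could look like (the pattern registry and function signature are hypothetical):

import re
from datetime import datetime, timedelta
from pathlib import Path

# Hypothetical registry; in practice each handler would contribute its own pattern.
BACKUP_PATTERNS = [re.compile(r"postgres-backup-\d{4}-\d{2}-\d{2}")]


def rotate(folder: Path, retention_days: int) -> None:
    """Delete files older than retention_days, but only if a handler pattern matches them."""
    cutoff = datetime.now() - timedelta(days=retention_days)
    for file in folder.iterdir():
        is_backup = any(pattern.search(file.name) for pattern in BACKUP_PATTERNS)
        is_old = datetime.fromtimestamp(file.stat().st_mtime) < cutoff
        if is_backup and is_old:
            file.unlink()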

Lint the docstrings!

Let's lint docstrings.

We can use this plugin to do this.

Here are the rule ignores we need to add to tox.ini for this plugin:

# Missing Docstrings
D100,D104,D105,D107,
# Docstring Whitespace
D203,D212,D214,D215,
# Docstring Quotes
D301,D302,
# Docstring Content
D400,D401,D402,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414,D416,D417

We can include those comments in the tox.ini as well, and we still want to ignore whatever we're already ignoring.

Once this is done, you'll need to run pipenv run lint and fix all linting errors this creates.

commandline: path to config

Add commandline functionality to specify a path to a config

blackbox --config=path/to/config.yml

and of course have a sensible default
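
The CLI already appears to use click (see the traceback in the Postgres port issue above), so this could just be an option with a default; a rough sketch:

import click


@click.command()
@click.option("--config", default="config.yml", show_default=True,
              help="Path to the blackbox config file.")
def cli(config):
    """Hypothetical entry point; the real cli() lives in blackbox/cli.py."""
    click.echo(f"Loading config from {config}")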

Configuration: Allow configuration via environment variables

Right now, we require a config.yml file to contain all the secrets in order to configure the application. This is not always convenient, especially in container orchestration environments where secrets are managed through some external secrets manager.

Let's allow environment variable interpolation in the config.yml file in order to make this more flexible.

databases:
  - mongodb://{{ MONGO_USERNAME }}:{{ MONGO_PASSWORD }}@host:port
  - {{ POSTGRES_CONNECTION_STRING }}

Implementation

Let's use Jinja to parse the config file as a template, and inject the entire environment into the renderer.

Pseudocode:

config_text = Path("config.yml").read_text()
parsed_config = jinja2.Template(config_text).render(**os.environ)

Add encryption to the saved dumps

Abstract

Due to GDPR and security issues, support should be added for password and/or PGP encryption. This can be done through the pgpy library.

Rationale

While databases are usually encrypted, dumps aren’t, leaving the data at risk. To prevent that, it can be encrypted using a password.

This has its own flaws, as the password will have to be stored somewhere in cleartext. Asymmetric encryption using PGP avoids this: a public key is stored in the configuration file, and the developer keeps the private decryption key at home.

Specifications

A new configuration option can be added at the root level, or per individual storage provider, to select the encryption method to use and the password or ASCII-armored public key.

Data can be encrypted before uploading and the file (that will be made temporary by #88) will have to be securely erased.

Symmetric encryption can also be done through GPG, so there is a simple way to decrypt the file.
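
A rough sketch of the asymmetric path with pgpy (the key and dump values are placeholders, and the handler wiring is not shown):

import pgpy

armored_public_key = "..."  # placeholder: ASCII-armored public key from the config file
dump_bytes = b"..."         # placeholder: raw database dump contents

public_key, _ = pgpy.PGPKey.from_blob(armored_public_key)
message = pgpy.PGPMessage.new(dump_bytes)
encrypted = public_key.encrypt(message)  # only the holder of the private key can decrypt this
ciphertext = str(encrypted)              # ASCII-armored ciphertext to upload instead of the raw dump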

Publish Docker images with version numbers

Right now there is no way to fetch a specific semver from Docker Hub, as the only versions pushed are :latest and :{sha}.

Images should be pushed with the semvers as they are for PyPI:

(screenshot, 2021-03-26: the published PyPI release versions)

It's not ideal to pin to SHAs and pinning to :latest isn't a great idea for production, so tags along the lines of :2.0.0 or :2-latest (to fetch anything 2.X.X) would be nice.

Database: Zip folder archiving

Sometimes local folders outside of a database need to be archived. It would be quite useful to be able to provide a local folder (which could eventually be bind-mounted inside the Blackbox container) that gets zipped using the zipfile stdlib and backed up like any database dump. Compression level could also be added as an option.

I’d be interested in implementing this.
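
Sketch of the zipping itself with the zipfile stdlib (the function is hypothetical, not an existing handler):

import zipfile
from pathlib import Path


def archive_folder(folder: Path, archive_path: Path, compresslevel: int = 6) -> Path:
    """Zip every file under `folder` into `archive_path`, preserving relative paths."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=compresslevel) as zf:
        for file in folder.rglob("*"):
            if file.is_file():
                zf.write(file, arcname=file.relative_to(folder))
    return archive_path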

Notifier: Discord

We should support webhooks to various services whenever a backup completes.

For our purposes, the most obvious webhook is to Discord. When the entire backup process completes, we want to send a webhook with a status report for the whole job.

For example, a report might look like this:

Postgres: [ok]
Redis: [ok]
Mongo: [failed]
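
A Discord webhook only needs a plain HTTP POST, so the notifier could be as small as this sketch (the URL and report text are placeholders):

import requests

webhook_url = "https://discord.com/api/webhooks/..."  # placeholder: taken from the config

report = "Postgres: [ok]\nRedis: [ok]\nMongo: [failed]"

# Discord webhooks accept a JSON payload with a `content` field (embeds are also possible).
requests.post(webhook_url, json={"content": report})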

Database: MariaDB support

We'll need to add support for MariaDB, as this database is now going to be part of our stack at PythonDiscord.

Add pytest-testdox

Right now, when pytest shows test output, it just prints all the function names. This is ugly, and requires us to write long, verbose and descriptive function names, which is a silly way to document what a test does.

Instead, it would be better for us to put that information into a docstring, where we don't need to worry about linelength and other constraints.

Let's add the https://pypi.org/project/pytest-testdox/ tool to our toolchain, which outputs docstrings instead of function names when they exist, and otherwise strips the underscores from the function names. It makes the test report far more readable for humans.

We should add the --testdox option in the tox.ini file as well, so that we will use this option by default in all test runs.

S3 configuration XOR condition is incorrect

The S3 code says the following:

elif bool(key_id) ^ bool(secret_key):
    raise ImproperlyConfigured("You must configure either both or none of the AWS credential params.")

However if both the key_id and the secret_key are missing, this XOR evaluates to False.

Proof:

# Good
>>> True ^ True 
False  

# Bad: We want True here because both configs are missing
>>> False ^ False
False  

# Good
>>> True ^ False
True 

Instead I suggest either:

elif None in (key_id, secret_key):

or

elif not all([key_id, secret_key]):

Pass token errors to notifiers

Add a try/except for BadInputError

blackbox    | dropbox.exceptions.BadInputError: BadInputError('e9660aeb9d014230800f843faafde25a', 'Error in call to API function "files/list_folder": The given OAuth 2 access token is malformed.')
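
Rough sketch of catching the error in the Dropbox handler so it can be reported through the notifiers instead of crashing the run (the client setup and result fields are simplified assumptions):

import dropbox

try:
    client = dropbox.Dropbox("oauth2-access-token")  # placeholder token from the config
    client.files_list_folder("")                     # the call that currently raises BadInputError
except dropbox.exceptions.BadInputError as error:
    # Record the failure so the notifiers can include it in their report.
    success, output = False, str(error)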

Notifier: Slack

We've got Discord support, let's do Slack as well!

We should try to make it look the same as the Discord webhook:

(screenshot: the Discord webhook notification)

Use temporary files to save dumps

Rationale

Currently, dumps are saved under the home folder. If Blackbox exits unexpectedly during archiving, the file will be left there and not deleted. This can cause issues for both security and disk space.

Specification

tempfile.NamedTemporaryFile can be used to pass a temporary file handle directly to the database handler and have it retrieved by the main function. This will also allow us to have uniform file names in a DRY fashion.

Note: I’d be interested in implementing that
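
A minimal sketch of the tempfile usage (the suffix and contents are placeholders):

import tempfile

# With the default delete=True, the file is removed as soon as the handle is closed,
# so any exit that unwinds the `with` block (including exceptions) cleans up after itself.
with tempfile.NamedTemporaryFile(suffix=".sql") as dump_file:
    dump_file.write(b"-- placeholder dump contents --\n")
    dump_file.flush()
    # dump_file.name can be passed to the storage handlers for upload here.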

Logging handlers

We'll support just a single connstring for logging, and it will allow retrieving logs via arbitrary command execution.

For example:
logs://[user:password@host:port]?command="docker logs api"

  • If some combination of host, port, user and password is provided, we will use that information to SSH into a remote machine.
  • Then, we execute the command provided by the user.
  • The user must provide a command, otherwise we raise ImproperlyConfigured
  • If the host is localhost, we just execute the command locally.

Once we have our output, we store it as a text file, and return it. Then the storage handlers can sync them, just like the backups.
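
As a sketch, the handler could branch on the host; paramiko is one assumption for the SSH path (it isn't currently a dependency):

import shlex
import subprocess
from typing import Optional

import paramiko  # assumption: some SSH library is needed for the remote case


def fetch_logs(command: str, host: str = "localhost", port: int = 22,
               user: Optional[str] = None, password: Optional[str] = None) -> str:
    """Run the configured command locally, or over SSH when a remote host is given."""
    if host == "localhost":
        result = subprocess.run(shlex.split(command), capture_output=True, text=True)
        return result.stdout
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, port=port, username=user, password=password)
    _, stdout, _ = client.exec_command(command)
    return stdout.read().decode()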

Notifiers: add support for Telegram bots

Telegram has half a billion users around the world, and it is free and simple to use. Creating a bot is fast and easy, and the bot can then be added to groups so that several people can monitor the backup process. All we need is the bot API token and a chat/user id; the user is responsible for creating the bot, and blackbox uses it as a proxy to send them a message.
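
Sketch of the actual send, which is just an HTTPS call to the Bot API (the token and chat id are placeholders):

import requests

bot_token = "123456:replace-me"  # placeholder: token issued by @BotFather
chat_id = "123456789"            # placeholder: the user or group chat to notify

requests.post(
    f"https://api.telegram.org/bot{bot_token}/sendMessage",
    json={"chat_id": chat_id, "text": "Backup complete: Postgres [ok], Redis [ok]"},
)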

Configuration: Notifier frequency

Currently, the notifiers trigger every time blackbox is run. This is not necessarily convenient:

  • If someone wants backups every 10 minutes, it'll get really spammy. In this case, we only really care about the notifiers if they fail.
  • Kubernetes CronJobs are not guaranteed to run only once, so we may get duplicate notifications. The actual storage provider upload is idempotent (because it'll just overwrite the file if you run it twice the same day), but the notifiers are not.

So, it would be convenient to be able to configure this to be a bit less noisy.

Implementation

Let's implement a new config option, notifier_frequency, that tells us how often we should show success notifications. If notifier_frequency is set to 1 day, we only show a success notification once per day. Failure notifications should always be shown.

I'm intentionally not specifying what format the frequency duration should be specified in. A fun solution would be something that supported a timestring like 1d12H or something similar to that. We have examples of how to do this in https://github.com/python-discord/bot/blob/master/bot/utils/time.py. However, honestly, I'd be fine if this was just in minutes or something, and we could set it to 1440 for a day. Maybe that's simpler? Up to you.
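
If we do go with a timestring, a small parser is enough; this is only a sketch of the 1d12H-style format mentioned above, plus minutes:

import re
from datetime import timedelta


def parse_frequency(value: str) -> timedelta:
    """Parse a simple timestring like '1d12H' or '90m' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)H)?(?:(\d+)m)?", value)
    days, hours, minutes = (int(group) if group else 0 for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


# parse_frequency("1d12H") -> timedelta(days=1, hours=12)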

Support for multiple Postgres versions

So, here's the thing. I think we already support every major Postgres version after 8.0, because the Postgres documentation for pg_dump states that it should work just fine for older versions back to 8.0.

But, we should make sure, and if we do, we should document this.

How do we test it, then?

  • Make a docker-compose file with postgres containers going all the way back to version 8. We want a file with versions 13, 12, 11, 10, 9.6, and 8. This compose file will be used only for this test.
  • Write a test that tests a dump from each of these.
  • This test should be run in a new workflow (test_postgres) using the matrix strategy to test all versions concurrently.

How should it be documented?

The readme currently states that we specifically support Postgres 13. It should instead state that we support all major versions since 8.0, or whatever ends up being true.

Improve connstring parsers

Our connstring parsers are currently too simple. We should try to adhere to the connstring specs used by Postgres and Mongo - for example, every part of the connstring is optional, so postgres:// is a valid connstring, as is postgresql://user@host.

Let's make some improvements so that these specs are at least more or less followed. Let Postgres and Mongo provide the defaults if these are not provided.

Simplify config.yaml - connstrings everywhere!

We could just base the entire system on connstrings. For example, here's how our config might look:

databases:
- mongodb://username:password@host:port
- postgres://username:password@host:port

logging:
- ssh://username:password@host:port

storage:
- gdrive://username:token

rotation_days: 7

And based on this, it'll figure out what's enabled, what's disabled, and how to log into all these services. There's no need for enabled bools and environment variables and whatever. We'll just pass custom connstrings for every service and every type, and repurpose our connstring parser to work for any connstring.

When writing a generic connstring parser, we should try to adhere to the connstring specs used by Postgres and Mongo - for example, every part of the connstring is optional, so postgres:// is a valid connstring, as is postgresql://user@host. We should also support optional params in the parser.

Sanitize logging output

We should not - under any circumstances - be allowing logging output to include the interpolated config values, since this can include passwords and other high-security secrets that we don't want to send over webhooks and emails or whatever.

This is pretty easy to solve, though. Just go through the logging output and replace all config values with asterisks or something.
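
A sketch of the replacement step (the list of secret values would come from the parsed config):

def sanitize(log_output: str, secret_values: list) -> str:
    """Replace every known config value in the log output with asterisks."""
    for secret in secret_values:
        if secret:
            log_output = log_output.replace(str(secret), "*" * len(str(secret)))
    return log_output


# sanitize("connecting with password hunter2", ["hunter2"])
# -> "connecting with password *******"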

Move to Poetry

Right now we're using both Pipenv and setup.py, which feels bad. We're managing the dependencies in both Pipfile and setup.py, and the Dockerfile is relying on pip install -e . to install dependencies.

Basically this is a mess. We should be using PEP 517 compatible dependency tracking instead, since that would greatly simplify this.

Let's migrate from Pipenv to Poetry and get rid of the setup.py file entirely.

Change max-line-length to 100

Currently the max-line-length in our tox.ini is set to 150. Let's reduce it to something sane, like 100.

Some code may need to be updated for this to pass linting.

Set up docker-compose for local testing

The docker-compose should contain the images for stuff like Redis, Mongo and Postgres so that we can test this application locally. Even better would be if we could automatically set up these databases with some data, so that we can test getting some actual data.
