Giter Club home page Giter Club logo

alphagov.govuk-datascrubber's Introduction

GOV.UK Data Scrubber

This tool is intended to run in GOV.UK production environments in AWS to create copies of backend databases with sensitive data removed. Such copies can then be imported into development environments.

Supported databases

  • MySQL:
    • whitehall
  • Postgres:
    • email-alert-api
    • publishing_api

Mechanism of operation

RDS Database Snapshots are the basis for all the operations of the scrubber. This means snapshots stored within the RDS service, and not in S3. See the RDS docs for more details on snapshots.

Given an RDS snapshot, either MySQL or Postgres, the scrubber:

  • Spins up a temporary RDS DB Instance from that snapshot
  • Connects to it
  • Discovers the databases inside
  • Matches the discovered list against the internal list of supported databases
  • Executes database-specific code to scrub any databases matched
  • Shuts down the temporary instance, creating a "final snapshot"
  • Optionally shares the final snapshot with other AWS accounts

If any errors are encountered during the scrub process, the final snapshot creation and sharing is aborted, so as to guard against accidental leakage of sensitive data.

The RDS snapshot to start from can be specified in one of three ways:

  • Explicitly, with its RDS snapshot identifier
  • Implicitly, by taking the latest available automated snapshot of a given RDS database identifier
  • Implicitly, by taking the latest available automated shapshot of a given RDS database endpoint

Automated snapshots are daily backups that are taken from all DB instances by default by RDS.

Parallel operation

Spinning up RDS instances, running the SQL, creating the final snapshot, and destroying the instances are all fairly time-consuming operations, although they are not CPU intensive (at least, not on the machine running the scrubber).

Where multiple instances are required, e.g. one for MySQL and one for Postgres, a worker thread is spawned per instance.

Usage

To run parallel scrub tasks against mysql-primary and postgresql-primary:

$ datascrubber --mysql-hosts mysql-primary --postgresql-hosts postgresql-primary

Sharing scrubbed snapshots with another AWS account:

$ datascrubber --mysql-hosts mysql-primary --share-with 123456789

Logging to syslog:

$ datascrubber --mysql-hosts mysql-primary --log-mode syslog

Note that if STDOUT is not a TTY (e.g. if being run by Cron) logging to syslog is the default behaviour.

Increasing the log level:

$ datascrubber --mysql-hosts mysql-primary --log-level DEBUG

This will increase the verbosity of Boto3 logging too.

Prerequisites, development, deployment

The dependencies are Python 3 and the libraries listed in the requirements.txt.

Docker is required to build the Debian package.

At run time, AWS credentials are required. The Boto3 library is used for all AWS interactions, and AWS credentials are discovered in exactly the same manner as the AWS CLI.

virtualenv

Use virtualenv to work on the code locally:

virtualenv --python=python3 .env
source .env/bin/activate
pip3 install -r requirements.txt

Supporting new databases

Database scrub tasks are defined in the datascrubber/tasks/ directory. See existing code for examples of each step:

  • Create a new Python script in that directory, containing a function that takes a database cursor as an argument.
  • Add an import statement for it into datascrubber/tasks/__init__.py.
  • Add the function to the self.scrub_functions dict in either the MySQL or or the Postgres task manager class (see datascrubber/task_managers/.

Build process

To build the Debian package:

$ make deb

alphagov.govuk-datascrubber's People

Contributors

dougneal avatar issyl0 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.