
cpho-phase2's People

Contributors

alexcleduc, andguo, dependabot[bot], github-actions[bot], kdompaul, kingbain, lilakelland, msarar, najsaqib, samiatoui, simardeep1792, sleepycat, stephen-oneil, szeckirjr, tomcdona, vedantthapa, vickiszhang


cpho-phase2's Issues

speed up tests

The conftest is super slow, which makes TDD really annoying. Maybe we shouldn't create all of those users.
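Assuming the slowness really is the bulk user creation, one cheap win is building those shared users once and reusing them across tests. A pytest-independent sketch of the caching idea (with pytest this would be a session-scoped fixture instead):

```python
import functools


@functools.lru_cache(maxsize=None)
def expensive_shared_users():
    # Stand-in for the slow user creation currently done in conftest;
    # cached so repeated calls (i.e. repeated tests) pay the cost once.
    return tuple("user-%d" % i for i in range(10))
```

With pytest, the equivalent is decorating the fixture with `@pytest.fixture(scope="session")` so the user setup runs once per test session rather than per test.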

Pre-prod (dev/test) environment

There's going to be a need for a dev/test environment to (user) test new features in a prod-like environment.

GitOps wise, I'd suggest having the main branch syncing to the dev/test environment and a separate prod branch for the prod state. @AlexCLeduc any considerations from your perspective?

The dev/test environment would ideally use fake data, re-seeded periodically (on every deploy maybe). That'd put the onus on the devs to maintain seeding scripts or fake data factories, so that's also @AlexCLeduc's call. Could also just leave the manual management of the dev/test environment's database to the devs.

On the DevOps side, we'll need to figure out how best to enable this. There's a lot about the k8s side of this that isn't obvious to me, so if you have a good idea, chime in @simardeep1792 @vedantthapa!

Age group records shouldn't be deleted

Currently we rely on the formset logic to delete age groups that get checked as "deleted". Right now it actually deletes records and all of their versions. We need to override this behaviour, and probably modify the data-model, to keep the versions around.
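A sketch of one way to do the override: Django's `BaseModelFormSet.delete_existing` hook is the natural seam for swapping hard deletes for a soft-delete flag. This assumes a new `deleted` field on the model, and is shown here against a stand-in object rather than the real formset:

```python
class SoftDeleteFormSetMixin:
    # Mix into the age-group formset (e.g. alongside BaseInlineFormSet) so
    # "deleted" rows are flagged rather than removed, keeping versions intact.
    def delete_existing(self, obj, commit=True):
        obj.deleted = True  # assumes a new `deleted` field on the model
        if commit:
            obj.save()


class _StubRecord:
    # Minimal stand-in for an age-group record, for illustration only.
    deleted = False
    saved = False

    def save(self):
        self.saved = True


record = _StubRecord()
SoftDeleteFormSetMixin().delete_existing(record)
```

Any queries that list age groups would then need to filter out `deleted=True` rows, which is the usual cost of the soft-delete approach.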

Scripts and fixtures as migrations

Longer term we need to be using the migration engine to push "data updates" (aka scripts and fixture updates).

The official way to do this is to use data-migrations, but those involve writing scripts in a very rigid fashion. These are only supposed to access "serializable" parts of models, not methods or utility functions that may move around.

Using official "data-migrations" is realistic for very simple operations like creating groups or small groups of lookup-records, but it is cumbersome for larger fixtures and complex scripts.

Another issue with these data-migrations is that they would also run in tests, which we certainly don't want in every case.

However, migrations are ideal because migrations and scripts often depend on one another, and migrations are already built around a dependency queue.

I'd like to propose subclassing the django migration class to create custom behaviour.

  • Our subclass would specify in which environments it should run (test, localdev, prod)
  • If it gets run in a non-applicable environment, it does nothing, so that the migration still gets recorded in django's migration table
  • If a migration class gets run in tests or localdev, its dependencies should be considered permanent parts of the repo
  • Migrations can still run arbitrary python, so we should be able to keep using fixtures, e.g. call_command("loaddata", "fixture.json")
    • Note that this can cause problems if we add columns and the migration is run before those columns are added! In this case, we could go and "invalidate" the old migration by wiping out its code and create a new migration.
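A rough sketch of the gating logic described above, in pure python (a real version would subclass django.db.migrations.Migration; the environment names and the APP_ENV variable are assumptions):

```python
import os


class EnvironmentGatedMigration:
    # Environments this migration's operations actually run in; elsewhere
    # it no-ops but still gets recorded in django's migration table.
    run_in = ("prod", "localdev", "test")

    def apply(self, env=None):
        env = env or os.environ.get("APP_ENV", "localdev")
        if env not in self.run_in:
            return "recorded-but-skipped"
        return self.run_operations()

    def run_operations(self):
        # In the real subclass this would defer to django's operation
        # runner, e.g. a RunPython calling call_command("loaddata", ...).
        return "applied"


class ProdOnlySeed(EnvironmentGatedMigration):
    # Example: a fixture load that should only touch the prod database.
    run_in = ("prod",)
```

The "recorded-but-skipped" path is what lets the migration history stay consistent across environments even when the operations themselves are environment-specific.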

Note that this still doesn't cover everything, and we'll need manual prod DB access for exceptions. For instance, creating the first user can't be done through the UI, and we probably don't want people's emails in this github repo.

Load test the k8s configuration

Doesn't need to be very fancy or too realistic, although if you pick a solution that can manage some automation then I'd recommend talking to @AlexCLeduc and hashing out some test credentials and basic workflows to properly put the app nodes through their paces.

Ideally pick a tool where we can re-run the load tests whenever we want, as we'll be needing load testing at least until the configuration is fully stable (and even then it could be useful to the devs if the app requirements include heavier routes down the line, etc).

We want to do this soon to put the experimental-ish 1-to-1 app node to django process configuration through its paces. It works when it's just me poking around the test deployment, but it's definitely suboptimal for higher traffic volumes. Adding some sort of request counting to the horizontal scaling and making sure the load balancing is smart could be necessary, but we should see how bad it is to start with before going down the rabbit hole. Compare it to nodes with more resources, each running multiple django processes via gunicorn.

Consider an even more minimal run-time container image

From here:

While the current image is already using python-slim, it'd be good to reduce the attack surface further with some kind of minimalist docker image. While Alpine linux is a popular choice, it doesn't go quite as far as the Distroless images do, and historically doesn't play nice with Python which seems to need Glibc to work properly.

Chainguard's Python image seems pretty ideal to use as a base image for our Django setup, taking the ideas of distroless even further.

We can explore/verify the contents of a docker image with dive.

Going slimmer might be marginal at this point, since we're already on python-slim, but it might be worth a look at some point. Slimmer means better security and better cold start times. The only downside is if the image ends up very different from dev or harder to debug.
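For reference, a sketch of what a Chainguard-based multi-stage build might look like (the tags, the venv layout, and the gunicorn entrypoint module are assumptions, not tested against this repo):

```dockerfile
# Build stage: the -dev variant has pip and a shell available
FROM cgr.dev/chainguard/python:latest-dev AS builder
WORKDIR /app
RUN python -m venv /venv
COPY requirements.txt .
RUN /venv/bin/pip install --no-cache-dir -r requirements.txt

# Runtime stage: minimal image, no shell or package manager
FROM cgr.dev/chainguard/python:latest
WORKDIR /app
COPY --from=builder /venv /venv
COPY . .
# "server.wsgi" is a placeholder for the project's actual wsgi module
ENTRYPOINT ["/venv/bin/gunicorn", "server.wsgi", "--bind", "0.0.0.0:8080"]
```

The lack of a shell in the runtime stage is exactly the "harder to debug" trade-off mentioned above; dive can at least confirm what did and didn't make it into each layer.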

Create a PHAC signature component

CPHO needs an agency signature component that works for responsive web applications and dark mode.
Similar to the Wordmark component from a72c45c, this component should be able to display differently based on props to allow it to be used both in CPHO and lots of other contexts. Behaviour we're looking for:

  • show just the flag
  • show flag plus one language (FR/EN)
  • show flag plus both languages (in the correct order, determined by the current language)
  • able to support monochrome usage
  • alt text that matches whatever is displayed

As always, an optimised svg, tests, and a solid accessibility story are key! Inspirational source material can be found here and in gcui.
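To make the ordering behaviour concrete, here's a hypothetical helper capturing the prop logic described above (the function name and mode values are assumptions for illustration, not an existing API):

```python
def signature_languages(current_lang, mode="both"):
    # Which language labels to render beside the flag, in display order.
    #   "flag": flag only
    #   "one":  flag plus the current language
    #   "both": flag plus both languages, current language first
    if mode == "flag":
        return []
    other = "fr" if current_lang == "en" else "en"
    if mode == "one":
        return [current_lang]
    return [current_lang, other]
```

The alt text requirement would then follow the same switch: whatever this returns is what the alt text needs to describe.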

Add PHAC signature/wordmark

See #39 which was working on this for the old React frontend. Presume that this is still needed in the new Django monolith app so recover that work.

I assume all of the Django apps around here will need this, so maybe it should be a templatetag in django-phac_aspc-helpers.

Consider adding some Metrics to our OpenTelemetry instrumentation

Use opentelemetry-exporter-gcp-monitoring for exporting metrics directly to GCP Cloud Trace. TBC, do the predefined aggregation types require a collector, or can the GCP Cloud Trace backend perform aggregation for them?

We may want a Cloud Run sidecar to run an OpenTelemetry Collector, assuming we want metrics with aggregation logic, which is the super power of OpenTelemetry metrics. See the google docs here and here.

If we don't need a collector, this is pretty simple. If/when we have metrics that need the side car, then this becomes more complicated (and bloats up our infrastructure a bit).
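If the sidecar route does turn out to be needed, a minimal collector config might look like this (a sketch; the `googlecloud` exporter ships with the opentelemetry-collector-contrib distribution, and all values are starting-point assumptions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:  # batching is where collector-side aggregation logic would hang off

exporters:
  googlecloud:

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlecloud]
```

The app would then export OTLP to the sidecar instead of talking to GCP directly, which is the "bloats up our infrastructure a bit" part.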

Infrastructure as Code/Data for non-k8s infrastructure

TODO:

  • determine choice of IAC/IAD tooling. KCC?
    • don't have to get it right the first time, just pick one that everyone at least agrees on in theory
    • going with config connector
  • convert the remaining infrastructure represented by deploy/gcloud_init_setup.sh
    • DNS managed zone
    • Artifact registry for app server images
    • Cloud storage for test coverage reports
    • Cloud build IAM (read/write to the test coverage bucket)
  • GitHub trigger, although that's tricky. The repository connection needs to be made first, and may not be something we can automate
    • Cloud trace (well, all we need is for the API to be enabled, but this needs to be captured for cold starts in new GCP projects)
  • Uptime monitoring. This one's already in Pulumi; we might be able to generate the k8s yaml, although note the caveats
  • try to identify and capture any other pieces of necessary infrastructure which were click-ops-ed into the current project

Add linting to the server

Adding linting to projects is a great way to catch errors and ensure the code quality stays high.
Ruff is a linter for python that is written in Rust and extremely fast.
Adding it to the server (as a pdm script initially) would be useful both during development (where tools like Ale or a VSCode extension surface linting errors as you type), and also in the CI pipeline that we'll be creating shortly.
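As a starting point, the pdm script could be as small as this (a sketch; the script name and target path are assumptions):

```toml
# pyproject.toml
[tool.pdm.scripts]
lint = "ruff check ."
```

Then `pdm run lint` works locally, and the same command can be reused verbatim in the CI step.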

Automatically run server migrations

We need to figure out a way of automatically running migrations for the server.

There is some existing code from an initial attempt in the deployment for the server. The thinking was to use an initContainer, but it wasn't working for some reason. Historically istio's sidecars can conflict with running migrations from an initContainer, but the exact cause isn't clear.

Totally open to other solutions like jobs here too.
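If the initContainer route stays blocked, a plain Job is one alternative. A sketch (the image and names are placeholders), with istio injection disabled so the sidecar can't keep the Job from completing:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: django-migrate
spec:
  template:
    metadata:
      annotations:
        # keep the istio sidecar out of the migration pod
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: SERVER_IMAGE  # placeholder for the app server image
          command: ["python", "manage.py", "migrate"]
```

The remaining question with Jobs is sequencing: the deployment rollout would need to wait on the Job, which is where the initContainer approach was more convenient.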

Add coverage reports for templates.

This issue came out of PR #73 (Testing as a Cloud Build step and code coverage reporting.)

Stephen made this comment "...there's a coverage plugin called django_coverage_plugin we can add to possibly get coverage reports on django templates as well. Note, the example I saw this in was using django's built in templates, while this project is using jinja for templating. Haven't checked if this specific plugin can also track coverage for jinja files, or if some jinja alternative exists."

Configure Istio HTTP health checks

As per these docs, we can configure HTTP-based health checks via Istio for readiness/liveness of our nodes. The app provides a simple health check route at /healthcheck; use that. Tune the initial delay and period; the delay will probably be relatively stable going forward, but the appropriate period may depend a lot on whether we stick with low-resource, single-django-process nodes (related: #133).
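As a starting point, a sketch of the container probes (the port and all timings are placeholders to tune; with istio sidecars, probe rewriting may also need to be enabled as per the docs above):

```yaml
readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
```

Keeping the liveness period longer than readiness is the usual conservative default, since failed liveness checks restart the pod rather than just pulling it out of rotation.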

Configure k8s postgres to autoscale storage capacity

Pgaudit logs are stored to the DB, so from those alone we can expect it to fill up gradually.

Failing a good option for configuring automatic storage scaling, we should look into setting up warnings for when disk space starts running out.

Production DB management mechanism & playbook

We need to keep the production database fenced off and assure its security and integrity, but we'll also need a mechanism/playbook for executing data management scripts and (non-trivial) migrations against it. Acceptance criteria will mean something that can pass a protected-B level security assessment while also being sufficiently flexible and confidence-inspiring for devs.

The GitOps + zero trust ideal would be to carefully plan out django migrations, and to use data migrations instead of scripts, so that they can be run hands-off in the cloud as part of the project's CD infrastructure. This no-one-touches-the-database approach theoretically breezes past a prot-B assessment (as much as anything can breeze past the gauntlet).

Fully hands-off DB management is less flexible and seems to not inspire developer confidence. To my understanding, lack of confidence in the approach is in large part because no hands-on-keyboard means no ability to steer the ship when things go sideways.

Personally, I'd like to try and work out the hands-free approach, but I also think some sort of break-glass direct access system for emergencies is sensible. Of course, that means doing all the work and maintenance for both (although it'd be easier to get break-glass console access past security than saying it will be the default management method).

Alpha domain name and DNS infrastructure

We probably want a new domain to reflect the HoPiC brand, rather than the existing CPHO name. This will just be for the alpha deployments, but we should start answering some related bilingualism questions now.

Do we want one bilingual acronym-based name, or one plain-language domain name per official language? The former is the standard, but isn't great, while the latter will require slightly more bookkeeping and might put us at odds with policy enforcers. It'll be the business owner's call ultimately, although we could do either or both while still in the dev/alpha stage.

As part of this issue, update the relevant steps in gcloud_init_setup.sh, the ALLOWED_HOSTS configuration in the GCP Secret Manager prod env vars, etc.

The domain name will be provisioned via https://github.com/PHACDataHub/dns

Consider CDN for static content

Something to look into down the road. Whitenoise is nice for simplicity and consistency between dev and prod, but having all static content requests hit Cloud Run isn't great. Cloud Armor should help block DDoS once we have that in place, so that part of the Whitenoise trade-off isn't a big concern. Still, a CDN would probably be a nice-to-have for performance, if the cache busting isn't too annoying to wrangle.

breadcrumb trail and back buttons

indicators > (indicator name) > period > stratifier > (edit/review)

Also add back buttons, but back buttons shouldn't be too redundant with the breadcrumb trail. Maybe just on form pages.

Upload coverage reports & make them more visible.

This issue comes out of PR #73 Testing as a Cloud Build step and code coverage reporting.

The test coverage report is currently printed deep in the cloud build logs. Let's pull this out and ideally have it printed back to GitHub, but saved to Google Cloud Storage is a good first step.

Link to comment.

Consider incorporating Identity Aware Proxy down the line

Since the (non-API) portions of this project are intended to be accessible only by internal users, and the application may eventually integrate with AD for SSO down the line, it will likely make sense to go one step further and slap an IAP along the network path for incoming requests.

This video might be a good starting point.

This is low-ish priority for now, might be worth waiting to see if the AD SSO pans out first.

Streamline project specific configurations as Kustomize patch

There are some "project specific" components that can be consolidated into a kustomize patch to house all configurations in a single file. This would make deployments easier for other gcp environments / projects.

For example, the project field in the cert-manager/issuers.yaml could be a kustomization patch.

cloudDNS:
  # The ID of the GCP project
  project: phx-01h4rr1468rj3v5k60b1vserd3
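A sketch of what the kustomization patch might look like (the issuer name, patch path, and project ID are placeholders to verify against cert-manager/issuers.yaml):

```yaml
# kustomization.yaml
patches:
  - target:
      kind: ClusterIssuer
      name: letsencrypt  # placeholder issuer name
    patch: |-
      - op: replace
        path: /spec/acme/solvers/0/dns01/cloudDNS/project
        value: YOUR_GCP_PROJECT_ID
```

With all such project-specific values gathered into one kustomization file, standing up a new gcp environment becomes a matter of editing that single file.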

Review the k8s logs

There are a lot of logs coming out of the k8s cluster.

  • some might indicate missing or bad configuration. Prioritize fixing those
    • there are info level logs that also seem to imply configuration issues (screenshot of an example log omitted)
  • others might just be pure noise with no applicable "fix". If possible, do something to silence those

System check warning re AxesModelBackend was renamed

Getting this warning: ?: (axes.W003) You do not have 'axes.backends.AxesStandaloneBackend' or a subclass in your settings.AUTHENTICATION_BACKENDS. HINT: AxesModelBackend was renamed to AxesStandaloneBackend in django-axes version 5.0. It appears when running each of:

python ./manage.py loaddata cpho/fixtures/periods.yaml
python ./manage.py loaddata cpho/fixtures/dimension_lookups.yaml
python ./manage.py runscript cpho.scripts.dev

multiple measures per indicator

Often an indicator has multiple things being measured, e.g. a relative measure and an absolute one. This sounds tricky; we may have to add a new model.

A cheap alternative is to force people to create one indicator per measure. If the only issue with this is cosmetic (e.g. indicator names) then we can create a very lightweight "indicator grouping" as a parent to indicator and rebrand indicators as "measures".

Add step to enable Postgres audit logs in gcloud_init_setup.sh

As described here. Setting/updating database flags is easy, but you also need to run CREATE EXTENSION pgaudit; in the database. So we're back to needing a good script-able way to connect to the prod DB or a good strategy for DB script Cloud Run jobs.

Could be Cloud SQL Auth Proxy, but that's an additional on-machine dependency and might be difficult to integrate smoothly in the script. It also can't currently connect to the DB; either the dev machine needs to be able to connect to the VPC to use the private IP, or we need to temporarily enable a public IP before connecting via Cloud SQL Proxy. Hm. See #64.

Much less hacky would be to make a one-time DB initialization Cloud Job that we kick off from the init script, although I know there's still a desire to have a way to get a hands-on DB shell, at least during this pre-prod stage.

... see also these limitations and warnings. Could need some caution and careful configuration (especially if we want to use this on a busier application in the future).
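For reference, the two halves of the setup described above might look something like this (a sketch; the instance name is a placeholder, and the flags should be checked against the Cloud SQL pgaudit docs):

```shell
# 1. Set the database flags on the instance
gcloud sql instances patch INSTANCE_NAME \
  --database-flags=cloudsql.enable_pgaudit=on,pgaudit.log=all

# 2. Then, from a psql session connected to the database:
#      CREATE EXTENSION pgaudit;
```

Step 2 is exactly the part that needs the script-able DB connection or Cloud Run job discussed above.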

Approvals and submission flow - data-model and UI

Only records that have been approved by both the program and the HSO can be published via the API. Records can still be modified after approval, but we don't want to publish edits that haven't been approved.

How do we do this? By only serving approved versions in the API.

We can add is_program_approved and is_hso_approved boolean fields on the indicator-datum VERSION model. There's already an approved flag on the abstract ApprovableCustomVersionModelWithEditor, we just need to split up that field into 2.

Also, for metadata purposes, we can have a Submission model that just notes who approved something and when. This submission model will also contain a type=HSO|Program choice field.

For the user interface, one idea is to have a POST endpoint (triggered by a submit button) that scopes an (indicator, period) pair. This endpoint triggers a services.py function that iterates over all the latest versions of indicator-data within that scope and sets is_program_approved=True. Similarly, another endpoint/service/button for the HSO approval. These services will also create a Submission record.

Note that if someone has already submitted, but wants to make a correction to a single record and clicks "re-submit", the service will iterate over versions it has previously approved. This isn't an issue or anything, just thought I'd point it out.
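A rough sketch of the service shape, using plain python with stand-in version objects (real code would query the latest indicator-datum versions for the (indicator, period) scope and save through the ORM):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class _Version:
    # Stand-in for an indicator-datum version row.
    is_program_approved: bool = False
    is_hso_approved: bool = False


@dataclass
class Submission:
    # Metadata record: who approved, as which role, and when.
    submission_type: str  # "program" or "hso"
    submitted_by: str
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def approve_for_scope(latest_versions, submission_type, user):
    # Flag every latest version in the (indicator, period) scope,
    # then record who/when in a Submission.
    flag = f"is_{submission_type}_approved"
    for version in latest_versions:
        setattr(version, flag, True)
    return Submission(submission_type=submission_type, submitted_by=user)


scope_versions = [_Version(), _Version()]
submission = approve_for_scope(scope_versions, "program", "example-user")
```

The re-submit case falls out naturally: setting an already-True flag back to True is a harmless no-op, matching the note above.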

Add linting and formatting steps to Cloud Build CI/CD

Maybe straightforward, unless there are a lot of outstanding lint/format errors to resolve.

Should run black, isort, and djlint. Will need a container with requirements_formatting.txt installed, could be done with the test image or even during the tests step.

Can be done right now, but will have interplay with #78.

Infrastructure as Code/Data

The current run-once gcloud_init_setup.sh script is better than undocumented click-ops, but it's not ideal (and won't be too useful when it's time to update the infrastructure).

@tcaky will be working on a generic Kubernetes Config Connector IaC for a more framework-agnostic Cloud Run + Cloud SQL infrastructure. We'll likely wait for that as our starting point, but this task for this issue is to help Keith out with that.

Alternatively, we could take time now to experiment with some other IaC/D solutions, but that's not a priority for now.
