hubverse-org / hubverse-cloud
Test hub for S3 data submission and storage
License: MIT License
Probably doesn't need to be a high priority as we get started, but because we're hosting publicly-accessible S3 buckets, we should have AWS alert us to any spikes/high-volumes of data access activity (since we'll be paying for it).
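For instance, an estimated-charges alarm would cover this. A minimal sketch using boto3 (the alarm name, threshold, and SNS topic ARN are placeholders; AWS billing alerts must already be enabled for the account):

```python
# Minimal sketch: a CloudWatch billing alarm via boto3.
# Billing metrics are only published in us-east-1.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="hubverse-estimated-charges",  # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=50.0,  # placeholder USD threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:hubverse-alerts"],  # placeholder ARN
)
```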
Background
As part of exploring the possibility of AWS event notifications -> AWS Lambda function for converting incoming model-output data to parquet format (#52), there is some MVP code for the Python-based data transformation function. It's currently in a branch of the hubverse-infrastructure repo.
Work needed for this issue
Move the hubverse-transforms Python module from hubverse-infrastructure to its own repository so that it can be deployed as a Hubverse AWS Lambda function or used by non-hubverse-cloud hubs that want to transform their model-output data.
Definition of done:
- The hubverse-transforms module lives in its own repository in the Infectious-Disease-Modeling-Hubs GitHub organization
- The repository's main branch cannot be updated without a PR and requires successful checks

Create a designated Hubverse AWS account.
In current hubverse repos, the model-output data is organized like:
```
model-output/
├─ team1/
│  ├─ yyyy-mm-dd-team1.csv
│  ├─ yyyy-mm-dd-team1.csv
├─ team2/
│  ├─ yyyy-mm-dd-team2.csv
│  ├─ yyyy-mm-dd-team2.csv
```
The cloud work done to date mirrors this structure in AWS S3 buckets.
Our current hypothesis is that this file organization isn't aligned with how most people want to query the data. For example, when hubUtils connects to an S3 bucket and uses Arrow to present information, the performance of that operation depends on how well the query maps to how the data is physically stored.
Thus, the decision about how to organize the data once it's in the cloud is an important one.
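To illustrate the tradeoff, here's a minimal sketch using pyarrow to rewrite submitted CSVs into a hive-partitioned parquet layout; the partition column (origin_date) and destination path are hypothetical:

```python
# Minimal sketch: repartitioning model-output data with pyarrow so that
# queries filtering on a column only scan the matching files.
import pyarrow.dataset as ds

# Treat the submitted CSV files as one logical dataset
raw = ds.dataset("model-output/", format="csv")

# Rewrite as parquet, hive-partitioned by a commonly-filtered column
# (in practice the destination would be the hub's S3 bucket)
ds.write_dataset(
    raw,
    "model-output-partitioned/",
    format="parquet",
    partitioning=["origin_date"],  # hypothetical partition column
    partitioning_flavor="hive",
)
```

With a layout like this, an Arrow query filtering on origin_date can skip entire directories instead of scanning every team's files.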
As determined in #34, the native AWS s3 sync command is not a good fit for GitHub Actions, since it relies on a file's timestamp to determine whether it has changed since the last sync operation (when the GitHub action checks out code on the runner's virtual machine, all files get the current timestamp, so s3 sync considers them all updated).
Others in our situation have recommended rclone as an alternative. A quick local test is promising: rclone has a --checksum option that uses a hash function to determine file changes (instead of relying on file modification dates). I recreated the problem by checking out the hubverse-cloud repo to another local directory and running rclone as follows. Despite the more recent timestamps on the newly-cloned files, rclone didn't sync them.
```
rclone sync model-output/ s3-test:hubverse-cloud/raw/model-output/ --checksum --verbose

2024/03/07 10:46:54 INFO : There was nothing to transfer
2024/03/07 10:46:54 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 7 / 7, 100%
Elapsed time: 0.2s
```
Add a test suite to the Pulumi code in the hubverse-infrastructure repo so that we can make updates to it with a higher degree of confidence.
Definition of done:
- hubverse-infrastructure contains unit tests for the modules that create our S3 and IAM AWS resources
- hubverse-infrastructure has a GitHub workflow that runs the test suite

Some background info on Pulumi unit testing: https://github.com/Infectious-Disease-Modeling-Hubs/hubverse-infrastructure/tree/bsweger/add-model-output-transforms/hubverse-transforms
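For context, Pulumi's Python SDK lets us unit-test resource definitions without touching AWS by mocking the engine. A minimal sketch (the resource under test is a stand-in, not our actual module):

```python
# Minimal sketch: unit-testing Pulumi resources with runtime mocks.
import pulumi

class Mocks(pulumi.runtime.Mocks):
    def new_resource(self, args: pulumi.runtime.MockResourceArgs):
        # Return a fake ID and echo the inputs back as outputs
        return [args.name + "_id", args.inputs]

    def call(self, args: pulumi.runtime.MockCallArgs):
        return {}

pulumi.runtime.set_mocks(Mocks())

# Import resources only after mocks are set
import pulumi_aws as aws

@pulumi.runtime.test
def test_bucket_is_versioned():
    bucket = aws.s3.Bucket("hub-bucket", versioning={"enabled": True})

    def check(versioning):
        assert versioning["enabled"] is True

    return bucket.versioning.apply(check)
```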
Demonstrate that we can use GitHub Actions to push data to an S3 bucket.
The proof-of-concept experiments with S3 used an AWS OpenID Connect (OIDC) identity provider to grant permissions to GitHub Actions. That worked well, so let's replicate that approach in the new Hubverse AWS account.
The additional requirement here is using an "infrastructure as code" method to create this resource. Infrastructure as code is non-trivial, but laying this foundation is critical to the security of our AWS account and to managing the moving parts as new hubs onboard to the cloud.
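A rough sketch of what that could look like in Pulumi's Python SDK (the resource names, thumbprint, and repo filter below are placeholders, not our actual configuration):

```python
# Minimal sketch: an OIDC identity provider + IAM role that lets a GitHub
# Actions workflow assume AWS permissions without long-lived credentials.
import json
import pulumi_aws as aws

github_oidc = aws.iam.OpenIdConnectProvider(
    "github-actions",
    url="https://token.actions.githubusercontent.com",
    client_id_lists=["sts.amazonaws.com"],
    thumbprint_lists=["6938fd4d98bab03faadb97b34396831e3780aea1"],  # example value
)

assume_role_policy = github_oidc.arn.apply(lambda arn: json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": arn},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringLike": {
                # Only workflows in this repo may assume the role (placeholder)
                "token.actions.githubusercontent.com:sub": "repo:hubverse-org/hubverse-cloud:*"
            }
        },
    }],
}))

aws.iam.Role("hubverse-github-role", assume_role_policy=assume_role_policy)
```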
Add branch protections to the hubverse-infrastructure repo to prevent inadvertent updates to the main branch and to ensure that updates to hubverse infrastructure go through a review process.
Definition of done:
- Branch protections are enabled on the main branch

The current GitHub action we're using simply copies model-output data from its GitHub file structure into S3. Once we decide on the best organization/partitioning strategy for cloud-based model outputs (#11), add a step after the initial S3 sync to copy the data into the structure that will be accessed by our query tooling (e.g., hubUtils).
If we decide to proceed with Pulumi as a framework for managing hubverse cloud infrastructure, we need to ensure that the process is well-documented. We're assuming that the target audience(s) are people who develop Hubverse tooling (as opposed to Hubverse admins and users).
Definition of done:
Our GitHub action workflow to send data to S3 has been working well in our tests of small data volumes. However, it would be useful to get some rough estimates on timing with large volumes of data (one reason: @lmullany is working to convert some archived hubs to hubverse format, and it would be great to make that data available on S3)
There was some conversation here that resulted in the conclusion that we should not rely on GitHub CI actions for triggering the conversion of incoming model-output files to parquet format.
As a next step, I'd like to explore using S3 event notifications as a way to invoke actions when a model-output file is written to a hub's S3 bucket.
Specifically, these notifications:
At a high level, the idea is to invoke our prototype "transform model-output file to parquet" function automatically, whenever a model-output file is uploaded to S3 (this happens via GitHub action).
model submission PR merged -> model-output data syncs to S3 -> S3 "new object created" event triggers an AWS Lambda version of the "convert data to parquet" function
Definition of done:
- A new model-output file uploaded to s3://hubverse-cloud/raw/model-output is automatically converted to parquet and written to s3://hubverse-cloud/model-output
- The converted data can be queried via hubData
- Updating an existing file in s3://hubverse-cloud/raw/model-output triggers the same data conversion process as uploading a new file

The AWS resources for this will be created manually (i.e., no need to incorporate them into our infrastructure as code process unless we decide this solution will work for us).
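To make the event flow concrete, a minimal sketch of what the Lambda handler could look like (pandas-based; the raw/ -> model-output/ key convention follows the paths above, everything else is a placeholder):

```python
# Minimal sketch: an S3-triggered Lambda that converts an incoming
# model-output CSV to parquet. Error handling and schema enforcement
# are omitted; pandas reads/writes s3:// URIs via s3fs.
import urllib.parse

import pandas as pd

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # e.g., raw/model-output/team1/2024-03-07-team1.csv
        df = pd.read_csv(f"s3://{bucket}/{key}")

        # Land the parquet version at the user-facing location
        dest_key = key.replace("raw/", "", 1).rsplit(".", 1)[0] + ".parquet"
        df.to_parquet(f"s3://{bucket}/{dest_key}", index=False)
```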
In addition to deciding how we want to organize the data for hubs in the cloud, we also want to determine a default data format.
I'm not sure if we'd want to provide this as an admin-configurable option in the future, but it would be great to have an opinion re: a sensible default to get started.
Following the "demos not memos" principle, before making a decision about what will trigger model-output data conversions for cloud usage (#20), it might be helpful to see some preliminary code for these data conversions.
Create a prototype class/function in hubverse-cloud that performs these data conversions.
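A minimal sketch of what such a prototype could look like (the class and method names are hypothetical; it simply round-trips a submitted CSV to parquet):

```python
# Minimal sketch of a prototype model-output transform (hypothetical names).
from pathlib import Path

import pandas as pd

class ModelOutputTransformer:
    """Convert a submitted model-output file to parquet."""

    def transform(self, source: Path) -> Path:
        df = pd.read_csv(source)
        dest = source.with_suffix(".parquet")
        df.to_parquet(dest, index=False)
        return dest

# Usage:
# ModelOutputTransformer().transform(Path("model-output/team1/2024-03-07-team1.csv"))
```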
According to the docs, the AWS s3 sync command:
Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.
The "new and updated files" bit sounds like exactly what we want when pushing data to S3 after a hub PR is merged.
However, the GitHub workflow doing the sync has some confusing outputs when it runs against data that hasn't been modified by a PR. For example, some output from this run:
```
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 7 file(s) remaining
upload: model-output/hub-baseline/2022-10-15-hub-baseline.parquet to s3://hubverse-cloud/model-output/hub-baseline/2022-10-15-hub-baseline.parquet
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 6 file(s) remaining
Completed 4.6 KiB/53.5 KiB (36.7 KiB/s) with 6 file(s) remaining
```
Before onboarding large hubs, we should confirm that the sync command is, in fact, only operating on the deltas.
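One way to check (a sketch, assuming single-part uploads so each S3 ETag is the object's plain MD5; the bucket and prefix match the test values above):

```python
# Minimal sketch: compare local file MD5s to S3 ETags to see which files
# a sync should actually treat as changed. For multipart uploads the
# ETag is not a plain MD5, so this heuristic only holds for small files.
import hashlib
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket, prefix = "hubverse-cloud", "model-output/"

remote = {
    obj["Key"]: obj["ETag"].strip('"')
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
}

for path in Path("model-output").rglob("*"):
    if path.is_file():
        local_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        if remote.get(str(path)) != local_md5:
            print("would upload:", path)
```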
- Create a to-hubverse branch from the latest dev branch (most likely related to the v3 schema PRs). (Make sure to pull first.) If no dev branches exist, branch off from main.
- Replace Infectious-Disease-Modeling-Hubs with hubverse-org throughout the repo.
- Update the git remote:

```
git remote set-url origin https://github.com/hubverse-org/<REPO_NAME>.git
# Check remotes
git remote -v
```

Or, from R:

```r
url <- paste0('https://github.com/hubverse-org/', basename(getwd()), '.git')
usethis::use_git_remote(name = "origin", url, overwrite = TRUE)
# Check remotes
usethis::git_remotes()
```
The last several weeks have seen good progress re: AWS infrastructure for hubs:
I'd like to pause and:
Per #18, we will default to parquet format for making hub data available in the cloud.
So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).
I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version to the client/user-facing location.
Now that we've committed to Pulumi as our IaC tooling, we should remove the AWS credentials being stored as GitHub secrets in the hubverse-infrastructure repo.
Edit 2024-04-16: the token we're storing as a GitHub secret isn't actually an AWS token, it's a Pulumi token. That said, we do still need to separate the permissions required to preview infra changes (which we'll need on any branch in the repo) from the permissions required to execute infra changes (which should only be assumed by the repo's main branch).
Definition of done:
- Permissions required to execute infra changes are available only to the main branch of the hubverse-infrastructure repo
- Permissions required to preview infra changes are available to any branch of the hubverse-infrastructure repo (so the preview report can be run when opening a PR)

Add linting, type-checking, and formatting checks to the hubverse-infrastructure repo to ensure that existing and new code meets Hubverse quality standards.
Definition of done:
There are several options for adopting an infrastructure as code (IaC) approach to managing our cloud resources. To start this journey, we need to pick one and create an initial repo that hooks it up to the Hubverse's AWS account.
The goal of this issue is not to create a specific resource, but rather to establish a baseline for adding resources as needed.
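For example, if Pulumi were the pick, the initial baseline could be as small as a single program that proves the pipeline works end to end (the bucket name is a placeholder):

```python
# Minimal sketch of a baseline Pulumi program (__main__.py): one stack,
# one throwaway resource, just to confirm the IaC plumbing works.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("hubverse-iac-baseline")  # placeholder resource

pulumi.export("bucket_name", bucket.id)
```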