
hubverse-cloud's Issues

Create AWS alert for unusual activity

Probably doesn't need to be a high priority as we get started, but because we're hosting publicly accessible S3 buckets, we should have AWS alert us to any spikes or unusually high volumes of data access activity (since we'll be paying for it).
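As a starting point, a hedged sketch of a CloudWatch alarm on the account's estimated charges, via boto3 (the alarm name, threshold, and SNS topic ARN are placeholders, and billing alerts must be enabled on the account first):

import boto3

# Billing metrics are only published in us-east-1
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="hubverse-billing-alert",          # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                                # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=100.0,                             # placeholder dollar amount
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:<ACCOUNT_ID>:billing-alerts"],  # placeholder
)

This catches cost spikes after the fact; per-bucket S3 request metrics would give a faster signal but have to be enabled separately.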

Move model-output transform function to its own repository

Background

As part of exploring the possibility of AWS event notifications -> AWS Lambda function for converting incoming model-output data to parquet format (#52), there is some MVP code for the Python-based data transformation function. It's currently in a branch of the hubverse-infrastructure repo.

Work needed for this issue

Move the hubverse-transforms Python module from hubverse-infrastructure to its own repository so that it can be deployed as a Hubverse AWS Lambda function or used by non-hubverse-cloud hubs that want to transform their model-output data.

Definition of done:

  • The model-output transformation Python module is in a standalone repo in the Infectious-Disease-Modeling-Hubs GitHub organization
  • The module can be pip installed from the GitHub repo
  • The module is code checked (linter, types, tests) via CI
  • The new repository has branch protections that prevent updating main without a PR and require successful checks
  • The README describes how to set up the project locally
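For reference, a hedged sketch of what consuming the standalone module might look like once it's pip-installable (the import path and function name are assumptions, not the module's actual API):

# pip install git+https://github.com/Infectious-Disease-Modeling-Hubs/hubverse-transforms.git
from hubverse_transforms import transform_model_output  # hypothetical export

# hypothetical usage: convert a submitted file and land it in S3 as parquet
transform_model_output(
    "model-output/team1/2024-03-07-team1-modelA.csv",
    "s3://hubverse-cloud/model-output",
)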

Determine a data organization strategy for S3

In current hubverse repos, the model-output data is organized like:

model-output/
├─ team1/
│  ├─ yyyy-mm-dd-team1.csv
│  ├─ yyyy-mm-dd-team1.csv
├─ team2/
│  ├─ yyyy-mm-dd-team2.csv
│  ├─ yyyy-mm-dd-team2.csv

The cloud work done to date mirrors this structure in AWS S3 buckets.

Our current hypothesis is that this file organization isn't aligned with how most people want to query the data. For example, when hubUtils connects to an S3 bucket and uses Arrow to query it, performance depends on how well the query maps to the data's physical layout.

Thus, how we organize the data once it's in the cloud is an important decision.
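For illustration, a hedged sketch of writing the same data as a hive-partitioned dataset with pyarrow; the partition columns are an assumption about likely query patterns, not a decision:

import pyarrow as pa
import pyarrow.dataset as ds

# Toy model-output table; partitioning by team and round means queries that
# filter on those columns only scan the matching files.
table = pa.table({
    "team_abbr": ["team1", "team1", "team2"],
    "round_id": ["2024-03-02", "2024-03-09", "2024-03-02"],
    "value": [1.0, 2.0, 3.0],
})

ds.write_dataset(
    table,
    "model-output-partitioned",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("team_abbr", pa.string()), ("round_id", pa.string())]),
        flavor="hive",
    ),
)
# Produces paths like model-output-partitioned/team_abbr=team1/round_id=2024-03-02/part-0.parquet,
# which Arrow-based clients (e.g., hubUtils) can prune when filtering.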

Switch sync utility used in hubverse-aws-upload workflow

As determined in #34, the native AWS s3 sync command is not a good fit for GitHub Actions, since it relies on a file's timestamp to determine whether it has changed since the last sync operation (when the GitHub action checks out code on the runner's virtual machine, all files get the current timestamp, so s3 sync considers them all updated).

Others in our situation have recommended rclone as an alternative. A quick local test is promising:

  • rclone respects the usual AWS authentication methods (i.e., we can still use our "short-term token" approach)
  • rclone has a --checksum option that uses a hash function to determine file changes (instead of relying on file modification dates)

I recreated the problem by checking out the hubverse-cloud repo to another local directory and running rclone as follows. Despite the more recent timestamps on the newly cloned files, rclone didn't sync them.

rclone sync model-output/ s3-test:hubverse-cloud/raw/model-output/ --checksum --verbose
2024/03/07 10:46:54 INFO  : There was nothing to transfer
2024/03/07 10:46:54 INFO  :
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Checks:                 7 / 7, 100%
Elapsed time:         0.2s
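For intuition, a minimal Python sketch of the comparison rclone --checksum performs, using boto3 (illustrative only: multipart-uploaded objects have non-MD5 ETags, so rclone's real logic is more involved):

import hashlib

import boto3

def file_changed(bucket: str, key: str, local_path: str) -> bool:
    """Return True if the local file's MD5 differs from the S3 object's ETag."""
    s3 = boto3.client("s3")
    try:
        etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
    except s3.exceptions.ClientError:
        return True  # no remote object yet, so it needs uploading
    with open(local_path, "rb") as f:
        local_md5 = hashlib.md5(f.read()).hexdigest()
    return local_md5 != etag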

Get IaC production-ready: add test suite to Pulumi code

Add a test suite to the Pulumi code in the hubverse-infrastructure repo so that we can make updates to it with a higher degree of confidence.

Definition of done:

  • hubverse-infrastructure contains unit tests for the modules that create our S3 and IAM AWS resources
  • hubverse-infrastructure has a GitHub workflow that runs the test suite
  • PRs cannot be merged until the tests complete successfully

Some background info on Pulumi unit testing: https://github.com/Infectious-Disease-Modeling-Hubs/hubverse-infrastructure/tree/bsweger/add-model-output-transforms/hubverse-transforms
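For orientation, a minimal sketch of Pulumi's documented mock-based unit-testing pattern (the resource and assertion are hypothetical examples, not actual hubverse-infrastructure tests):

import pulumi

class HubverseMocks(pulumi.runtime.Mocks):
    def new_resource(self, args: pulumi.runtime.MockResourceArgs):
        # return a fake ID and echo the declared inputs as outputs
        return [f"{args.name}_id", args.inputs]

    def call(self, args: pulumi.runtime.MockCallArgs):
        return {}

pulumi.runtime.set_mocks(HubverseMocks())

# resources must be declared after mocks are registered
import pulumi_aws as aws

bucket = aws.s3.Bucket("hubverse-test", tags={"hub": "test"})

@pulumi.runtime.test
def test_bucket_is_tagged():
    def check(tags):
        assert tags["hub"] == "test"

    return bucket.tags.apply(check)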

Configure a GitHub AWS OpenID Connect identity provider in the Hubverse AWS account

The proof-of-concept experiments with S3 used an AWS OpenID Connect (OIDC) identity provider to grant permissions to GitHub Actions. That worked well, so let's replicate that approach in the new Hubverse AWS account.

The additional requirement here is using an "infrastructure as code" method to create this resource. Infrastructure as code is non-trivial, but laying this foundation is critical to the security of our AWS account and to managing the moving parts as new hubs onboard to the cloud.
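As a hedged sketch, the provider resource in Pulumi's Python SDK might look like this (the resource name is an assumption; verify GitHub's current thumbprint before use):

import pulumi_aws as aws

github_oidc = aws.iam.OpenIdConnectProvider(
    "github-actions",
    url="https://token.actions.githubusercontent.com",
    client_id_lists=["sts.amazonaws.com"],
    # GitHub's published root CA thumbprint at the time of writing
    thumbprint_lists=["6938fd4d98bab03faadb97b34396831e3780aea1"],
)

IAM roles can then trust this provider and scope access to specific repos and branches via the token's sub claim.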

Get IaC production-ready: add branch protections

Add branch protections to the hubverse-infrastructure repo to prevent inadvertent updates to the main branch and to ensure that updates to hubverse infrastructure go through a review process.

Definition of done:

  • Feature branch pushes from org members only
  • PRs from org members only
  • No pushes directly to main branch
  • PRs cannot be merged without at least one reviewer
  • PRs cannot be merged until tests and lints have passed
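A hedged sketch of applying the main-branch rules above via the GitHub REST API (the token and check names are placeholders; the first two bullets are repo/org permission settings rather than branch protections):

import requests

resp = requests.put(
    "https://api.github.com/repos/Infectious-Disease-Modeling-Hubs/hubverse-infrastructure/branches/main/protection",
    headers={
        "Authorization": "Bearer <TOKEN>",   # admin-scoped token, placeholder
        "Accept": "application/vnd.github+json",
    },
    json={
        "required_status_checks": {"strict": True, "contexts": ["tests", "lint"]},  # placeholder check names
        "enforce_admins": True,
        "required_pull_request_reviews": {"required_approving_review_count": 1},
        "restrictions": None,
    },
)
resp.raise_for_status()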

Get IaC production-ready: documentation

If we decide to proceed with Pulumi as a framework for managing hubverse cloud infrastructure, we need to ensure that the process is well-documented. We're assuming that the target audience(s) are people who develop Hubverse tooling (as opposed to Hubverse admins and users).

Definition of done:

  • Documentation exists in the repo (e.g., README or similar)
  • Documentation has high-level instructions for the use case of onboarding a hub to Hubverse-owned AWS resources (basically, update the YAML config)
  • Documentation has more detailed instructions for those who need to update the Pulumi code itself (for example, to fix a bug or add a new AWS resource to support additional features)

Test the hubverse-aws-upload workflow against a large volume of data

Our GitHub action workflow to send data to S3 has been working well in our tests with small data volumes. However, it would be useful to get some rough timing estimates for large volumes of data (one reason: @lmullany is working to convert some archived hubs to hubverse format, and it would be great to make that data available on S3).

Create proof-of-concept for using S3 triggers for automated conversion of model-output files

There was some conversation here that resulted in the conclusion that we should not rely on GitHub CI actions for triggering the conversion of incoming model-output files to parquet format.

As a next step, I'd like to explore using S3 event notifications as a way to invoke actions when a model-output file is written to a hub's S3 bucket.

Specifically, these notification types:

  • New object created events
  • Object removal events
  • Object ACL PUT events?

At a high level, the idea is to invoke our prototype "transform model-output file to parquet" function automatically, whenever a model-output file is uploaded to S3 (this happens via GitHub action).

model submission PR merged -> model-output data syncs to S3 -> S3 "new object created" event triggers an AWS Lambda version of the "convert data to parquet" function
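A hedged sketch of what that Lambda entry point might look like (transform_to_parquet is a stand-in for the prototype transform function, not its real name):

import urllib.parse

def transform_to_parquet(bucket: str, key: str) -> None:
    """Placeholder for the prototype 'model-output to parquet' transform."""
    ...

def handler(event, context):
    # S3 event notifications batch records; object keys are URL-encoded
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("raw/model-output/"):
            transform_to_parquet(bucket, key)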

Definition of done:

  • data conversion function automatically runs when model-output files are sent to s3://hubverse-cloud/raw/model-output
  • converted parquet files are written to s3://hubverse-cloud/model-output
  • converted parquet files are accessible via hubData
  • uploading a new version of a model-output file (i.e., with same name) to s3://hubverse-cloud/raw/model-output triggers the same data conversion process as uploading a new file

The AWS resources for this will be created manually (i.e., no need to incorporate into our infrastructure as code process unless we decide this solution will work for us).

Decide on a data format for hubverse cloud storage

In addition to deciding how we want to organize the data for hubs in the cloud, we also want to determine a default data format.

I'm not sure if we'd want to provide this as an admin-configurable option in the future, but it would be great to have an opinion re: a sensible default to get started.

Create a test function to transform model-output data

Following the "demos not memos" principle, before making a decision about what will trigger model-output data conversions for cloud usage (#20), it might be helpful to see some preliminary code for these data conversions.

Create a prototype class/function in hubverse-cloud that (as sketched after the list below):

  • accepts a filename and S3 location as input
  • reads the file into an in-memory tabular data structure (e.g., R dataframe, pandas dataframe, arrow)
  • parses the filename into team, model, and round (see logic in hubValidations)
  • adds 3 columns to the tabular data structure: round id, team name, model name
  • writes the modified data (with the new columns) to the specified S3 location in parquet format
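A minimal sketch of that function in Python, assuming pandas plus pyarrow/s3fs for the parquet write; the filename parsing is a simplified stand-in for the hubValidations logic:

from pathlib import Path

import pandas as pd

def transform_model_output(filename: str, s3_location: str) -> None:
    """Read a model-output csv, add round/team/model columns, write parquet."""
    # assumes names like yyyy-mm-dd-<team>-<model>.csv
    stem = Path(filename).stem              # e.g., "2024-03-07-team1-modelA"
    round_id = stem[:10]                    # yyyy-mm-dd
    team, _, model = stem[11:].partition("-")

    df = pd.read_csv(filename)
    df["round_id"] = round_id
    df["team_abbr"] = team
    df["model_abbr"] = model

    # s3fs handles s3:// paths; pyarrow performs the parquet write
    df.to_parquet(f"{s3_location}/{stem}.parquet", index=False)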

Investigate the actual behavior of S3 sync

According to the docs, the AWS s3 sync command:

Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.

The "new and updated files" bit sounds like what we want when pushing data to S3 after a hub PR is merged.

However, the GitHub workflow doing the sync has some confusing outputs when it runs against data that hasn't been modified by a PR. For example, some output from this run:

Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 7 file(s) remaining
upload: model-output/hub-baseline/2022-10-15-hub-baseline.parquet to s3://hubverse-cloud/model-output/hub-baseline/2022-10-15-hub-baseline.parquet
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 6 file(s) remaining
Completed 4.6 KiB/53.5 KiB (36.7 KiB/s) with 6 file(s) remaining

Before onboarding large hubs, we should confirm that the sync command is, in fact, only operating on the deltas.

[ORG NAME CHANGE]: Update repo to hubverse-org organisation name

❗ Please do not merge anything into main until the 19th June

  • Create a new branch called to-hubverse from the latest dev branch (most likely related to the v3 schema PRs; make sure to pull first). If none exists, branch off from main.
  • Find and replace Infectious-Disease-Modeling-Hubs with hubverse-org throughout repo.
  • Replace the origin remote to point to the new organisation, either from the shell:
    git remote set-url origin https://github.com/hubverse-org/<REPO_NAME>.git
    git remote -v   # check remotes
    or from R:
    url <- paste0('https://github.com/hubverse-org/', basename(getwd()), '.git')
    usethis::use_git_remote(name = "origin", url, overwrite = TRUE)
    usethis::git_remotes()   # check remotes
  • Push and open a PR to the original branch.

Schedule a demo of Hubverse cloud infrastructure

The last several weeks have seen good progress re: AWS infrastructure for hubs:

  1. We're able to sync hub data from GitHub to S3 via GitHub action when a PR is merged
  2. We have an "infrastructure as code" prototype to programmatically provision the AWS resources required for a hub to do the above

I'd like to pause and:

  • Share the reasoning behind the work done so far (e.g., no AWS secrets stored in GitHub repos, the need for Infrastructure as Code)
  • Get feedback on the "infrastructure as code" prototype. Are we striking the right balance between reducing barriers to entry for new hubs and the additional complexity incurred by the Hubverse team?

How will we automate the conversion of hub data to parquet after syncing to S3?

Per #18, we will default to parquet format for making hub data available in the cloud.

So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).

I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version in the client/user-facing location.
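Concretely, that might look like the layout below (the raw/ prefix mirrors the one used in the S3-trigger proof of concept; exact prefixes are an open question):

s3://hubverse-cloud/
├─ raw/
│  ├─ model-output/    (data exactly as submitted by teams)
├─ model-output/       (parquet, client/user-facing)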

Get IaC production-ready: remove GitHub secret for Pulumi AWS access

Now that we've committed to Pulumi as our IaC tooling, we should remove the AWS credentials being stored as GitHub secrets in the hubverse-infrastructure repo.

Edit 2024-04-16: the token we're storing as a GitHub secret isn't actually an AWS token, it's a Pulumi token. That said, we do still need to separate the permissions required to preview infra changes (which we'll need on any branch in the repo) from the permissions required to execute infra changes (which should only be assumed by the repo's main branch).

Definition of done:

  • There is a new AWS IAM role that has the permissions required to create, update, and destroy AWS infrastructure as required in our Pulumi code
  • The new IAM role can be assumed by GitHub actions originating from the main branch of the hubverse-infrastructure repo
  • There is a second new "read only" AWS IAM role that has the permissions necessary to run the Pulumi preview command
  • The "read only" role can be assumed by GitHub actions originating from the feature branches of the hubverse-infrastructure repo (so the preview report can be run when opening a PR)

Get IaC production-ready: add linting and type checking

Add linting, type-checking, and formatting checks to the hubverse-infrastructure repo to ensure that existing and new code meets Hubverse quality standards.

Definition of done:

  • Python code aligns to Black/ruff formatting standards
  • Python code is linted and adheres to the Hubverse's R-based linting decisions where applicable (e.g., 120 characters per line)
  • Python code passes type checks
  • The above checks must pass before a PR is merged

Pilot an Infrastructure as Code tool for onboarding hubs to the cloud

There are several options for adopting an infrastructure as code (IaC) approach to managing our cloud resources. To start this journey, we need to pick one and create an initial repo that hooks it up to the Hubverse's AWS account.

The goal of this issue is not to create a specific resource, but rather to establish a baseline for adding resources as needed.
