hubverse-org / hubverse-cloud
Test hub for S3 data submission and storage
License: MIT License
Probably doesn't need to be a high priority as we get started, but because we're hosting publicly-accessible S3 buckets, we should have AWS alert us to any spikes/high-volumes of data access activity (since we'll be paying for it).
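For instance, an estimated-charges alarm would cover this. A minimal sketch using boto3 (the alarm name, threshold, and SNS topic ARN are placeholders; AWS billing alerts must already be enabled for the account):

```python
# Minimal sketch: a CloudWatch billing alarm via boto3.
# Billing metrics are only published in us-east-1.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="hubverse-estimated-charges",  # placeholder name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=50.0,  # placeholder USD threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:hubverse-alerts"],  # placeholder ARN
)
```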
Background
As part of exploring the possibility of AWS event notifications -> AWS Lambda function for converting incoming model-output data to parquet format (#52), there is some MVP code for the Python-based data transformation function. It's currently in a branch of the hubverse-infrastructure repo.
Work needed for this issue
Move the hubverse-transforms Python module from hubverse-infrastructure to its own repository so that it can be deployed as a Hubverse AWS Lambda function or used by non-hubverse-cloud hubs that want to transform their model-output data.
Definition of done:
- The hubverse-transforms module lives in its own repository in the Infectious-Disease-Modeling-Hubs GitHub organization
- The repository's main branch cannot be updated without a PR and requires successful checks

Create a designated Hubverse AWS account.
In current hubverse repos, the model-output data is organized like:
```
model-output/
├─ team1/
│  ├─ yyyy-mm-dd-team1.csv
│  ├─ yyyy-mm-dd-team1.csv
├─ team2/
│  ├─ yyyy-mm-dd-team2.csv
│  ├─ yyyy-mm-dd-team2.csv
```
The cloud work done to date mirrors this structure in AWS S3 buckets.
Our current hypothesis is that this file organization isn't aligned with how most people want to query the data. For example, when hubUtils connects to an S3 bucket and uses Arrow to present information, the performance of that operation depends on how well the query maps to how the data is physically stored.
Thus, the decision about how to organize the data once it's in the cloud is an important one.
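To illustrate the tradeoff, here's a minimal sketch using pyarrow to rewrite submitted CSVs into a hive-partitioned parquet layout; the partition column (origin_date) and destination path are hypothetical:

```python
# Minimal sketch: repartitioning model-output data with pyarrow so that
# queries filtering on a column only scan the matching files.
import pyarrow.dataset as ds

# Treat the submitted CSV files as one logical dataset
raw = ds.dataset("model-output/", format="csv")

# Rewrite as parquet, hive-partitioned by a commonly-filtered column
# (in practice the destination would be the hub's S3 bucket)
ds.write_dataset(
    raw,
    "model-output-partitioned/",
    format="parquet",
    partitioning=["origin_date"],  # hypothetical partition column
    partitioning_flavor="hive",
)
```

With a layout like this, an Arrow query filtering on origin_date can skip entire directories instead of scanning every team's files.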
As determined in #34, the native AWS s3 sync command is not a good fit for GitHub Actions, since it relies on a file's timestamp to determine whether it has changed since the last sync operation (when the GitHub action checks out code on the runner's virtual machine, all files get the current timestamp, so s3 sync considers them all updated).
Others in our situation have recommended rclone as an alternative. A quick local test is promising: rclone has a --checksum option that uses a hash function to determine file changes (instead of relying on file modification dates). I recreated the problem by checking out the hubverse-cloud repo to another local directory and running rclone as follows. Despite the more recent timestamps on the newly-cloned files, rclone didn't sync them.
```
rclone sync model-output/ s3-test:hubverse-cloud/raw/model-output/ --checksum --verbose

2024/03/07 10:46:54 INFO : There was nothing to transfer
2024/03/07 10:46:54 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 7 / 7, 100%
Elapsed time: 0.2s
```
Add a test suite to the Pulumi code in the hubverse-infrastructure repo so that we can make updates to it with a higher degree of confidence.
Definition of done:
- hubverse-infrastructure contains unit tests for the modules that create our S3 and IAM AWS resources
- hubverse-infrastructure has a GitHub workflow that runs the test suite

Some background info on Pulumi unit testing: https://github.com/Infectious-Disease-Modeling-Hubs/hubverse-infrastructure/tree/bsweger/add-model-output-transforms/hubverse-transforms
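For context, Pulumi's Python SDK lets us unit-test resource definitions without touching AWS by mocking the engine. A minimal sketch (the resource under test is a stand-in, not our actual module):

```python
# Minimal sketch: unit-testing Pulumi resources with runtime mocks.
import pulumi

class Mocks(pulumi.runtime.Mocks):
    def new_resource(self, args: pulumi.runtime.MockResourceArgs):
        # Return a fake ID and echo the inputs back as outputs
        return [args.name + "_id", args.inputs]

    def call(self, args: pulumi.runtime.MockCallArgs):
        return {}

pulumi.runtime.set_mocks(Mocks())

# Import resources only after mocks are set
import pulumi_aws as aws

@pulumi.runtime.test
def test_bucket_is_versioned():
    bucket = aws.s3.Bucket("hub-bucket", versioning={"enabled": True})

    def check(versioning):
        assert versioning["enabled"] is True

    return bucket.versioning.apply(check)
```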
Demonstrate that we can use GitHub Actions to push data to an S3 bucket.
The proof-of-concept experiments with S3 used an AWS OpenID Connect (OIDC) identity provider to grant permissions to GitHub Actions. That worked well, so let's replicate that approach in the new Hubverse AWS account.
The additional requirement here is using an "infrastructure as code" method to create this resource. Infrastructure as code is non-trivial, but laying this foundation is critical to the security of our AWS account and to managing the moving parts as new hubs onboard to the cloud.
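A rough sketch of what that could look like in Pulumi's Python SDK (the resource names, thumbprint, and repo filter below are placeholders, not our actual configuration):

```python
# Minimal sketch: an OIDC identity provider + IAM role that lets a GitHub
# Actions workflow assume AWS permissions without long-lived credentials.
import json
import pulumi_aws as aws

github_oidc = aws.iam.OpenIdConnectProvider(
    "github-actions",
    url="https://token.actions.githubusercontent.com",
    client_id_lists=["sts.amazonaws.com"],
    thumbprint_lists=["6938fd4d98bab03faadb97b34396831e3780aea1"],  # example value
)

assume_role_policy = github_oidc.arn.apply(lambda arn: json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": arn},
        "Action": "sts:AssumeRoleWithWebIdentity",
        "Condition": {
            "StringLike": {
                # Only workflows in this repo may assume the role (placeholder)
                "token.actions.githubusercontent.com:sub": "repo:hubverse-org/hubverse-cloud:*"
            }
        },
    }],
}))

aws.iam.Role("hubverse-github-role", assume_role_policy=assume_role_policy)
```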
Add branch protections to the hubverse-infrastructure repo to prevent inadvertent updates to the main branch and to ensure that updates to hubverse infrastructure go through a review process.
Definition of done:
- Branch protections are enabled on the main branch

The current GitHub action we're using simply copies model-output data from its GitHub file structure into S3. Once we decide on the best organization/partitioning strategy for cloud-based model outputs (#11), add a step after the initial S3 sync to copy the data into the structure that will be accessed by our query tooling (e.g., hubUtils).
If we decide to proceed with Pulumi as a framework for managing hubverse cloud infrastructure, we need to ensure that the process is well-documented. We're assuming that the target audience(s) are people who develop Hubverse tooling (as opposed to Hubverse admins and users).
Definition of done:
Our GitHub action workflow to send data to S3 has been working well in our tests of small data volumes. However, it would be useful to get some rough estimates on timing with large volumes of data (one reason: @lmullany is working to convert some archived hubs to hubverse format, and it would be great to make that data available on S3)
There was some conversation here that resulted in the conclusion that we should not rely on GitHub CI actions for triggering the conversion of incoming model-output files to parquet format.
As a next step, I'd like to explore using S3 event notifications as a way to invoke actions when a model-output file is written to a hub's S3 bucket.
Specifically, these notifications:
At a high level, the idea is to invoke our prototype "transform model-output file to parquet" function automatically, whenever a model-output file is uploaded to S3 (this happens via GitHub action).
model submission PR merged -> model-output data syncs to S3 -> S3 "new object created" event triggers an AWS Lambda version of the "convert data to parquet" function
Definition of done:
- A new model-output file uploaded to s3://hubverse-cloud/raw/model-output is automatically converted to parquet and written to s3://hubverse-cloud/model-output
- The converted data can be queried via hubData
- Updating an existing file in s3://hubverse-cloud/raw/model-output triggers the same data conversion process as uploading a new file

The AWS resources for this will be created manually (i.e., no need to incorporate them into our infrastructure as code process unless we decide this solution will work for us).
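To make the event flow concrete, a minimal sketch of what the Lambda handler could look like (pandas-based; the raw/ -> model-output/ key convention follows the paths above, everything else is a placeholder):

```python
# Minimal sketch: an S3-triggered Lambda that converts an incoming
# model-output CSV to parquet. Error handling and schema enforcement
# are omitted; pandas reads/writes s3:// URIs via s3fs.
import urllib.parse

import pandas as pd

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # e.g., raw/model-output/team1/2024-03-07-team1.csv
        df = pd.read_csv(f"s3://{bucket}/{key}")

        # Land the parquet version at the user-facing location
        dest_key = key.replace("raw/", "", 1).rsplit(".", 1)[0] + ".parquet"
        df.to_parquet(f"s3://{bucket}/{dest_key}", index=False)
```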
In addition to deciding how we want to organize the data for hubs in the cloud, we also want to determine a default data format.
I'm not sure if we'd want to provide this as an admin-configurable option in the future, but it would be great to have an opinion re: a sensible default to get started.
Following the "demos not memos" principle, before making a decision about what will trigger model-output data conversions for cloud usage (#20), it might be helpful to see some preliminary code for these data conversions.
Create a prototype class/function in hubverse-cloud that performs these data conversions.
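A minimal sketch of what such a prototype could look like (the class and method names are hypothetical; it simply round-trips a submitted CSV to parquet):

```python
# Minimal sketch of a prototype model-output transform (hypothetical names).
from pathlib import Path

import pandas as pd

class ModelOutputTransformer:
    """Convert a submitted model-output file to parquet."""

    def transform(self, source: Path) -> Path:
        df = pd.read_csv(source)
        dest = source.with_suffix(".parquet")
        df.to_parquet(dest, index=False)
        return dest

# Usage:
# ModelOutputTransformer().transform(Path("model-output/team1/2024-03-07-team1.csv"))
```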
According to the docs, the AWS s3 sync command:
Syncs directories and S3 prefixes. Recursively copies new and updated files from the source directory to the destination. Only creates folders in the destination if they contain one or more files.
The "new and updated files" bit sounds like exactly what we want when pushing data to S3 after a hub PR is merged.
However, the GitHub workflow doing the sync has some confusing outputs when it runs against data that hasn't been modified by a PR. For example, some output from this run:
```
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 7 file(s) remaining
upload: model-output/hub-baseline/2022-10-15-hub-baseline.parquet to s3://hubverse-cloud/model-output/hub-baseline/2022-10-15-hub-baseline.parquet
Completed 3.4 KiB/53.5 KiB (66.7 KiB/s) with 6 file(s) remaining
Completed 4.6 KiB/53.5 KiB (36.7 KiB/s) with 6 file(s) remaining
```
Before onboarding large hubs, we should confirm that the sync command is, in fact, only operating on the deltas.
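One way to check (a sketch, assuming single-part uploads so each S3 ETag is the object's plain MD5; the bucket and prefix match the test values above):

```python
# Minimal sketch: compare local file MD5s to S3 ETags to see which files
# a sync should actually treat as changed. For multipart uploads the
# ETag is not a plain MD5, so this heuristic only holds for small files.
import hashlib
from pathlib import Path

import boto3

s3 = boto3.client("s3")
bucket, prefix = "hubverse-cloud", "model-output/"

remote = {
    obj["Key"]: obj["ETag"].strip('"')
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
}

for path in Path("model-output").rglob("*"):
    if path.is_file():
        local_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        if remote.get(str(path)) != local_md5:
            print("would upload:", path)
```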
- Create a to-hubverse branch from the latest dev branch (most likely related to the v3 schema PRs). (Make sure to pull first.) If no dev branches exist, branch off from main.
- Replace Infectious-Disease-Modeling-Hubs with hubverse-org throughout the repo.
- Update the git remote:

```
git remote set-url origin https://github.com/hubverse-org/<REPO_NAME>.git
# Check remotes
git remote -v
```

Or, from R:

```r
url <- paste0('https://github.com/hubverse-org/', basename(getwd()), '.git')
usethis::use_git_remote(name = "origin", url, overwrite = TRUE)
# Check remotes
usethis::git_remotes()
```
The last several weeks have seen good progress re: AWS infrastructure for hubs:
I'd like to pause and:
Per #18, we will default to parquet format for making hub data available in the cloud.
So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).
I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version to the client/user-facing location.
Now that we've committed to Pulumi as our IaC tooling, we should remove the AWS credentials being stored as GitHub secrets in the hubverse-infrastructure repo.
Edit 2024-04-16: the token we're storing as a GitHub secret isn't actually an AWS token, it's a Pulumi token. That said, we do still need to separate the permissions required to preview infra changes (which we'll need on any branch in the repo) from the permissions required to execute infra changes (which should only be assumed by the repo's main branch).
Definition of done:
- Permissions required to execute infra changes are available only to the main branch of the hubverse-infrastructure repo
- Permissions required to preview infra changes are available to any branch of the hubverse-infrastructure repo (so the preview report can be run when opening a PR)

Add linting, type-checking, and formatting checks to the hubverse-infrastructure repo to ensure that existing and new code meets Hubverse quality standards.
Definition of done:
There are several options for adopting an infrastructure as code (IaC) approach to managing our cloud resources. To start this journey, we need to pick one and create an initial repo that hooks it up to the Hubverse's AWS account.
The goal of this issue is not to create a specific resource, but rather to establish a baseline for adding resources as needed.
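For example, if Pulumi were the pick, the initial baseline could be as small as a single program that proves the pipeline works end to end (the bucket name is a placeholder):

```python
# Minimal sketch of a baseline Pulumi program (__main__.py): one stack,
# one throwaway resource, just to confirm the IaC plumbing works.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("hubverse-iac-baseline")  # placeholder resource

pulumi.export("bucket_name", bucket.id)
```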