Comments (5)
Recording some numbers from the recent test of getting a forked version of the CDC's FluSight to the cloud.
Forked repo: https://github.com/bsweger/FluSight-forecast-hub/tree/main
S3 bucket: bsweger-flusight-forecast
(these will disappear once we're done testing)
number of model-output files
- repo: 1129 (find model-output -type f | wc -l)
- bsweger-flusight-forecast/raw/model-output: 1129 (aws s3 ls s3://bsweger-flusight-forecast/raw/model-output/ --recursive --no-sign-request | wc -l)
- bsweger-flusight-forecast/model-output: 1128 (aws s3 ls s3://bsweger-flusight-forecast/model-output/ --recursive --no-sign-request | wc -l)
Before diving into a more detailed integrity check, I'll run down why we're missing a file in the converted model-output folder (we also need an issue to track getting alerts out to the team when the transform lambda fails).
from hubverse-cloud.
The "missing" file is actually a README.md that wasn't converted to parquet. Granted, we should decide how we want to handle unsupported file types, but the end result--at least from a file count perspective--is as expected.
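Comparing counts alone doesn't say which file is missing. Below is a minimal sketch of a key-level comparison that matches raw and converted files by path, ignoring the extension. The function name and the sample listings are hypothetical; in practice the two lists would come from the find and aws s3 ls commands above.

```python
from pathlib import PurePosixPath


def unconverted_files(raw_keys, converted_keys):
    """Return raw keys that have no converted key with the same path minus extension."""
    converted_stems = {str(PurePosixPath(k).with_suffix("")) for k in converted_keys}
    return sorted(
        k for k in raw_keys
        if str(PurePosixPath(k).with_suffix("")) not in converted_stems
    )


# Hypothetical listings, relative to the model-output prefix
raw = [
    "README.md",
    "example-model/2024-03-02-example-model.csv",
]
converted = [
    "example-model/2024-03-02-example-model.parquet",
]

print(unconverted_files(raw, converted))
```

With these sample listings, the only raw file without a parquet counterpart is the README.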
This exercise resulted in 2 hubData issues we should resolve:
To run some integrity checks that compare a hub's GitHub-based model-output files and the transformed versions of those files, I worked around the above issues by:
- Updating the test hub's admin.json config to add .parquet as a valid file format
- Manually removing the parquet file with the invalid date format from the hub's S3 bucket (bsweger-flusight-forecast/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.parquet)
Below is the R script to run some integrity checks:
test_cloud_hub_data.txt
Console output from running the above:
Rscript test_cloud_hub_data.R
Warning message:
! The following potentially invalid model output file not opened successfully.
/Users/rsweger/code/FluSight-forecast-hub/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.csv
SubTreeFileSystem: s3://bsweger-flusight-forecast/
[1] "Comparing local and cloud row counts"
[1] TRUE
[1] "Comparing local and cloud row counts by model_id"
[1] TRUE
[1] "Comparing local and cloud schemas"
[1] TRUE
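The attached R script isn't reproduced in this thread. As a rough illustration of the three checks named in the console output, here is a Python sketch that works on rows held as plain dicts. The function name and sample rows are hypothetical, and the "schema" check here only compares column names, not data types.

```python
from collections import Counter


def compare_model_output(local_rows, cloud_rows):
    """Compare total row counts, per-model_id row counts, and column names."""
    local_by_model = Counter(row["model_id"] for row in local_rows)
    cloud_by_model = Counter(row["model_id"] for row in cloud_rows)
    local_columns = {col for row in local_rows for col in row}
    cloud_columns = {col for row in cloud_rows for col in row}
    return {
        "row_counts_match": len(local_rows) == len(cloud_rows),
        "row_counts_by_model_match": local_by_model == cloud_by_model,
        "columns_match": local_columns == cloud_columns,
    }


# Hypothetical rows standing in for the local CSV data and the cloud parquet data
local = [
    {"model_id": "example-model", "value": 1.0},
    {"model_id": "example-model", "value": 2.0},
]
cloud = [
    {"model_id": "example-model", "value": 1.0},
    {"model_id": "example-model", "value": 2.0},
]

print(compare_model_output(local, cloud))
```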
AWS handled the "bursty" lambda function invocations successfully, though there was some throttling due to what appears to be a concurrency limit of 10. The image below represents the default content on the "monitoring" tab of the AWS Lambda console (showing lambda activity between 2024-05-15 01:02:00 and 2024-05-15 01:13:00 UTC, which is when the incoming test model-output files emitted the S3 events that trigger the lambda function).
Am not an expert in these charts, but adding some additional info after image:
- Total error count of 3 makes sense: we know there was 1 failure due to an incoming .md file, and by default Lambda will retry 2 more times
- AsyncEventsDropped = the number of events dropped w/o successful execution (in our case, we dropped 1 event due to the above error)
- Concurrency: we're clearly capped at 10 concurrent lambda instances, which presumably explains the "throttle" graph. This is probably fine for handling normal hub operations, but might be something we need to explore for using the S3 event/lambda process for converting archived hubs to cloud-based Hubverse hubs
- Creating a separate "running sum" chart for invocations (# of times the lambda function was invoked) shows 1,131 total invocations:
  - once for each of 1,128 incoming model-output files
  - 3 failed attempts to transform the errant .md file
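The invocation total above can be checked with a little arithmetic: each successfully handled file accounts for one invocation, and each failing event accounts for one initial attempt plus the retries. This is a small sketch; the function name is hypothetical, and the default of 2 retries reflects the retry behavior described above.

```python
def expected_invocations(successful_files, failing_files, async_retries=2):
    """Each failing event is attempted once, then retried async_retries times."""
    return successful_files + failing_files * (1 + async_retries)


# 1,128 converted files plus the one .md file that failed on every attempt
print(expected_invocations(1128, 1))
```

The result matches the 1,131 invocations shown in the running sum chart.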
The concurrency threshold of 10 for our lambda function may be because our AWS account is new: https://benellis.cloud/my-lambda-concurrency-applied-quota-is-only-10-but-why
Gonna move this to done, now that we've onboarded the CDC's FluSight repo to the cloud. The archived FluSight data will have far more volume, but we can open new tickets if getting that onto the cloud surfaces additional issues.