data-upload-manager's Issues

Develop Initial Proof-of-Concept

💡 Description

Per discussions with the team, we are looking at producing a proof of concept (POC) for a few different architectures.

Some options:

  • Data provider stages data in their own S3 bucket and publishes CNM to the DAAC SNS topic
  • Data provider stages data in the DAAC S3 bucket and publishes CNM to the DAAC SNS topic
  • Data provider stages data in the DAAC S3 bucket; the DAAC performs a one-time ingest using a crawler tool (for one-off collections)
  • Data provider stages data in the DAAC S3 bucket; the DAAC continuously ingests data as it comes in (a new mechanism that is not yet complete)

Per discussion with the team, we are going to pursue an approach similar to what @collinss-jpl has proposed:

The approach I had been thinking about utilizes AWS API Gateway connected to a Lambda (similar to TEA) to allow a client application on the SBN host to request an upload/sync of a local file or files. The Lambda uses information from the request to determine where in our S3 bucket hierarchy the requested files should get uploaded (based on product type, PDS submitter node, or whatever other criteria we derive). The S3 URI(s) are then returned back through the API Gateway to the client. The SBN client application then uses the returned URI(s) to perform the sync using the CLI or the boto library. Eventually we could work in a job queue on the SBN client app so the uploads can be performed asynchronously from the upload requests. We would also be able to use the built-in throttling capability on API Gateway to control how much data or how many requests we'll allow within a window of time, etc.
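
A minimal sketch of the client side of this flow, assuming a hypothetical request-upload endpoint and a response field named s3_uri (the real API Gateway route, payload schema, and authentication are defined in later tickets):

```python
import boto3
import requests

# Hypothetical API Gateway endpoint; the real route and payload schema are TBD.
API_URL = "https://example.execute-api.us-west-2.amazonaws.com/prod/request-upload"

def request_upload_uri(local_path: str, node_id: str) -> str:
    """Ask the ingress Lambda (via API Gateway) where the file should land in S3."""
    resp = requests.post(API_URL, json={"node": node_id, "path": local_path})
    resp.raise_for_status()
    return resp.json()["s3_uri"]  # assumed response field, e.g. "s3://bucket/sbn/file.xml"

def upload(local_path: str, node_id: str) -> None:
    s3_uri = request_upload_uri(local_path, node_id)
    bucket, key = s3_uri.removeprefix("s3://").split("/", 1)
    boto3.client("s3").upload_file(local_path, bucket, key)

if __name__ == "__main__":
    upload("data/product.xml", "sbn")
```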

As Nucleus, I want to use a lock file to know when DUM is writing to an S3 bucket folder

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can know when a directory is complete, and we can fully evaluate all the products (XML + data files) in the directory and its sub-directories.

📖 Additional Details

  • Crawl the file system
  • For each directory you come across, write a dum.lock file with TBD information in it
  • Continue to crawl and write data; as you complete a directory and all its sub-directories, remove the dum.lock file (see the sketch below)
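
A minimal sketch of the crawl-and-lock behavior described above, assuming the lock file is named dum.lock and that an upload_file helper already exists; the lock file contents are still TBD:

```python
from pathlib import Path

LOCK_NAME = "dum.lock"  # assumed name; contents are still TBD

def upload_file(path: Path) -> None:
    """Placeholder for the actual DUM upload of a single file."""
    ...

def crawl_and_upload(directory: Path) -> None:
    lock = directory / LOCK_NAME
    lock.write_text("locked by DUM\n")  # placeholder content; format is TBD
    try:
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                crawl_and_upload(entry)   # sub-directory gets its own lock
            elif entry.name != LOCK_NAME:
                upload_file(entry)
    finally:
        # Directory and all of its sub-directories are complete; release the lock.
        lock.unlink(missing_ok=True)
```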

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Develop IaC for Deployment

💡 Description

Develop the necessary documentation, Terraform scripts, and/or other definitions/scripts needed to deploy the app on a user system and to deploy it to the cloud.

Implement automatic refresh of Cognito authentication token

💡 Description

The authentication token returned from Cognito has a default expiration of 1 hour, which is typically shorter than what is expected for large file transfers. Cognito authentication tokens can be refreshed by providing the "refresh" token that is supplied after initial authentication.

The DUM client script needs to be updated to support an automatic refresh of the authentication token based on when the token is expected to expire. This should allow long running transfers to complete without interruption.
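
A minimal sketch of the refresh itself using boto3's Cognito Identity Provider client, assuming the app client ID and the stored refresh token are available; where exactly this hooks into the DUM client is an open implementation detail:

```python
import time
import boto3

cognito = boto3.client("cognito-idp", region_name="us-west-2")

def refresh_access_token(client_id: str, refresh_token: str) -> tuple[str, float]:
    """Exchange the refresh token for a new access token and its expiry time."""
    resp = cognito.initiate_auth(
        ClientId=client_id,
        AuthFlow="REFRESH_TOKEN_AUTH",
        AuthParameters={"REFRESH_TOKEN": refresh_token},
    )
    result = resp["AuthenticationResult"]
    expires_at = time.time() + result["ExpiresIn"]  # ExpiresIn is in seconds
    return result["AccessToken"], expires_at

def token_if_needed(client_id, refresh_token, access_token, expires_at, margin=300):
    """Refresh when the current token is within `margin` seconds of expiring."""
    if time.time() >= expires_at - margin:
        return refresh_access_token(client_id, refresh_token)
    return access_token, expires_at
```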

⚔️ Parent Epic / Related Tickets

No response

As a user, I want to force an upload of a file that is already in S3 or the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can overwrite data that has already been loaded into the PDC

📖 Additional Details

No response

Acceptance Criteria

Given a file that has already been loaded into the registry
When I perform a DUM upload with the --overwrite flag enabled
Then I expect the data to overwrite the existing file in the system, and DUM to note that this occurred in the logs

⚙️ Engineering Details

No response

Develop Ingress Client Logging Capabilities

💡 Description

Ticket to develop the logging capabilities of the pds-ingress-client.py script. Added capabilities should include:

  • Add a logging or log utility module to control initialization of the global logger object
  • Allow control of the logging level/format from the command line
  • Add an API Gateway endpoint to submit client logs to a CloudWatch log group
  • Submit all information logged during client execution to the API Gateway endpoint (see the sketch below)
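
A minimal sketch of such a log utility module, assuming a hypothetical /log API Gateway route for forwarding records; the real endpoint, record format, and handler wiring are part of this ticket:

```python
import logging
import requests

# Hypothetical endpoint for forwarding client logs to CloudWatch.
LOG_ENDPOINT = "https://example.execute-api.us-west-2.amazonaws.com/prod/log"

class ApiGatewayHandler(logging.Handler):
    """Buffers formatted records and forwards them to the API Gateway endpoint."""

    def __init__(self) -> None:
        super().__init__()
        self.buffer: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))

    def flush(self) -> None:
        if self.buffer:
            requests.post(LOG_ENDPOINT, json={"messages": self.buffer}, timeout=10)
            self.buffer.clear()

def get_logger(level: str = "INFO", fmt: str = "%(levelname)s %(name)s : %(message)s"):
    """Initialize the global logger with console output plus the forwarding handler."""
    logging.basicConfig(level=level, format=fmt)
    logger = logging.getLogger("pds-ingress-client")
    logger.addHandler(ApiGatewayHandler())
    return logger
```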

Motivation

To help support the discipline nodes.

Add argument to client script to follow symlinks

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

The current behavior of the pds-ingress-client.py script is to ignore any symbolic links encountered when traversing paths to be uploaded. Going forward, it could be useful to add a command-line option to instruct the client to follow encountered symlinks, rather than ignore them by default.

💪 Motivation

Would allow pds-ingress-client.py to be used with datasets that are compiled from pre-existing data via symlinks, avoiding data duplication.
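
A minimal sketch of how such an option could be wired in, assuming a hypothetical --follow-symlinks flag and that the client traverses directories with os.walk:

```python
import argparse
import os

parser = argparse.ArgumentParser(description="pds-ingress-client")
parser.add_argument("paths", nargs="+", help="File or directory paths to upload")
parser.add_argument(
    "--follow-symlinks",
    action="store_true",
    help="Follow symbolic links when traversing directories (ignored by default)",
)
args = parser.parse_args()

for path in args.paths:
    if os.path.isdir(path):
        # os.walk skips symlinked directories unless followlinks=True
        for dirpath, _dirs, files in os.walk(path, followlinks=args.follow_symlinks):
            for name in files:
                full_path = os.path.join(dirpath, name)
                if os.path.islink(full_path) and not args.follow_symlinks:
                    continue  # keep the current default: ignore symlinked files
                print("would upload:", full_path)
```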

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Verify the node of the user against Cognito

💡 Description

  1. We need groups for each node.
  2. The Cognito user will be assigned to node groups (at least one).
  3. The client will forward the access token to the API Gateway.
  4. The Lambda authorizer will decode the groups and check that they match the node specified in the request header (see the sketch below).
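
A minimal sketch of step 4, assuming the access token has already been verified and decoded into its claims (Cognito places group membership in the cognito:groups claim); the x-node-id header name is an assumption:

```python
def check_node_membership(claims: dict, headers: dict) -> bool:
    """Return True if the requested node is one of the caller's Cognito groups."""
    requested_node = headers.get("x-node-id", "").lower()   # assumed header name
    groups = [g.lower() for g in claims.get("cognito:groups", [])]
    return requested_node in groups

def lambda_handler(event, context):
    # Assumes token validation/decoding has already produced `claims`;
    # JWT signature verification is omitted from this sketch.
    claims = event.get("claims", {})
    allowed = check_node_membership(claims, event.get("headers", {}))
    return {
        "isAuthorized": allowed,  # simple-response format for HTTP API authorizers
    }
```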

Motivation

So that the PDS users have a single login and password for all the PDS services.

Develop Ingress Lambda Logging Conventions

💡 Description

The Ingress Lambda function can log messages directly to AWS CloudWatch via the built-in logging library. This will likely be the primary mechanism for tracking incoming requests, so we need to define exactly what we would like to see logged for each request.
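
As a starting point for that discussion, a minimal sketch of per-request logging inside the Lambda handler; the specific fields shown are assumptions that this ticket should confirm or replace:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    # Candidate fields to record for each ingress request (all assumptions for now).
    logger.info(
        "ingress request: request_id=%s node=%s path=%s size=%s",
        context.aws_request_id,
        body.get("node"),
        body.get("path"),
        body.get("size"),
    )
    return {"statusCode": 200}
```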

As a user, I want to parallelize upload of data products to PDC

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Upgrade SBN to latest and Rename Bucket Folder

💡 Description

  • Tag a new DUM release with the logging fix
  • Deploy to production
  • Request SBN to upgrade
  • Rename the root folder in the S3 bucket from SBN to sbn (create the sbn folder, move gbo... to the sbn folder)

⚔️ Parent Epic / Related Tickets

No response

Upload test data set with manual trigger of Nucleus

💡 Description

  • Needs new IAM roles for DUM; get help from @sjoshi-jpl @viviant100
  • Test out deployment to MCP
  • Test uploading data from an internal pdsmcp-dev EC2 instance that can reach the private API Gateway
    • Ask SAs to help set up the EC2 instance and give access to the specific EN operator user group
  • Test uploading data from an on-prem EC2 instance to the public API Gateway
    • Ask SAs to set up an IP whitelist
  • Move on to deploying and running with SBN: #32

Update lambda function to lowercase the node prefix in buckets

💡 Description

Right now data is being pushed to bucket prefixes like /SBN/my/data/here for SBN. The /SBN prefix seems a bit redundant, but we can leave it for now. However, we definitely want it lowercased.
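
A minimal sketch of the change in the Lambda's key construction, assuming the node ID arrives in the request payload; the surrounding path logic is hypothetical:

```python
def build_s3_key(node_id: str, relative_path: str) -> str:
    """Build the object key with a lowercased node prefix, e.g. 'sbn/my/data/here'."""
    return f"{node_id.lower()}/{relative_path.lstrip('/')}"

# build_s3_key("SBN", "/my/data/here") -> "sbn/my/data/here"
```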

⚔️ Parent Epic / Related Tickets

No response

Develop Cost Model

💡 Description

Develop a cost model for the data upload manager components. It should go hand-in-hand with the design doc, but it will be managed and maintained in a secure location. This epic will also include consideration of deployment strategies, as needed.

Develop initial design doc

💡 Description

After initial rapid prototyping has completed, develop a design and architecture diagram/document.

Ideally this document would be posted as part of the online documentation for this repository.

As a user, I want to include an MD5 checksum in the user-defined object metadata being sent in the upload payload

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can include a checksum with the files being uploaded to ensure data integrity as the files flow through the system

📖 Additional Details

No response

Acceptance Criteria

Given a file to be uploaded to S3
When I perform data upload manager execution on that file
Then I expect data upload manager to generate a checksum and add it to the object metadata for the S3 object

⚙️ Engineering Details

https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html
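
A minimal sketch of attaching an MD5 checksum as user-defined object metadata during upload, assuming the client uploads with boto3; the metadata key name md5checksum is an assumption:

```python
import hashlib
import boto3

def md5_of(path: str) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def upload_with_checksum(path: str, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    # User-defined metadata is stored on the object as x-amz-meta-md5checksum.
    s3.upload_file(
        path, bucket, key,
        ExtraArgs={"Metadata": {"md5checksum": md5_of(path)}},
    )
```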

As a user, I want to upload only data products that have not been previously ingested

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can upload only data I do not already have in Planetary Data Cloud.

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Develop Ingress Client Interface

💡 Description

The current command-line interface for the Ingress client script only allows a user to provide a single file path, as well as an (arbitrary) node ID. This interface needs to be developed to allow at a minimum:

  • Validation of the provided node ID against the standard set of PDS identifiers
  • Support for specifying multiple input paths
  • Support for distinguishing paths to files vs. paths to directories and performing the appropriate S3 sync logic (see the sketch below)
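
A minimal sketch of that command-line interface using argparse, assuming a fixed set of PDS node IDs; the exact ID list and option names are assumptions:

```python
import argparse
import os

# Assumed set of PDS node identifiers; the authoritative list may differ.
PDS_NODE_IDS = ["atm", "eng", "geo", "img", "naif", "ppi", "rms", "rs", "sbn"]

parser = argparse.ArgumentParser(prog="pds-ingress-client")
parser.add_argument("paths", nargs="+", help="One or more file or directory paths")
parser.add_argument("--node", required=True, choices=PDS_NODE_IDS,
                    help="PDS node ID used to route the upload")
args = parser.parse_args()

files, directories = [], []
for path in args.paths:
    if os.path.isdir(path):
        directories.append(path)   # directories get S3 "sync"-style handling
    elif os.path.isfile(path):
        files.append(path)         # single files are uploaded directly
    else:
        parser.error(f"no such file or directory: {path}")
```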

As a user, I want to be able to take no more than X seconds per product to upload to AWS

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

One potential issue was raised for awareness: it currently seems to take about 5-10 seconds per CSS product to upload. That will need to be improved somewhere in the chain, because at that rate it will take more than 24 hours to upload the number of products that are generated every 24 hours.

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Variables we need to take into account:

  • Bandwidth to AWS - is there a way we could improve upload throughput?
  • Size of each file / product
  • Number of files
  • Not currently generating checksums

Populate Sphinx documentation for entire DUM service

💡 Description

Ticket to add Sphinx documentation for the entire DUM repository. Topics covered should include:

  • Installation instructions
  • Terraform deployment procedure
  • Cognito account creation
  • INI config format
  • Client script usage

Deploy v1.2.0 DUM to Production

💡 Description

  • Tag new DUM with v1.2.0
    • new summary report
    • new logging group /pds/nucleus/dum/client-log-group
    • token refresh
  • Update DUM client to test new capabilities
  • Debug session with SA
  • Request SBN to upgrade

⚔️ Parent Epic / Related Tickets

No response

Log upload to CloudWatch fails during batch upload

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When testing upload of CSS sample data to DUM, the following warning is generated on every ingest:

`WARNING:root:Unable to submit to CloudWatch Logs, reason: 'LogRecord' object has no attribute 'message'`

🕵️ Expected behavior

The LogRecords should still have a message field assigned, and all logs should be uploaded to CloudWatch without issue.

📜 To Reproduce

  1. Configure an instance of DUM on an EC2 instance that can communicate with the (currently Private) API gateway
  2. Use the pds-ingress-client.py script to upload a file
  3. Verify the above warning is reproduced in the output log

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

As a user, I want an end summary report in logs to show statistics of files uploaded

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can have a summary of how things were uploaded

📖 Additional Details

  • files uploaded
  • files skipped
  • files overwritten

Acceptance Criteria

Given a set of files to be uploaded to S3
When I perform a nominal upload
Then I expect a final report to be output showing metrics of files read, files successfully uploaded, files skipped, and files overwritten

⚙️ Engineering Details

No response

As a user, I want to include the modification datetime in the user-defined object metadata being sent in the upload payload

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can match the modification datetime from the source system where the data is being copied.
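
A minimal sketch of capturing the source file's modification time and attaching it as user-defined object metadata, assuming a boto3 upload; the metadata key name is an assumption:

```python
import datetime
import os
import boto3

def upload_with_mtime(path: str, bucket: str, key: str) -> None:
    # Record the source file's last-modification time as ISO 8601 UTC.
    mtime = datetime.datetime.fromtimestamp(
        os.path.getmtime(path), tz=datetime.timezone.utc
    )
    boto3.client("s3").upload_file(
        path, bucket, key,
        ExtraArgs={"Metadata": {"last-modified": mtime.isoformat()}},  # assumed key name
    )
```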

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Add External Config Support to Ingress Client Script

💡 Description

The current Ingress client script contains a number of hardcoded constants related to AWS configuration (such as API Gateway ID and region) that should be refactored into an external .ini config (or similar) to allow easy customization without requiring code redeployment.
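
A minimal sketch of an external INI config and how the client could load it with the standard-library configparser; the section and option names are assumptions:

```python
# pds_ingress_client.ini (assumed layout)
#
# [API_GATEWAY]
# id = abc123xyz
# region = us-west-2
# stage = prod

import configparser

config = configparser.ConfigParser()
config.read("pds_ingress_client.ini")

api_id = config["API_GATEWAY"]["id"]
region = config["API_GATEWAY"]["region"]
stage = config["API_GATEWAY"]["stage"]
api_url = f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}"
```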

Develop Ingress Service Routing Logic

💡 Description

The current prototype of the data ingress Lambda function contains some dummy logic for determining the product type from the provided file path/node ID. This ticket is to track the planning and implementation of the initial logic for determining the S3 path convention from the data payload provided by the client script.

Also within scope for this ticket is defining the input payload schema itself, which will determine what is sent by the client.

As a result, a back-end service component is developed for the Data-Upload-Manager. It receives the upload request from the client and returns an S3 path (or eventually a presigned S3 URL) where the data should be uploaded by the client.
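
A minimal sketch of such routing logic inside the ingress Lambda, assuming a JSON payload with node and path fields and a single staging bucket; all of these names are assumptions pending the payload schema:

```python
import json

STAGING_BUCKET = "pds-nucleus-staging"  # assumed bucket name

def lambda_handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    node = payload["node"].lower()                 # e.g. "sbn"
    relative_path = payload["path"].lstrip("/")    # path relative to the node root
    s3_uri = f"s3://{STAGING_BUCKET}/{node}/{relative_path}"
    return {
        "statusCode": 200,
        "body": json.dumps({"s3_uri": s3_uri}),
    }
```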

As a user, I want to skip upload of files already in S3 (nucleus staging bucket)

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can avoid duplicate copies of data.

📖 Additional Details

The current design is to overwrite the data whenever the user uploads data via DUM. The proposal is to add a capability that verifies whether the file modification time and size remain unchanged, allowing the copy to S3 to be skipped. Additionally, an optional flag (e.g., --force-overwrite) would enable users to overwrite the file when needed.
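
A minimal sketch of such a check against S3, assuming the modification time is stored as user-defined object metadata at upload time; the metadata key name is an assumption:

```python
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def should_skip(path: str, bucket: str, key: str, force_overwrite: bool = False) -> bool:
    """Skip the upload when the object exists with the same size and mtime."""
    if force_overwrite:
        return False
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return False  # object not found (or not accessible): upload it
    same_size = head["ContentLength"] == os.path.getsize(path)
    # "last-modified" is an assumed user-defined metadata key written at upload time.
    same_mtime = head["Metadata"].get("last-modified") == str(os.path.getmtime(path))
    return same_size and same_mtime
```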

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Note: Per #91, rclone handles this functionality for us.

The user should also have the ability to force overwrite data that is already out there.

As an admin, I want access to buckets to be restricted by subnet

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Cloud Admin / Operator

💪 Motivation

...so that I can add another layer of security to S3 bucket access

📖 Additional Details

No response

Acceptance Criteria

Given a bucket that I have a write-access policy for via the data upload manager, and a client within the allowed IP subnet
When I perform a DUM load
Then I expect the data to upload successfully

Given a bucket that I have a write-access policy for via the data upload manager, and a client outside the allowed IP subnet
When I perform a DUM load
Then I expect the upload to be denied

⚙️ Engineering Details

No response

As a user, I want to skip upload of files that are already in the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I do not try to reload the data

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

The easiest way to do this would be to search the registry either by file path, by checksum, or both. We could do this with the LID/LIDVID, but I think that will add some significant overhead.

Do we want to figure out some sort of auto-generated UUID for every file we upload to the cloud and add it as metadata? Maybe this is something we could then store in the Nucleus database and eventually in the registry. It could link throughout the whole system, agnostic of the LIDVID for the products themselves.

Add support for presigned upload URL usage

💡 Description

To further secure the data upload process, the ingress service Lambda needs to incorporate generation of presigned S3 URLs that the client can use to securely upload files to S3. This will allow all the PDS buckets in Nucleus to be private (so trying to guess an S3 upload URI should not work), while still providing a means for outside users to push to S3 without additional credentials or permissions.
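
A minimal sketch of presigned-URL generation on the service side and its use on the client side, assuming boto3 in the Lambda and requests in the client; bucket and key values are placeholders:

```python
import boto3
import requests

# Service side (ingress Lambda): generate a time-limited upload URL for a private bucket.
def presign_upload(bucket: str, key: str, expires_in: int = 3600) -> str:
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Client side: PUT the file bytes to the presigned URL; no AWS credentials required.
def upload_via_presigned(url: str, path: str) -> None:
    with open(path, "rb") as f:
        resp = requests.put(url, data=f)
    resp.raise_for_status()
```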

DUM client is unable to create CloudWatch Log Stream pds-ingress-client-sbn-* when uploading data to the cloud

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When the css data was uploaded to cloud via DUM client, the following error occured:

WARNING:root:Unable to submit to CloudWatch Logs, reason: Failed to create CloudWatch Log Stream pds-ingress-client-sbn-1709312322, reason: 403 Client Error: Forbidden for url: https://yofdsuex7g.execute-api.us-west-2.amazonaws.com/prod/createstream

🕵️ Expected behavior

No errors, and logs get pushed to the cloud.

📜 To Reproduce

Run DUM on any data and push data to the cloud.

🖥 Environment Info

Linux OS

📚 Version of Software Used

0.3.0

🩺 Test Data / Additional context

Any PDS4 data

🦄 Related requirements

⚙️ Engineering Details

As a workaround, let's plan to comment out the code that is causing this for the time being. Logging to AWS is a lower-priority ("should") requirement.
