data-upload-manager's Issues

Develop Initial Proof-of-Concept

💡 Description

Per discussions with the team, we are looking at producing a proof of concept (POC) for a few different architectures.

Some options:

  • Data provider stages data in their own S3 bucket and publishes CNM to the DAAC SNS topic
  • Data provider stages data in the DAAC S3 bucket and publishes CNM to the DAAC SNS topic
  • Data provider stages data in the DAAC S3 bucket; the DAAC performs a one-time ingest using a crawler tool (for one-off collections)
  • Data provider stages data in the DAAC S3 bucket; the DAAC continuously ingests data as it comes in (a new mechanism that is not yet complete)

Per discussion with the team, we are going to pursue an approach similar to what @collinss-jpl has proposed:

The approach I had been thinking about utilizes AWS API Gateway connected to a Lambda (similar to TEA) to allow a client application on the SBN host to request an upload/sync of a local file or files. The Lambda uses information from the request to determine where in our S3 bucket hierarchy the requested files should get uploaded (based on product type, PDS submitter node, or whatever other criteria we derive). The S3 URI(s) are then returned back through the API Gateway to the client. The SBN client application then uses the returned URI(s) to perform the sync using the CLI or the boto library. Eventually we could work in a job queue on the SBN client app so the uploads can be performed asynchronously from the upload requests. We would also be able to use the built-in throttling capability on API Gateway to control how much data or how many requests we'll allow within a window of time, etc.
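
A minimal sketch of the client side of this flow, assuming a hypothetical request-upload endpoint and a response field named s3_uri (the real API Gateway route, payload schema, and authentication are defined in later tickets):

```python
import boto3
import requests

# Hypothetical API Gateway endpoint; the real route and payload schema are TBD.
API_URL = "https://example.execute-api.us-west-2.amazonaws.com/prod/request-upload"

def request_upload_uri(local_path: str, node_id: str) -> str:
    """Ask the ingress Lambda (via API Gateway) where the file should land in S3."""
    resp = requests.post(API_URL, json={"node": node_id, "path": local_path})
    resp.raise_for_status()
    return resp.json()["s3_uri"]  # assumed response field, e.g. "s3://bucket/sbn/file.xml"

def upload(local_path: str, node_id: str) -> None:
    s3_uri = request_upload_uri(local_path, node_id)
    bucket, key = s3_uri.removeprefix("s3://").split("/", 1)
    boto3.client("s3").upload_file(local_path, bucket, key)

if __name__ == "__main__":
    upload("data/product.xml", "sbn")
```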

As Nucleus, I want to use a lock file to know when DUM is writing to an S3 bucket folder

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can know when a directory is complete, and we can fully evaluate all the products (XML + data files) in the directory and its sub-directories.

📖 Additional Details

  • Crawl the file system
  • For each directory you come across, write a dum.lock file with TBD information in it
  • Continue to crawl and write data; as you complete a directory and all its sub-directories, remove the dum.lock file (see the sketch below)
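
A minimal sketch of the crawl-and-lock behavior described above, assuming the lock file is named dum.lock and that an upload_file helper already exists; the lock file contents are still TBD:

```python
from pathlib import Path

LOCK_NAME = "dum.lock"  # assumed name; contents are still TBD

def upload_file(path: Path) -> None:
    """Placeholder for the actual DUM upload of a single file."""
    ...

def crawl_and_upload(directory: Path) -> None:
    lock = directory / LOCK_NAME
    lock.write_text("locked by DUM\n")  # placeholder content; format is TBD
    try:
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                crawl_and_upload(entry)   # sub-directory gets its own lock
            elif entry.name != LOCK_NAME:
                upload_file(entry)
    finally:
        # Directory and all of its sub-directories are complete; release the lock.
        lock.unlink(missing_ok=True)
```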

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Develop IaC for Deployment

💡 Description

Develop the necessary documentation, Terraform scripts, and/or other definitions/scripts needed to deploy the app on a user system and to deploy it to the cloud.

Implement automatic refresh of Cognito authentication token

💡 Description

The authentication token returned from Cognito has a default expiration of 1 hour, which is typically shorter than what is expected for large file transfers. Cognito authentication tokens can be refreshed by providing the "refresh" token that is supplied after initial authentication.

The DUM client script needs to be updated to support an automatic refresh of the authentication token based on when the token is expected to expire. This should allow long running transfers to complete without interruption.
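
A minimal sketch of the refresh itself using boto3's Cognito Identity Provider client, assuming the app client ID and the stored refresh token are available; where exactly this hooks into the DUM client is an open implementation detail:

```python
import time
import boto3

cognito = boto3.client("cognito-idp", region_name="us-west-2")

def refresh_access_token(client_id: str, refresh_token: str) -> tuple[str, float]:
    """Exchange the refresh token for a new access token and its expiry time."""
    resp = cognito.initiate_auth(
        ClientId=client_id,
        AuthFlow="REFRESH_TOKEN_AUTH",
        AuthParameters={"REFRESH_TOKEN": refresh_token},
    )
    result = resp["AuthenticationResult"]
    expires_at = time.time() + result["ExpiresIn"]  # ExpiresIn is in seconds
    return result["AccessToken"], expires_at

def token_if_needed(client_id, refresh_token, access_token, expires_at, margin=300):
    """Refresh when the current token is within `margin` seconds of expiring."""
    if time.time() >= expires_at - margin:
        return refresh_access_token(client_id, refresh_token)
    return access_token, expires_at
```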

⚔️ Parent Epic / Related Tickets

No response

As a user, I want to force an upload of a file that is already in S3 or the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can overwrite data that has already been loaded into the PDC

📖 Additional Details

No response

Acceptance Criteria

Given a file that has already been loaded into the registry
When I perform a DUM upload with the --overwrite flag enabled
Then I expect the data to overwrite the existing file in the system, and DUM to note that this occurred in the logs

⚙️ Engineering Details

No response

Develop Ingress Client Logging Capabilities

💡 Description

Ticket to develop the logging capabilities of the pds-ingress-client.py script. Added capabilities should include:

  • Add a logging or log utility module to control initialization of the global logger object
  • Allow control of the logging level/format from the command line
  • Add an API Gateway endpoint to submit client logs to a CloudWatch log group
  • Submit all information logged during client execution to the API Gateway endpoint (see the sketch below)
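
A minimal sketch of such a log utility module, assuming a hypothetical /log API Gateway route for forwarding records; the real endpoint, record format, and handler wiring are part of this ticket:

```python
import logging
import requests

# Hypothetical endpoint for forwarding client logs to CloudWatch.
LOG_ENDPOINT = "https://example.execute-api.us-west-2.amazonaws.com/prod/log"

class ApiGatewayHandler(logging.Handler):
    """Buffers formatted records and forwards them to the API Gateway endpoint."""

    def __init__(self) -> None:
        super().__init__()
        self.buffer: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.buffer.append(self.format(record))

    def flush(self) -> None:
        if self.buffer:
            requests.post(LOG_ENDPOINT, json={"messages": self.buffer}, timeout=10)
            self.buffer.clear()

def get_logger(level: str = "INFO", fmt: str = "%(levelname)s %(name)s : %(message)s"):
    """Initialize the global logger with console output plus the forwarding handler."""
    logging.basicConfig(level=level, format=fmt)
    logger = logging.getLogger("pds-ingress-client")
    logger.addHandler(ApiGatewayHandler())
    return logger
```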

Motivation

To help support the discipline nodes.

Add argument to client script to follow symlinks

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

The current behavior of the pds-ingress-client.py script is to ignore any symbolic links encountered when traversing paths to be uploaded. Going forward, it could be useful to add a command-line option to instruct the client to follow encountered symlinks, rather than ignore them by default.

💪 Motivation

Would allow pds-ingress-client.py to be used with datasets that are compiled from pre-existing data via symlinks, avoiding data duplication.
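
A minimal sketch of how such an option could be wired in, assuming a hypothetical --follow-symlinks flag and that the client traverses directories with os.walk:

```python
import argparse
import os

parser = argparse.ArgumentParser(description="pds-ingress-client")
parser.add_argument("paths", nargs="+", help="File or directory paths to upload")
parser.add_argument(
    "--follow-symlinks",
    action="store_true",
    help="Follow symbolic links when traversing directories (ignored by default)",
)
args = parser.parse_args()

for path in args.paths:
    if os.path.isdir(path):
        # os.walk skips symlinked directories unless followlinks=True
        for dirpath, _dirs, files in os.walk(path, followlinks=args.follow_symlinks):
            for name in files:
                full_path = os.path.join(dirpath, name)
                if os.path.islink(full_path) and not args.follow_symlinks:
                    continue  # keep the current default: ignore symlinked files
                print("would upload:", full_path)
```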

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Verify the node of the user against Cognito

💡 Description

  1. We need groups for each node.
  2. The Cognito user will be assigned to node groups (at least one).
  3. The client will forward the access token to the API Gateway.
  4. The Lambda authorizer will decode the groups and check that they match the node specified in the request header (see the sketch below).
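
A minimal sketch of step 4, assuming the access token has already been verified and decoded into its claims (Cognito places group membership in the cognito:groups claim); the x-node-id header name is an assumption:

```python
def check_node_membership(claims: dict, headers: dict) -> bool:
    """Return True if the requested node is one of the caller's Cognito groups."""
    requested_node = headers.get("x-node-id", "").lower()   # assumed header name
    groups = [g.lower() for g in claims.get("cognito:groups", [])]
    return requested_node in groups

def lambda_handler(event, context):
    # Assumes token validation/decoding has already produced `claims`;
    # JWT signature verification is omitted from this sketch.
    claims = event.get("claims", {})
    allowed = check_node_membership(claims, event.get("headers", {}))
    return {
        "isAuthorized": allowed,  # simple-response format for HTTP API authorizers
    }
```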

Motivation

So that the PDS users have a single login and password for all the PDS services.

Develop Ingress Lambda Logging Conventions

💡 Description

The Ingress Lambda function can log messages directly to AWS CloudWatch via the built-in logging library. This will likely be the primary mechanism for tracking incoming requests, so we need to define exactly what we would like to see logged for each request.
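
As a starting point for that discussion, a minimal sketch of per-request logging inside the Lambda handler; the specific fields shown are assumptions that this ticket should confirm or replace:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    # Candidate fields to record for each ingress request (all assumptions for now).
    logger.info(
        "ingress request: request_id=%s node=%s path=%s size=%s",
        context.aws_request_id,
        body.get("node"),
        body.get("path"),
        body.get("size"),
    )
    return {"statusCode": 200}
```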

As a user, I want to parallelize upload of data products to PDC

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

No response

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Upgrade SBN to latest and Rename Bucket Folder

💡 Description

  • Tag a new DUM release with the logging fix
  • Deploy to production
  • Request SBN to upgrade
  • Rename the root folder in the S3 bucket from SBN to sbn (create the sbn folder, move gbo... to the sbn folder)

⚔️ Parent Epic / Related Tickets

No response

Upload test data set with manual trigger of Nucleus

💡 Description

  • Needs new IAM roles for DUM; get help from @sjoshi-jpl @viviant100
  • Test out deployment to MCP
  • Test uploading data from an internal pdsmcp-dev EC2 instance that can reach the private API Gateway
    • Ask SAs to help set up the EC2 instance and give access to the specific EN operator user group
  • Test uploading data from an on-prem EC2 instance to the public API Gateway
    • Ask SAs to set up an IP whitelist
  • Move on to deploying and running with SBN: #32

Update lambda function to lowercase the node prefix in buckets

💡 Description

Right now data is being pushed to bucket prefixes like /SBN/my/data/here for SBN. The /SBN prefix seems a bit redundant, but we can leave it for now. However, we definitely want it lowercased.
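
A minimal sketch of the change in the Lambda's key construction, assuming the node ID arrives in the request payload; the surrounding path logic is hypothetical:

```python
def build_s3_key(node_id: str, relative_path: str) -> str:
    """Build the object key with a lowercased node prefix, e.g. 'sbn/my/data/here'."""
    return f"{node_id.lower()}/{relative_path.lstrip('/')}"

# build_s3_key("SBN", "/my/data/here") -> "sbn/my/data/here"
```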

⚔️ Parent Epic / Related Tickets

No response

Develop Cost Model

💡 Description

Develop a cost model for the data upload manager components. It should go hand-in-hand with the design doc, but it will be managed and maintained in a secure location. This epic will also include consideration of deployment strategies, as needed.

Develop initial design doc

💡 Description

After initial rapid prototyping has completed, develop a design and architecture diagram/document.

Ideally this document would be posted as part of the online documentation for this repository.

As a user, I want to include an MD5 checksum in the user-defined object metadata being sent in the upload payload

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can include a checksum with the files being uploaded to ensure data integrity as the files flow through the system

📖 Additional Details

No response

Acceptance Criteria

Given a file to be uploaded to S3
When I perform data upload manager execution on that file
Then I expect data upload manager to generate a checksum and add it to the object metadata for the S3 object

⚙️ Engineering Details

https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html
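
A minimal sketch of attaching an MD5 checksum as user-defined object metadata during upload, assuming the client uploads with boto3; the metadata key name md5checksum is an assumption:

```python
import hashlib
import boto3

def md5_of(path: str) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def upload_with_checksum(path: str, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    # User-defined metadata is stored on the object as x-amz-meta-md5checksum.
    s3.upload_file(
        path, bucket, key,
        ExtraArgs={"Metadata": {"md5checksum": md5_of(path)}},
    )
```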

As a user, I want to upload only data products that have not been previously ingested

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can upload only data I do not already have in Planetary Data Cloud.

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Develop Ingress Client Interface

💡 Description

The current command-line interface for the Ingress client script only allows a user to provide a single file path, as well as an (arbitrary) node ID. This interface needs to be developed to allow at a minimum:

  • Validation of the provided node ID against the standard set of PDS identifiers
  • Support for specifying multiple input paths
  • Support for distinguishing paths to files vs. paths to directories and performing the appropriate S3 sync logic (see the sketch below)
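
A minimal sketch of that command-line interface using argparse, assuming a fixed set of PDS node IDs; the exact ID list and option names are assumptions:

```python
import argparse
import os

# Assumed set of PDS node identifiers; the authoritative list may differ.
PDS_NODE_IDS = ["atm", "eng", "geo", "img", "naif", "ppi", "rms", "rs", "sbn"]

parser = argparse.ArgumentParser(prog="pds-ingress-client")
parser.add_argument("paths", nargs="+", help="One or more file or directory paths")
parser.add_argument("--node", required=True, choices=PDS_NODE_IDS,
                    help="PDS node ID used to route the upload")
args = parser.parse_args()

files, directories = [], []
for path in args.paths:
    if os.path.isdir(path):
        directories.append(path)   # directories get S3 "sync"-style handling
    elif os.path.isfile(path):
        files.append(path)         # single files are uploaded directly
    else:
        parser.error(f"no such file or directory: {path}")
```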

As a user, I want to be able to take no more than X seconds per product to upload to AWS

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

One potential issue was raised for awareness: it currently seems to take about 5-10 seconds per CSS product to upload. That will need to be improved somewhere in the chain, because at that rate it will take more than 24 hours to upload the number of products that are generated every 24 hours.

💪 Motivation

...so that I can [why do you want to do this?]

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Variables we need to take into account:

  • Bandwidth to AWS - is there a way we could improve upload throughput?
  • Size of each file / product
  • Number of files
  • Not currently generating checksums

Populate Sphinx documentation for entire DUM service

💡 Description

Ticket to add Sphinx documentation for the entire DUM repository. Topics covered should include:

  • Installation instructions
  • Terraform deployment procedure
  • Cognito account creation
  • INI config format
  • Client script usage

Deploy v1.2.0 DUM to Production

💡 Description

  • Tag new DUM with v1.2.0
    • new summary report
    • new logging group /pds/nucleus/dum/client-log-group
    • token refresh
  • Update DUM client to test new capabilities
  • Debug session with SA
  • Request SBN to upgrade

⚔️ Parent Epic / Related Tickets

No response

Log upload to CloudWatch fails during batch upload

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When testing upload of CSS sample data to DUM, the following warning is generated on every ingest:

`WARNING:root:Unable to submit to CloudWatch Logs, reason: 'LogRecord' object has no attribute 'message'`

🕵️ Expected behavior

The LogRecords should still have a message field assigned, and all logs should be uploaded to CloudWatch without issue.

📜 To Reproduce

  1. Configure an instance of DUM on an EC2 instance that can communicate with the (currently Private) API gateway
  2. Use the pds-ingress-client.py script to upload a file
  3. Verify the above warning is reproduced in the output log

🖥 Environment Info

  • Version of this software [e.g. vX.Y.Z]
  • Operating System: [e.g. MacOSX with Docker Desktop vX.Y]
    ...

📚 Version of Software Used

No response

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

As a user, I want an end summary report in logs to show statistics of files uploaded

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can have a summary of how things were uploaded

📖 Additional Details

  • files uploaded
  • files skipped
  • files overwritten

Acceptance Criteria

Given a set of files to be uploaded to S3
When I perform a nominal upload
Then I expect a final report to be output showing metrics of files read, files successfully uploaded, files skipped, and files overwritten

⚙️ Engineering Details

No response

As a user, I want to include the modification datetime in the user-defined object metadata being sent in the upload payload

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Archivist

💪 Motivation

...so that I can match the modification datetime from the source system where the data is being copied.
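
A minimal sketch of capturing the source file's modification time and attaching it as user-defined object metadata, assuming a boto3 upload; the metadata key name is an assumption:

```python
import datetime
import os
import boto3

def upload_with_mtime(path: str, bucket: str, key: str) -> None:
    # Record the source file's last-modification time as ISO 8601 UTC.
    mtime = datetime.datetime.fromtimestamp(
        os.path.getmtime(path), tz=datetime.timezone.utc
    )
    boto3.client("s3").upload_file(
        path, bucket, key,
        ExtraArgs={"Metadata": {"last-modified": mtime.isoformat()}},  # assumed key name
    )
```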

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

Add External Config Support to Ingress Client Script

💡 Description

The current Ingress client script contains a number of hardcoded constants related to AWS configuration (such as API Gateway ID and region) that should be refactored into an external .ini config (or similar) to allow easy customization without requiring code redeployment.
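
A minimal sketch of an external INI config and how the client could load it with the standard-library configparser; the section and option names are assumptions:

```python
# pds_ingress_client.ini (assumed layout)
#
# [API_GATEWAY]
# id = abc123xyz
# region = us-west-2
# stage = prod

import configparser

config = configparser.ConfigParser()
config.read("pds_ingress_client.ini")

api_id = config["API_GATEWAY"]["id"]
region = config["API_GATEWAY"]["region"]
stage = config["API_GATEWAY"]["stage"]
api_url = f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}"
```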

Develop Ingress Service Routing Logic

💡 Description

The current prototype of the data ingress Lambda function contains some dummy logic for determining the product type from the provided file path/node ID. This ticket is to track the planning and implementation of the initial logic for determining the S3 path convention from the data payload provided by the client script.

Also within scope for this ticket is defining the input payload schema itself, which will determine what is sent by the client.

As a result, a back-end service component is developed for the Data-Upload-Manager. It receives the upload request from the client and returns an S3 path (or eventually a presigned S3 URL) where the data should be uploaded by the client.
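
A minimal sketch of such routing logic inside the ingress Lambda, assuming a JSON payload with node and path fields and a single staging bucket; all of these names are assumptions pending the payload schema:

```python
import json

STAGING_BUCKET = "pds-nucleus-staging"  # assumed bucket name

def lambda_handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    node = payload["node"].lower()                 # e.g. "sbn"
    relative_path = payload["path"].lstrip("/")    # path relative to the node root
    s3_uri = f"s3://{STAGING_BUCKET}/{node}/{relative_path}"
    return {
        "statusCode": 200,
        "body": json.dumps({"s3_uri": s3_uri}),
    }
```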

As a user, I want to skip upload of files already in S3 (nucleus staging bucket)

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I can avoid duplicate copies of data.

📖 Additional Details

The current design is to overwrite the data whenever the user uploads data via DUM. The proposal is to add a capability that verifies whether the file modification time and size remain unchanged, allowing the copy to S3 to be skipped. Additionally, an optional flag (e.g., --force-overwrite) would enable users to overwrite the file when needed.
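
A minimal sketch of such a check against S3, assuming the modification time is stored as user-defined object metadata at upload time; the metadata key name is an assumption:

```python
import os
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def should_skip(path: str, bucket: str, key: str, force_overwrite: bool = False) -> bool:
    """Skip the upload when the object exists with the same size and mtime."""
    if force_overwrite:
        return False
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return False  # object not found (or not accessible): upload it
    same_size = head["ContentLength"] == os.path.getsize(path)
    # "last-modified" is an assumed user-defined metadata key written at upload time.
    same_mtime = head["Metadata"].get("last-modified") == str(os.path.getmtime(path))
    return same_size and same_mtime
```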

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

Note: Per #91, rclone handles this functionality for us.

The user should also have the ability to force overwrite data that is already out there.

As an admin, I want access to buckets to be restricted by subnet

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Cloud Admin / Operator

💪 Motivation

...so that I can add another layer of security to S3 bucket access

📖 Additional Details

No response

Acceptance Criteria

Given a bucket that I have a write-access policy for via the data upload manager, and a client within the allowed IP subnet
When I perform a DUM load
Then I expect the data to upload successfully

Given a bucket that I have a write-access policy for via the data upload manager, and a client outside the allowed IP subnet
When I perform a DUM load
Then I expect the upload to be denied

⚙️ Engineering Details

No response

As a user, I want to skip upload of files that are already in the Registry

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I do not try to reload the data

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

The easiest way to do this would be to search the registry either by file path, by checksum, or both. We could do this with the LID/LIDVID, but I think that will add some significant overhead.

Do we want to figure out some sort of auto-generated UUID for every file we upload to the cloud and add it as metadata? Maybe this is something we could then store in the Nucleus database and eventually in the registry. It could link throughout the whole system, agnostic of the LIDVID for the products themselves.

Add support for presigned upload URL usage

💡 Description

To further secure the data upload process, the ingress service Lambda needs to incorporate generation of presigned S3 URLs that the client can use to securely upload files to S3. This will allow all the PDS buckets in Nucleus to be private (so trying to guess an S3 upload URI should not work), while still providing a means for outside users to push to S3 without additional credentials or permissions.
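
A minimal sketch of presigned-URL generation on the service side and its use on the client side, assuming boto3 in the Lambda and requests in the client; bucket and key values are placeholders:

```python
import boto3
import requests

# Service side (ingress Lambda): generate a time-limited upload URL for a private bucket.
def presign_upload(bucket: str, key: str, expires_in: int = 3600) -> str:
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Client side: PUT the file bytes to the presigned URL; no AWS credentials required.
def upload_via_presigned(url: str, path: str) -> None:
    with open(path, "rb") as f:
        resp = requests.put(url, data=f)
    resp.raise_for_status()
```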

DUM client is unable to create CloudWatch Log Stream pds-ingress-client-sbn-* when uploading data to the cloud

Checked for duplicates

No - I haven't checked

๐Ÿ› Describe the bug

When the css data was uploaded to cloud via DUM client, the following error occured:

WARNING:root:Unable to submit to CloudWatch Logs, reason: Failed to create CloudWatch Log Stream pds-ingress-client-sbn-1709312322, reason: 403 Client Error: Forbidden for url: https://yofdsuex7g.execute-api.us-west-2.amazonaws.com/prod/createstream

🕵️ Expected behavior

No errors, and logs get pushed to the cloud.

📜 To Reproduce

Run DUM on any data and push data to the cloud.

🖥 Environment Info

Linux OS

📚 Version of Software Used

0.3.0

🩺 Test Data / Additional context

Any PDS4 data

🦄 Related requirements

⚙️ Engineering Details

As a workaround, let's plan to comment out the code that is causing this for the time being. Logging to AWS is a lower-priority ("should") requirement.
