intuit / Trapheus

This tool automates restoration of RDS database instances from snapshots into any dev, staging or production environments. It supports individual RDS Snapshot as well as cluster snapshot restore operations.

Home Page: https://intuit.github.io/Trapheus/

License: MIT License

Python 91.72% Jupyter Notebook 8.28%
rds step-functions aws database-restore python aws-lambda aws-lambda-layer aws-lambda-python state-machine rds-snapshots

trapheus's Introduction

Restore RDS instances in AWS without worrying about client downtime or configuration retention.
Trapheus can restore an individual RDS instance or an RDS cluster. Modelled as a state machine with the help of AWS Step Functions, Trapheus restores the RDS instance much faster than the usual SQL dump, preserving the same instance endpoint and configurations as before.


English | 简体中文 | français

  • Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the AWS pricing page for details.

---------------------------------------------------------------

Table of Contents

-----------------------------------------------------

Pre-Requisites

The app requires the following AWS resources to exist before installation:

  1. python3.11 installed on local machine following this.

  2. Configure AWS SES

    • Configure the SES sender and receiver email (SES Console->Email Addresses).
      • An SES email alert notifies the user about any failures in the state machine. The sender email parameter configures the email ID from which the alert is sent; the receiver email parameter sets the email ID to which the alert is sent.
  3. Create the S3 bucket where the system is going to store the CloudFormation templates:

    • Proposed name: trapheus-cfn-s3-[account-id]-[region]. It is recommended that the name contain your:
      • account-id, because bucket names must be globally unique (this prevents a clash with someone else's bucket)
      • region, to easily keep track when you have trapheus-s3 buckets in multiple regions
  4. A VPC (region specific). The same VPC/region should be used both for the RDS instance(s) to be used with Trapheus and for Trapheus' lambdas.

    • Region selection consideration. Regions that support:
    • Example minimal VPC setup:
      • VPC console:
        • name: Trapheus-VPC-[region] (specify the [region] where your VPC is created, to easily keep track when you have Trapheus-VPCs in multiple regions)
        • IPv4 CIDR block: 10.0.0.0/16
      • Go to the VPC console->Subnets page and create two private subnets:
        • Subnet1:
          • VPC: Trapheus-VPC-[region]
          • Availability Zone: choose one
          • IPv4 CIDR block: 10.0.0.0/19
        • Subnet2:
          • VPC: Trapheus-VPC-[region]
          • Availability Zone: choose a different one than the Subnet1 AZ.
          • IPv4 CIDR block: 10.0.32.0/19
      • You have now created a VPC with only two private subnets. If you are also creating non-private subnets, review the ratio between private subnets, public subnets, private subnets with a dedicated custom network ACL, and spare capacity.
  5. One or more instances of an RDS database that you wish to restore.

    • Example minimal free RDS setup:
      • Engine options: MySQL
      • Templates: Free tier
      • Settings: enter password
      • Connectivity: VPC: Trapheus-VPC-[region]

-----------------------------------------------------

Parameters

The following parameters are used when deploying the CloudFormation template:

  1. --s3-bucket : [Optional] The name of the CloudFormation template S3 bucket from the Pre-Requisites.
  2. vpcID : [Required] The id of the VPC from the Pre-Requisites. The lambdas from the Trapheus state machine will be created in this VPC.
  3. Subnets : [Required] A comma separated list of private subnet ids (region specific) from the Pre-Requisites VPC.
  4. SenderEmail : [Required] The SES sender email configured in the Pre-Requisites.
  5. RecipientEmail : [Required] Comma separated list of recipient email addresses configured in Pre-Requisites.
  6. UseVPCAndSubnets : [Optional] Whether to use the VPC and subnets to create a security group and link the security group and VPC to the lambdas. When UseVPCAndSubnets is left out (the default) or set to 'true', the lambdas are connected to a VPC in your account; by default a function cannot access RDS (or other services) unless the VPC provides access, either by routing outbound traffic through a NAT gateway in a public subnet or via a VPC endpoint, both of which incur cost or require more setup. If set to 'false', the lambdas run in a default Lambda-owned VPC that has access to RDS (and other AWS services).
  7. SlackWebhookUrls : [Optional] Comma separated list of Slack webhooks for failure alerts.
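
For illustration, the parameters above could be supplied as overrides at deploy time. All values below are placeholders, and the exact invocation depends on how install.py drives SAM in this repo:

```shell
sam deploy \
  --stack-name trapheus \
  --s3-bucket trapheus-cfn-s3-123456789012-us-east-1 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
      vpcID=vpc-0abc1234567890def \
      'Subnets="subnet-0aaa111,subnet-0bbb222"' \
      SenderEmail=sender@example.com \
      RecipientEmail=ops@example.com \
      UseVPCAndSubnets=true
```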

-----------------------------------------------------

Tagging

This section describes the currently supported set of tags and how to take advantage of them in support of common use cases around billing insight generation.

Tags provide an additional means of grouping and subdividing costs, whether for analysis or cost distribution. An explanation of how tags can be applied to resources inside AWS is provided here. To facilitate a consistent approach to known and foreseen use cases, the following tags have been added at the stack level as well as at resource level.

AppName - name of the application; default Trapheus
AppComponent - name of the component; since this application targets DB restore, the default component is database
AppFunction - application function name; default RestoreDB

If you would like to change the above defaults, edit samconfig.toml.

Each resource can carry its own tags as well, which override the default stack-level tags.

For example:

  Tags:
    AppComponent: "Lambda"
    AppFunction: "RenameDBInstance"
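
For reference, stack-level tag defaults in samconfig.toml usually live under the deploy parameters. A minimal sketch, assuming the conventional sam deploy layout (the repo's actual file may use different sections or values):

```toml
version = 0.1

[default.deploy.parameters]
stack_name = "trapheus"
tags = "AppName=\"Trapheus\" AppComponent=\"database\" AppFunction=\"RestoreDB\""
```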

Instructions

Setup

To set up Trapheus in your AWS account, follow the steps below:

  1. Clone the Trapheus Git repository
  2. Configure AWS credentials. Trapheus uses boto3 as the client library to talk to Amazon Web Services. Feel free to use any environment variables that boto3 supports to supply authentication credentials.
  3. Run pip install -r requirements.txt to install the dependencies
  4. Run python install.py

Still facing an issue? Check the Issues section or open a new issue

The above will set up a CloudFormation stack in your AWS account with the name provided during installation.

TO BE NOTED: The CFT creates the following resources:

  1. DBRestoreStateMachine Step function state machine
  2. Multiple lambdas to execute various steps in the state machine
  3. LambdaExecutionRole: used across all lambdas to perform multiple tasks across RDS
  4. StatesExecutionRole: IAM role with permissions for executing the state machine and invoking lambdas
  5. S3 bucket: rds-snapshots-<your_account_id> where snapshots will be exported to
  6. KMS key: required to start the export task of a snapshot to S3
  7. DBRestoreStateMachineEventRule: a CloudWatch rule, created in disabled state, that can be enabled following the instructions below based on your requirements
  8. CWEventStatesExecutionRole: IAM role used by DBRestoreStateMachineEventRule CloudWatch rule, to allow execution of the state machine from CloudWatch

To set up the step function execution through a scheduled run using CloudWatch rule, follow the steps below:

  1. Go to DBRestoreStateMachineEventRule section in the template.yaml of the Trapheus repo.

  2. By default it is set as a scheduled cron rule that runs every FRIDAY at 8:00 AM UTC. You can change it to your preferred schedule frequency by updating the ScheduleExpression property's value. Examples:

    • To run it every 7 days, ScheduleExpression: "rate(7 days)"
    • To run it every FRIDAY at 8:00 AM UTC, ScheduleExpression: "cron(0 8 ? * FRI *)"

    Click here for all details on how to set ScheduleExpression.

  3. The sample targets given in the template file under the Targets property are for reference and have to be updated:

    a. Change the Input property according to your input values, and give each target a meaningful ID by updating the Id property.

    b. Based on the number of targets for which you want to set the schedule, add or remove the targets.

  4. Change the State property value to ENABLED

  5. Lastly, package and redeploy the stack following steps 2 and 3 in Trapheus setup
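
Putting steps 2-4 together, the relevant portion of template.yaml might look like the following sketch. The target Id and Input values are placeholders to replace with your own, and property details may differ from the repo's actual definition:

```yaml
DBRestoreStateMachineEventRule:
  Type: AWS::Events::Rule
  Properties:
    Description: Scheduled run of the DBRestoreStateMachine
    ScheduleExpression: "cron(0 8 ? * FRI *)"
    State: ENABLED
    Targets:
      - Id: restore-employees-db
        Arn: !Ref DBRestoreStateMachine
        RoleArn: !GetAtt CWEventStatesExecutionRole.Arn
        Input: '{"identifier": "employees-db", "task": "create_snapshot", "isCluster": false}'
```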

-----------------------------------------------------

Execution

To execute the step function, follow the steps below:

  1. Navigate to the state machine definition from the Resources tab in the CloudFormation stack.
  2. Click on Start Execution.
  3. Under Input, provide the following JSON as the parameter:
{
    "identifier": "<identifier name>",
    "task": "<taskname>",
    "isCluster": true or false
}

a. identifier: (Required - String) The RDS instance or cluster identifier that has to be restored. Any type of RDS instance or Amazon Aurora cluster is supported.

b. task: (Required - String) Valid options are create_snapshot or db_restore or create_snapshot_only.

c. isCluster: (Required - Boolean) Set to true if the identifier provided is that of a cluster; otherwise set to false.
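
As a quick sanity check before starting an execution, the input contract above can be validated with a small script. This helper is illustrative only, not part of Trapheus:

```python
VALID_TASKS = {"create_snapshot", "db_restore", "create_snapshot_only"}

def validate_execution_input(event: dict) -> list:
    """Return a list of problems with a DBRestoreStateMachine input; empty means valid."""
    problems = []
    if not isinstance(event.get("identifier"), str) or not event.get("identifier"):
        problems.append("identifier must be a non-empty string")
    if event.get("task") not in VALID_TASKS:
        problems.append("task must be one of: " + ", ".join(sorted(VALID_TASKS)))
    if not isinstance(event.get("isCluster"), bool):
        problems.append("isCluster must be a boolean")
    return problems

# A well-formed instance-restore input produces no problems.
print(validate_execution_input(
    {"identifier": "employees-db", "task": "db_restore", "isCluster": False}))  # → []
```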

The state machine can do one of the following tasks:

  1. if task is set to create_snapshot, the state machine creates/updates a snapshot for the given RDS instance or cluster using the snapshot identifier: identifier-snapshot and then executes the pipeline
  2. if task is set to db_restore, the state machine does a restore on the given RDS instance, without updating a snapshot, assuming there is an existing snapshot with an identifier: identifier-snapshot
  3. if task is set to create_snapshot_only, the state machine creates/updates a snapshot for the given RDS instance or cluster using the snapshot identifier: identifier-snapshot and it would not execute the pipeline
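
The three behaviours can be summarised in code. This is a simplified sketch of the branching using the default identifier-snapshot naming convention, not the actual state machine definition:

```python
def plan_for_task(identifier: str, task: str) -> dict:
    """Return which phases run for a given task, using the default
    <identifier>-snapshot naming convention."""
    snapshot_id = f"{identifier}-snapshot"
    if task == "create_snapshot":
        return {"snapshot_id": snapshot_id, "update_snapshot": True, "run_restore": True}
    if task == "db_restore":
        return {"snapshot_id": snapshot_id, "update_snapshot": False, "run_restore": True}
    if task == "create_snapshot_only":
        return {"snapshot_id": snapshot_id, "update_snapshot": True, "run_restore": False}
    raise ValueError(f"unknown task: {task}")

print(plan_for_task("employees-db", "db_restore"))
# → {'snapshot_id': 'employees-db-snapshot', 'update_snapshot': False, 'run_restore': True}
```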

Cost considerations

When you are done developing or using the tool:

  1. if you don't need the RDS instance when not coding or using the tool (for instance, if it is a test RDS), consider stopping or deleting the database. You can always recreate it when you need it.
  2. if you don't need the past Cloud Formation templates, it is recommended you empty the CFN S3 bucket.

Tear down

To tear down your application and remove all resources associated with the Trapheus DB Restore state machine, follow these steps:

  1. Log into the Amazon CloudFormation Console and find the stack you created.
  2. Delete the stack. Note that stack deletion will fail if the rds-snapshots-<YOUR_ACCOUNT_NO> S3 bucket is not empty, so first delete the snapshot exports in the bucket.
  3. Delete the AWS resources from the Pre-Requisites. Removal of SES, the CFN S3 bucket (empty it if not deleting) and VPC is optional as you won't see charges, but can re-use them later for a quick start.

-----------------------------------------------------

How it Works

Complete Pipeline

DBRestore depiction

Modelled as a state machine, the different steps in the flow, such as snapshot creation/update, instance rename, restore and deletion, completion/failure status of each operation, and failure email alerts, are executed using individual lambdas for db instances and db clusters respectively. To track completion/failure of each operation, RDS waiters are used, with delays and maximum retry attempts configured based on the lambda timeout. For DB cluster availability and deletion, custom waiters have been defined. Lambda layers are used across all lambdas for common utility methods and custom exception handling.
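
The waiter pattern described above boils down to polling with a fixed delay and a retry cap. A simplified, library-free sketch of that pattern (the real implementation uses boto3/botocore waiter configurations):

```python
import time

class WaiterTimeout(Exception):
    pass

def wait_for_status(poll, desired: str, delay: float = 0.0, max_attempts: int = 5) -> int:
    """Call poll() until it returns `desired`; raise WaiterTimeout after max_attempts.

    Returns the attempt number on which the desired status was observed.
    """
    for attempt in range(1, max_attempts + 1):
        if poll() == desired:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay)  # back off between polls
    raise WaiterTimeout(f"status never reached {desired!r} after {max_attempts} attempts")

# Simulated poller: the "cluster" becomes available on the third check.
statuses = iter(["creating", "backing-up", "available"])
print(wait_for_status(lambda: next(statuses), "available"))  # → 3
```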

Based on the input provided to the DBRestoreStateMachine step function, the following steps/branches are executed:

  1. Using the isCluster value, a branching takes place in the state machine to execute the pipeline for a db cluster or for a db instance.

  2. If task is set to create_snapshot, the snapshot creation/update process takes place for a cluster or instance respectively. A snapshot is created using the unique identifier identifier-snapshot if it does not exist; if a snapshot already exists with that identifier, it is deleted and a new snapshot is created. After the new snapshot is created, the db restoration pipeline executes.

  3. If task is set to db_restore, the db restoration process starts without creating or updating a snapshot

  4. If task is set to create_snapshot_only, only the snapshot creation/update process takes place for a cluster or instance respectively. A snapshot is created using the unique identifier identifier-snapshot if it does not exist; if a snapshot already exists with that identifier, it is deleted and a new snapshot is created. In this scenario, the db restoration pipeline is not started.

  5. As part of the db restoration process, the first step is a Rename of the provided db instance or db cluster and its corresponding instances to a temporary name. Wait for successful completion of the rename step to be able to use the provided unique identifier in the restoration step.

  6. Once the rename step is complete, the next step is to Restore the db-instance or db-cluster using the identifier parameter and the snapshot id identifier-snapshot

  7. Once the restore is complete and the db-instance or db-cluster is available, the final step is to Delete the initially renamed instance or cluster (along with its instances) which was retained for failure handling purposes. Executed using lambdas created for deletion purposes, once the deletion is successful, the pipeline is complete.

  8. At any step, the retries with backoff and failure alerts are handled in every step of the state machine. If there is an occurrence of a failure, an SES email alert is sent as configured during the setup. Optionally, if SlackWebhookUrls was provided in the setup, failure notifications will also be sent to the appropriate channels.

  9. If the restore step fails, as part of failure handling, the instance/cluster rename (Step 4) is reverted to ensure that the original db-instance or db-cluster is available for use.

DBRestore failure handling depiction

-----------------------------------------------------

Contributing to Trapheus

Reference Code Structure

├── LICENSE.md                                        <-- The MIT license.
├── README.md                                         <-- The Readme file.
├── docs                                              <-- The Readme files
│   ├── README.fr.md
│   └── README.zh-CN.md
├── install.py
├── presentation
│   └── Trapheus.pptx
├── requirements.txt
├── screenshots                                       <-- Folder for screenshots of the state machine.
│   ├── Trapheus.gif
│   ├── Trapheus.png
│   ├── cluster_restore.png
│   ├── cluster_snapshot_branch.png
│   ├── failure_handling.png
│   ├── instance_restore.png
│   ├── instance_snapshot_branch.png
│   ├── isCluster_branching.png
│   └── restore_state_machine.png
├── setup.py
├── src
│   ├── checkstatus
│   │   ├── DBClusterStatusWaiter.py                  <-- Python Waiter(https://boto3.amazonaws.com/v1/documentation/api/latest/guide/clients.html#waiters) for checking the status of the cluster
│   │   ├── get_dbcluster_status_function.py          <-- Python Lambda code for polling the status of a clusterised database
│   │   ├── get_dbstatus_function.py                  <-- Python Lambda code for polling the status of a non clusterised RDS instance
│   │   └── waiter_acceptor_config.py                 <-- Config module for the waiters
│   ├── common                                        <-- Common modules across the state machine deployed as a AWS Lambda layer.
│   │   ├── common.zip
│   │   └── python
│   │       ├── constants.py                          <-- Common constants used across the state machine.
│   │       ├── custom_exceptions.py                  <-- Custom exceptions defined for the entire state machine.
│   │       └── utility.py                            <-- Utility module.
│   ├── delete
│   │   ├── cluster_delete_function.py                <-- Python Lambda code for deleting a clusterised database.
│   │   └── delete_function.py                        <-- Python Lambda code for deleting a non clusterised RDS instance.
│   ├── emailalert
│   │   └── email_function.py                         <-- Python Lambda code for sending out failure emails.
│   ├── export
│   │   ├── export_cluster_snapshot_s3_function.py    <-- Python Lambda code for exporting db cluster snapshot to S3.
│   │   └── export_snapshot_s3_function.py            <-- Python Lambda code for exporting db instance snapshot to S3.
│   ├── rename
│   │   ├── cluster_rename_function.py                <-- Python Lambda code for renaming a clusterised database.
│   │   └── rename_function.py                        <-- Python Lambda code for renaming a non-clusterised RDS instance.
│   ├── restore
│   │   ├── cluster_restore_function.py               <-- Python Lambda code for restoring a clusterised database.
│   │   └── restore_function.py                       <-- Python Lambda code for restoring a non-clusterised RDS instance
│   ├── slackNotification
│   │   └── slack_notification.py                     <-- Python Lambda code for sending out a failure alert to configured webhook(s) on Slack.
│   └── snapshot
│       ├── cluster_snapshot_function.py              <-- Python Lambda code for creating a snapshot of a clusterised database.
│       └── snapshot_function.py                      <-- Python Lambda code for creating a snapshot of a non-clusterised RDS instance.
├── template.yaml                                     <-- SAM template definition for the entire state machine.
└── tests
    └── unit
        ├── checkstatus
        │   ├── test_get_dbcluster_status_function.py
        │   └── test_get_dbstatus_function.py
        ├── delete
        │   ├── test_cluster_delete_function.py
        │   └── test_delete_function.py
        ├── emailalert
        │   └── test_email_function.py
        ├── export
        │   ├── test_export_cluster_snapshot_s3_function.py
        │   └── test_export_snapshot_s3_function.py
        ├── rename
        │   ├── test_cluster_rename_function.py
        │   └── test_rename_function.py
        ├── restore
        │   ├── test_cluster_restore_function.py
        │   └── test_restore_function.py
        ├── slackNotification
        │   └── test_slack_notification.py
        └── snapshot
            ├── test_cluster_snapshot_function.py
            └── test_snapshot_function.py

Prepare your environment. Install tools as needed.

  • Git Bash used to run Git from Command Line.
  • Github Desktop Git desktop tool for managing pull requests, branches and repos.
  • Visual Studio Code Full visual editor. Extensions for GitHub can be added.
  • Or an editor of your choice.
  1. Fork Trapheus repo

  2. Create a working branch.

    git branch trapheus-change1
  3. Confirm working branch for changes.

     git checkout trapheus-change1

    You can combine both commands by typing git checkout -b trapheus-change1.

  4. Make changes locally using an editor and add unit tests as needed.

  5. Run the test suite in the repo to ensure existing flows are not breaking.

       cd Trapheus
       python -m pytest tests/ -v #to execute the complete test suite
       python -m pytest tests/unit/checkstatus/test_get_dbstatus_function.py -v #to execute any individual test
  6. Stage edited files.

       git add contentfile.md 

    Or use git add . for multiple files.

  7. Commit changes from staging.

       git commit -m "trapheus-change1"
  8. Push new changes to GitHub

       git push --set-upstream origin trapheus-change1
  9. Verify status of branch

    git status

    Review Output to confirm commit status.

  10. The output of the push will provide a link to create your Pull Request.

Blogs

Maintainers

  1. Namita Devadas (@namitad)

Contributors

agnik7, anukaal, arun-raghav-s, bazzyadb, cawrites, dependabot[bot], despot, github-actions[bot], hariingit, hipstersmoothie, jdfalko, kokikathir, kshitiz305, kundai10, madhura-saw, namitad, nprajilesh, riyajohn, rociomontes, sakthishanmugam, sarthakgupta072, sculley, stationeros, su7edja, zahid07


trapheus's Issues

[Core] Add support having the snapshot id as an input to the state machine

Presently in the state machine, by default the snapshot identifier is expected to be of the format db-instance-identifier-snapshot.

As an enhancement, enable users to provide the snapshot id as part of the inputs to the state machine, not having to adhere to the default format as above. Also modify the lambdas to support reading of input across the various steps in the state machine.

Update failure notifications to have details in a structured format

Currently the email and Slack notifications sent out on pipeline failure contain text content with the error message as a plain string.

Update both functions to send the failure notification in the below format:

database id: <identifier name>
snapshot id: <snapshot identifier name>
failed step: <task name>
cause of failure: <error message>

Fix broken readme links to be compatible with the AWS blog


As part of the "Getting started with Trapheus" section in the AWS blog published on Trapheus (https://aws.amazon.com/blogs/opensource/what-is-trapheus/), there are links to specific sections of the readme. These links are currently broken in our GitHub repo.

Example:
Screenshot 2023-09-26 at 4 09 48 PM

Currently, the link in the screenshot above is https://github.com/intuit/Trapheus#pre-requisites and should redirect to the prerequisites section of our readme but the link in github is https://github.com/intuit/Trapheus#-pre-requisites.

The request is to fix all the GitHub links to ensure the ones mentioned in the blog work as expected.

Automate further the feature for exporting the snapshot to s3 in a region that doesn't support it.

Depends on:

  • issue-4 "export snapshot to s3 for a region that doesn't support it" case is done (the PR-34 is accepted)
  • issue-44 is done

Subtasks:

  1. Implement a StackSet so that you can deploy 2 stacks:
    1.1. the already existing template.yaml/deploy.yaml in the region that doesn't support the export of snapshots to s3 (like Frankfurt, eu-central-1). (You might need to link the resources of the template-unsupportedExportOfSnapshotToS3.yaml into the template.yaml.)
    1.2. a new template-unsupportedExportOfSnapshotToS3.yaml, for the region that supports export of snapshot to s3 like Ireland eu-west-1, that will contain:
    1.2.1. the kmskeyid (check the README.md->Pre-requisites for the structure of the copy snapshot kmskey), and
    1.2.2. the bucket:

    SnapshotsBucket:
      Type: AWS::S3::Bucket
      Condition: HasExportSnapshotSupportedRegion
      Properties:
        BucketName: !Join [ "", [ !Sub "rds-snapshots-${AWS::AccountId}-", !Ref ExportSnapshotSupportedRegion ] ]
    1.2.3. and any other region resources that are required to.
    (You can't specify the other region resources in the same template.yaml, hence a new template-unsupportedExportOfSnapshotToS3.yaml is required)
  2. Modify the E2E test to run without the steps for manually setting the kmskeyid and bucket in the export-supporting region (in other words, you would not need to do point 2.3 from issue-44).

Goal: Reduce the time the user takes to provide manual inputs in order to have a snapshot exported to s3 for a region that doesn't support it.

[Testing] refactor test files to improve code quality and coverage

Refactor the test files to remove code duplication of the common src utility files in the test folder (e.g. mock_utility, mock_constants).
Make use of patch to efficiently access and mock the utility files within the tests as well.

Support custom snapshot identifier based snapshot creation

As part of this issue, we would like to add an enhancement where users can provide a unique snapshot-identifier as an additional input parameter to the state machine, and the snapshot id provided as the value of this parameter will be used as the id for the new snapshot being created (when task=create_snapshot or task=create_snapshot_only).

Refer https://github.com/intuit/Trapheus#-execution for more details on task types

This issue is a follow up enhancement to #132 where the input parameter support will be added

Potential security issue with a possible injection

We are using os.system inside install.py:

https://github.com/intuit/Trapheus/blob/master/install.py#L17
https://github.com/intuit/Trapheus/blob/master/install.py#L44

This type of invocation is dangerous as it is vulnerable to various shell injection attacks. Great care should be taken to sanitize all input in order to mitigate this risk. Calls of this type are identified by the use of certain commands which are known to use shells.

Remediation
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using it is preferable to using os.system.

Example of secure code:

import subprocess

# Pass arguments as a list; with shell=False no shell ever interprets the
# input, so shell-injection via crafted arguments is not possible.
subprocess.call(["/bin/echo", "suspicious", "code"], shell=False)

[Documentation] Add missing existing feature parts to sections and reduce time for adoption by providing minimal example configurations

Add missing existing feature parts to sections:

  • Pre-Requisites section is missing VPC, RDS, SES, CFN S3 (without setting these up a user can get stuck)
  • Parameters section is missing CFN s3
  • Tear Down section is missing Pre-Requisites AWS resources that are not part of the stack. (making the user's aws repo unclean)

Reduce time for adoption by:

  • including minimal simple configurations of VPC and RDS for a quick start. It took me a week to set things up properly as there was nothing available. Let's make it less tricky for users and contributors to adopt Trapheus.
  • Cost considerations section is missing. People will adopt the tool more if they know how to try it without or less cost.
  • links to resources (for faster adoption from new contributors)

[Core] Minimal and lower cost Lambda-RDS access configuration

Provide a way for a user to generate a minimal and lower cost Lambda-RDS access configuration.

Context:
Currently, lambda functions have a VPC, subnets and security group linked to them when generated. Lambdas are connected to a VPC in your account, and by default the function can't access the RDS (or other services) if VPC doesn't provide access (either by routing outbound traffic to a NAT gateway in a public subnet, or having a VPC endpoint, both of which incur cost or require more setup).

How:
Provide a sam deploy parameter that will disable the linkage between the VPC/sg and will let lambdas run in a default Lambda owned VPC that has access to RDS (and other AWS services).

Issue:
Typically, linking your own VPC to the lambdas without additional NAT gateway or VPC endpoint configuration will result in a "Connect timeout on endpoint URL: "https://rds.[region].amazonaws.com/"" or "Task timed out after 3.00 seconds" error.
Typical output from the create snapshot lambda in this case is:
Connect timeout on endpoint URL: "https://rds.eu-west-1.amazonaws.com/"
Traceback (most recent call last):
  File "/var/task/snapshot_function.py", line 22, in lambda_create_dbinstance_snapshot
    util.eval_snapshot_exception(error, instance_id, rds)
  File "/opt/python/utility.py", line 71, in eval_snapshot_exception
    raise custom_exceptions.SnapshotCreationException(error_message)

Support custom snapshot identifier based db restoration

Currently Trapheus supports a limited set of input parameters to the state machine, as defined here: https://github.com/intuit/Trapheus#-execution. The identifier parameter refers to the rds instance/cluster id.

The state machine works on the assumption/prerequisite that the restoration will be done using a snapshot with identifier in the format <identifier-snapshot> or snapshot creation will create/update snapshot with identifier as <identifier-snapshot>.

As part of this issue, we would like to add an enhancement where users can provide a unique snapshot-identifier as an additional input parameter to the state machine and the snapshot id provided as a value for this parameter be used for the db restoration process (when task=db_restore)

Refer https://github.com/intuit/Trapheus#-execution for more details on task types

Parallelise the alert and rollback steps in the state machine

Presently in the state machine, in any failure scenario, there is an email and optional slack alert configured.
Post the alert, a rollback is initiated.
Therefore in scenarios of failure in the alert mechanisms, the rollback is also blocked.

https://github.com/intuit/Trapheus/blob/master/screenshots/failure_handling.png

Enhance this sequential part of the state machine such that the alert and rollback happens in parallel.

Resources to start off: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html
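
A minimal sketch of what the parallel branch could look like in Amazon States Language. The state names and Lambda ARNs below are illustrative placeholders, not the actual Trapheus definitions:

```json
"AlertAndRollback": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "SendFailureAlerts",
      "States": {
        "SendFailureAlerts": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:EmailAlert",
          "End": true
        }
      }
    },
    {
      "StartAt": "RollbackRename",
      "States": {
        "RollbackRename": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:RenameDBInstance",
          "End": true
        }
      }
    }
  ],
  "End": true
}
```

With a Parallel state, a failure in the alert branch no longer blocks the rollback branch, since the branches execute concurrently.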

Use DNS CNAME to reduce client side impact during recovery in high scaled use cases

Problem
Trapheus today incurs downtime while doing a restoration. The downtime arises primarily because AWS does not let us restore a database with the same endpoint. In order to keep parity, Trapheus renames the database to a temporary one and restores it from a snapshot. Trapheus assumes we do this restoration during a maintenance window, or that applications handle database connection failures gracefully with retries using exponential backoff. This enables an application to recover when a database connection becomes unavailable during a restore, without unexpectedly crashing.

For more nuanced use cases where the above assumption can't hold true, for instance in highly scaled scenarios, there are special considerations when restoring an RDS DB instance to replace a live one. For example, we must determine how to redirect existing database clients to the new instance with minimal interruption and modification. We must also ensure continuity and consistency of the data by considering the restored data time and the recovery time at which the new instance begins receiving writes.

Solution
We can create a separate DNS CNAME record that points to our DB instance endpoint and have our clients use this DNS name. Then we can update the CNAME to point to a new, restored endpoint without having to update our database clients.
Set the Time to Live (TTL) for the CNAME record to an appropriate value. The TTL determines how long the record is cached by DNS resolvers before another request is made. Note that some DNS resolvers or applications might not honor the TTL and might cache the record for longer. For Amazon Route 53, a longer value (for example, 172800 seconds, or two days) reduces the number of calls that DNS recursive resolvers must make to Route 53 to fetch the latest record, which reduces latency and the Route 53 bill; for a planned restore, however, a short TTL lets clients pick up the new endpoint quickly. For more information, see How Amazon Route 53 routes traffic for your domain.
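As a sketch of the repointing step, assuming boto3 and a hosted zone we control. The Route 53 change_resource_record_sets call and UPSERT action are real APIs; the function names, zone id, record name, and target below are placeholders:

```python
def cname_upsert_batch(record_name, target, ttl=60):
    """Build a Route 53 change batch that repoints record_name at target via UPSERT."""
    return {
        "Comment": "Repoint clients to the restored DB endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }

def repoint(zone_id, record_name, target, ttl=60):
    """Apply the change batch; requires AWS credentials with Route 53 access."""
    import boto3  # imported lazily so the pure helper above stays dependency-free
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch=cname_upsert_batch(record_name, target, ttl),
    )
```

Clients connect to the CNAME (e.g. db.internal.example.com) rather than the raw RDS endpoint, so a single UPSERT shifts all traffic after a restore.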

Literature
Applications and client operating systems might also cache DNS information that we have to flush, or restart the client, to trigger a new DNS resolution and retrieve the updated CNAME record. When we initiate a database restore and shift traffic to the restored instance, we need to verify that all our clients are writing to the restored instance instead of the prior one.
The data architecture might support restoring the database, updating DNS to shift traffic to the restored instance, and then remediating any data still being written to the prior instance, after which all access goes through the newly restored instance. This allows a gradual decommissioning approach: first limit access to the running prior database via its security group, then stop the instance when it is no longer needed.

Separate create_snapshot from db_restore so latter is not run automatically

Hi, I have an enhancement request for Trapheus.
We would like to use Trapheus to create and export a snapshot from one AWS environment, transfer the exported snapshot to a target environment (a different account), and then restore the snapshot there.

Currently the create_snapshot task always runs into the db_restore part automatically.
To achieve what we need, there needs to be a way to separate the create_snapshot task from db_restore, so that db_restore does not run after a successful create_snapshot.

Maybe there are various ways to do that, but the simplest one would be a small (one-liner) modification to the step function in template.yaml:

```json
"StartClusterSnapshotExportTask": {
    "Type": "Task",
    "Resource": "${ClusterSnapshotExportLambdaArn}",
    "Next": "ClusterRename",
```
by changing ClusterRename to RestorePipelineComplete
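With that one-liner applied, the state would read (everything unchanged except the Next value):

```json
"StartClusterSnapshotExportTask": {
    "Type": "Task",
    "Resource": "${ClusterSnapshotExportLambdaArn}",
    "Next": "RestorePipelineComplete",
```

This ends the execution after the export instead of proceeding to the cluster rename and restore steps.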

We would like to use the code as-is (not maintain a fork), so we would like to see this change in the project itself.
Or maybe introduce a create_snapshot_only task/action and extend the decision logic to stop the step function after having exported the RDS snapshot to S3?

Please give feedback on whether such a modification would be possible.
Regards, Peter

Move all lambda functions to use Graviton2 for cost optimisation

Our lambda function definition here https://github.com/intuit/Trapheus/blob/master/template.yaml#L178C24-L178C24 needs to be moved to Graviton.

Lambda offers two instruction set architectures in most AWS Regions: x86 and arm64. Lambda functions powered by Graviton2 are designed to deliver up to 19 percent better performance at 20 percent lower cost.

We should move our functions to explicitly use Graviton2 (see example below):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Architectures:
      - arm64  # use Graviton2
```

[No-Code] Addition of new blogs

  • We will accept the PR as hacktoberfest-accepted for this issue if it adds new blogs to the documentation.
  • You can write a blog on Trapheus/TrapheusAI in markdown or on Medium. We will feature it as a Community Spotlight along with your name.

Create Definition of Done (DoD) document

Create a DoD document in .github folder, that will contain all the prerequisites for a feature to be considered Done by the implementer/reviewer.

Those would entail:

  1. Have a UT for the feature covering the happy path and the exception paths, with at least 85% line coverage.
  2. No duplicated or redundant code (even in UTs). If code is duplicated, extract and refactor it until it is non-redundant.
  3. If you touch a LoC, you are expected to fix or create a UT for it following the above rules (even if this was missed by a previous implementer/reviewer).
  4. Appropriate documentation for the feature in the README file.
  5. Other?

Allow the user an option to choose between models for completion API.

Have a look at the TrapheusAI demo https://github.com/intuit/Trapheus/tree/master/labs/TrapheusAI#-demo

We want the user to be given an option to choose between LLM models, enabling a type of accuracy and exactness beyond purely language-related tasks.

Also, the dataset might contain embedded contextual information (schema, abbreviations) that is not standardised across datasets for the same search query, and hence might not be learned equally well by all models.

Hence, allow users to choose a model and iterate on the search.

https://github.com/intuit/Trapheus/blob/master/labs/TrapheusAI/llm/model.py#L22

Create a jupyter notebook workshop

Create a Jupyter notebook for Trapheus that can demonstrate the execution commands in a sandbox environment, thus simplifying the current installation process.

Explore the pre version of a major upgrade from open ai

Have a look at the TrapheusAI demo https://github.com/intuit/Trapheus/tree/master/labs/TrapheusAI#-demo

OpenAI is preparing a major version upgrade https://github.com/openai/openai-python#beta-release . We have a dependency on OpenAI https://github.com/intuit/Trapheus/blob/master/labs/TrapheusAI/requirements.txt#L2 for LLMs.

We can install a pre-release version of the upgrade with `pip install --pre openai`

This issue is to explore whether anything breaks in the TrapheusAI app because of the upgrade, after the above installation.
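While testing the pre-release, one low-risk way to gate code paths is to branch on the installed major version. This is a generic sketch: installed_major_version is a hypothetical helper name, and wiring it into TrapheusAI is an assumption:

```python
import importlib.metadata

def installed_major_version(package):
    """Return the installed package's major version as an int, or None if not installed."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])

# e.g. branch TrapheusAI's client code on installed_major_version("openai"),
# since the beta changes the client interface between major versions
```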

Automate installer preferences for Trapheus

Currently the installer file for Trapheus https://github.com/intuit/Trapheus/blob/master/install.py asks user for the following

  1. region
  2. s3_bucket
  3. VPC
  4. Subnets
  5. notification preferences such as slack webhooks and SES email configs

The ask is to make the installer easier: for instance, when the user enters a region, use AWS CLI/SDK commands to fetch the S3 buckets, VPCs, subnets, etc., and present the user an option to pick any of them, or create a new one, from the command line itself.
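A sketch of the pick-or-create prompt. The boto3 discovery calls (list_buckets, describe_vpcs) are real APIs, but the function names and the wiring into install.py are illustrative assumptions:

```python
def choose(prompt, options):
    """Print numbered options; return the picked option, or free text as a new resource name."""
    for i, name in enumerate(options, 1):
        print(f"  {i}. {name}")
    raw = input(f"{prompt} [1-{len(options)}, or type a new name]: ").strip()
    if raw.isdigit() and 1 <= int(raw) <= len(options):
        return options[int(raw) - 1]
    return raw  # treat free text as the name of a resource to create

def discover_and_choose(region):
    """Discover existing resources in the region and prompt the user to pick them."""
    import boto3  # imported lazily; discovery requires AWS credentials
    s3 = boto3.client("s3", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    vpcs = [v["VpcId"] for v in ec2.describe_vpcs()["Vpcs"]]
    return choose("s3_bucket", buckets), choose("VPC", vpcs)
```

The same pattern would extend to subnets (describe_subnets) and the notification preferences.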

Trapheus exporting and restoring to other AWS accounts

Is it possible to use Trapheus to automate export + restore databases to different AWS accounts?

For example, I would like to run create_snapshot_only on Account A and run db_restore on Account B using Account A's snapshot.

Add Spanish translation of docs

Please ensure your issue adheres to the following guidelines:

  • Use the following format: * [owner/repo](link)
  • Link additions should be added to the bottom of the relevant category.
  • New categories or improvements to the existing categorization are welcome.
  • Search previous suggestions before making a new one, as yours may be a duplicate.
  • Sort by alphabetical order

Thanks for contributing!

Addition of Tags causing deployment failures for Trapheus CFT

Multiple failures due to the addition of tags for resources in template.yaml:

  1. A securityGroup tag was incorrectly added for AWS::Serverless::LayerVersion (tags are not supported for resources of this type).
  2. Indentation issues cause stack creation to be stuck at REVIEW_IN_PROGRESS:
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L542
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L569
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L596
  3. The expected format for Tags is a list with Key and Value attributes; the current format results in deployment errors.
