intuit / Trapheus

This tool automates restoration of RDS database instances from snapshots into any dev, staging or production environments. It supports individual RDS Snapshot as well as cluster snapshot restore operations.

Home Page: https://intuit.github.io/Trapheus/

License: MIT License

Python 91.72% Jupyter Notebook 8.28%
rds step-functions aws database-restore python aws-lambda aws-lambda-layer aws-lambda-python state-machine rds-snapshots

trapheus's Introduction

Restore RDS instances in AWS without worrying about client downtime or configuration retention.
Trapheus can restore an individual RDS instance or an RDS cluster. Modelled as a state machine with the help of AWS Step Functions, Trapheus restores the RDS instance much faster than the usual SQL dump, preserving the same instance endpoint and configurations as before.


English | 简体中文 | français

  • Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the AWS pricing page for details.

---------------------------------------------------------------

Table of Contents

-----------------------------------------------------

Pre-Requisites

The app requires the following AWS resources to exist before installation:

  1. python3.11 installed on local machine following this.

  2. Configure AWS SES

    • Configure the SES sender and receiver email (SES Console->Email Addresses).
      • An SES email alert notifies the user about any failures in the state machine. The sender email parameter configures the email ID from which the alert is sent; the receiver email parameter sets the email ID to which the alert is sent.
  3. Create the S3 bucket where the system is going to store the CloudFormation templates:

    • Proposed name: trapheus-cfn-s3-[account-id]-[region]. It is recommended that the name contain your:
      • account-id, because bucket names must be globally unique (this prevents a clash with someone else's bucket)
      • region, to easily keep track when you have trapheus-s3 buckets in multiple regions
  4. A VPC (region specific). The same VPC/region should be used both for the RDS instance(s) to be used with Trapheus and for Trapheus' lambdas.

    • Region selection consideration. Regions that support:
    • Example minimal VPC setup:
      • VPC console:
        • name: Trapheus-VPC-[region] (specify the [region] where your VPC is created, to easily keep track when you have Trapheus-VPCs in multiple regions)
        • IPv4 CIDR block: 10.0.0.0/16
      • Go to the VPC console->Subnets page and create two private subnets:
        • Subnet1:
          • VPC: Trapheus-VPC-[region]
          • Availability Zone: choose one
          • IPv4 CIDR block: 10.0.0.0/19
        • Subnet2:
          • VPC: Trapheus-VPC-[region]
          • Availability Zone: choose a different one than the Subnet1 AZ.
          • IPv4 CIDR block: 10.0.32.0/19
      • You have now created a VPC with only two private subnets. If you are also creating non-private subnets, review the ratio between private subnets, public subnets, private subnets with a dedicated custom network ACL, and spare capacity.
  5. One or more instances of an RDS database that you wish to restore.

    • Example minimal free RDS setup:
      • Engine options: MySQL
      • Templates: Free tier
      • Settings: enter password
      • Connectivity: VPC: Trapheus-VPC-[region]

-----------------------------------------------------

Parameters

The following parameters are used when deploying the CloudFormation template:

  1. --s3-bucket : [Optional] The name of the CloudFormation template S3 bucket from the Pre-Requisites.
  2. vpcID : [Required] The id of the VPC from the Pre-Requisites. The lambdas from the Trapheus state machine will be created in this VPC.
  3. Subnets : [Required] A comma separated list of private subnet ids (region specific) from the Pre-Requisites VPC.
  4. SenderEmail : [Required] The SES sender email configured in the Pre-Requisites.
  5. RecipientEmail : [Required] Comma separated list of recipient email addresses configured in Pre-Requisites.
  6. UseVPCAndSubnets : [Optional] Whether to use the VPC and subnets to create a security group and link the security group and VPC to the lambdas. When UseVPCAndSubnets is left out (the default) or set to 'true', the lambdas are connected to a VPC in your account; by default a function cannot access RDS (or other services) unless the VPC provides access, either by routing outbound traffic through a NAT gateway in a public subnet or via a VPC endpoint, both of which incur cost or require more setup. If set to 'false', the lambdas run in a default Lambda-owned VPC that has access to RDS (and other AWS services).
  7. SlackWebhookUrls : [Optional] Comma separated list of Slack webhooks for failure alerts.
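
For illustration, the parameters above could be supplied as overrides at deploy time. All values below are placeholders, and the exact invocation depends on how install.py drives SAM in this repo:

```shell
sam deploy \
  --stack-name trapheus \
  --s3-bucket trapheus-cfn-s3-123456789012-us-east-1 \
  --capabilities CAPABILITY_IAM \
  --parameter-overrides \
      vpcID=vpc-0abc1234567890def \
      'Subnets="subnet-0aaa111,subnet-0bbb222"' \
      SenderEmail=sender@example.com \
      RecipientEmail=ops@example.com \
      UseVPCAndSubnets=true
```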

-----------------------------------------------------

Tagging

This section describes the currently supported set of tags and how to take advantage of them in support of common use cases around billing insight generation.

Tags provide an additional means of grouping and subdividing costs, whether for analysis or cost distribution. An explanation of how tags can be applied to resources inside AWS is provided here. To facilitate a consistent approach to known and foreseen use cases, the following tags have been added at the stack level as well as at resource level.

AppName - name of the application; default Trapheus
AppComponent - name of the component; since this application targets DB restore, the default component is database
AppFunction - application function name; default RestoreDB

If you would like to change the above defaults, edit samconfig.toml.

Each resource can carry its own tags as well, which override the default stack-level tags.

For example:

  Tags:
    AppComponent: "Lambda"
    AppFunction: "RenameDBInstance"
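
For reference, stack-level tag defaults in samconfig.toml usually live under the deploy parameters. A minimal sketch, assuming the conventional sam deploy layout (the repo's actual file may use different sections or values):

```toml
version = 0.1

[default.deploy.parameters]
stack_name = "trapheus"
tags = "AppName=\"Trapheus\" AppComponent=\"database\" AppFunction=\"RestoreDB\""
```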

Instructions

Setup

To set up Trapheus in your AWS account, follow the steps below:

  1. Clone the Trapheus Git repository
  2. Configure AWS credentials. Trapheus uses boto3 as the client library to talk to Amazon Web Services. Feel free to use any environment variables that boto3 supports to supply authentication credentials.
  3. Run pip install -r requirements.txt to install the dependencies
  4. Run python install.py

Still facing an issue? Check the Issues section or open a new issue

The above will set up a CloudFormation stack in your AWS account with the name provided during installation.

TO BE NOTED: The CFT creates the following resources:

  1. DBRestoreStateMachine Step function state machine
  2. Multiple lambdas to execute various steps in the state machine
  3. LambdaExecutionRole: used across all lambdas to perform multiple tasks across RDS
  4. StatesExecutionRole: IAM role with permissions for executing the state machine and invoking lambdas
  5. S3 bucket: rds-snapshots-<your_account_id> where snapshots will be exported to
  6. KMS key: required to start the export task of a snapshot to S3
  7. DBRestoreStateMachineEventRule: a CloudWatch rule, created in disabled state, that can be enabled following the instructions below based on your requirements
  8. CWEventStatesExecutionRole: IAM role used by DBRestoreStateMachineEventRule CloudWatch rule, to allow execution of the state machine from CloudWatch

To set up the step function execution through a scheduled run using CloudWatch rule, follow the steps below:

  1. Go to DBRestoreStateMachineEventRule section in the template.yaml of the Trapheus repo.

  2. By default it is set as a scheduled cron rule that runs every FRIDAY at 8:00 AM UTC. You can change it to your preferred schedule frequency by updating the ScheduleExpression property's value. Examples:

    • To run it every 7 days, ScheduleExpression: "rate(7 days)"
    • To run it every FRIDAY at 8:00 AM UTC, ScheduleExpression: "cron(0 8 ? * FRI *)"

    Click here for all details on how to set ScheduleExpression.

  3. The sample targets given in the template file under the Targets property are for reference and have to be updated:

    a. Change the Input property according to your input values, and give each target a meaningful ID by updating the Id property.

    b. Based on the number of targets for which you want to set the schedule, add or remove the targets.

  4. Change the State property value to ENABLED

  5. Lastly, package and redeploy the stack following steps 2 and 3 in Trapheus setup
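
Putting steps 2-4 together, the relevant portion of template.yaml might look like the following sketch. The target Id and Input values are placeholders to replace with your own, and property details may differ from the repo's actual definition:

```yaml
DBRestoreStateMachineEventRule:
  Type: AWS::Events::Rule
  Properties:
    Description: Scheduled run of the DBRestoreStateMachine
    ScheduleExpression: "cron(0 8 ? * FRI *)"
    State: ENABLED
    Targets:
      - Id: restore-employees-db
        Arn: !Ref DBRestoreStateMachine
        RoleArn: !GetAtt CWEventStatesExecutionRole.Arn
        Input: '{"identifier": "employees-db", "task": "create_snapshot", "isCluster": false}'
```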

-----------------------------------------------------

Execution

To execute the step function, follow the steps below:

  1. Navigate to the state machine definition from the Resources tab in the CloudFormation stack.
  2. Click on Start Execution.
  3. Under Input, provide the following JSON as the parameter:
{
    "identifier": "<identifier name>",
    "task": "<taskname>",
    "isCluster": true or false
}

a. identifier: (Required - String) The RDS instance or cluster identifier that has to be restored. Any type of RDS instance or Amazon Aurora cluster is supported.

b. task: (Required - String) Valid options are create_snapshot or db_restore or create_snapshot_only.

c. isCluster: (Required - Boolean) Set to true if the identifier provided is that of a cluster; otherwise set to false.
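
As a quick sanity check before starting an execution, the input contract above can be validated with a small script. This helper is illustrative only, not part of Trapheus:

```python
VALID_TASKS = {"create_snapshot", "db_restore", "create_snapshot_only"}

def validate_execution_input(event: dict) -> list:
    """Return a list of problems with a DBRestoreStateMachine input; empty means valid."""
    problems = []
    if not isinstance(event.get("identifier"), str) or not event.get("identifier"):
        problems.append("identifier must be a non-empty string")
    if event.get("task") not in VALID_TASKS:
        problems.append("task must be one of: " + ", ".join(sorted(VALID_TASKS)))
    if not isinstance(event.get("isCluster"), bool):
        problems.append("isCluster must be a boolean")
    return problems

# A well-formed instance-restore input produces no problems.
print(validate_execution_input(
    {"identifier": "employees-db", "task": "db_restore", "isCluster": False}))  # → []
```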

The state machine can do one of the following tasks:

  1. if task is set to create_snapshot, the state machine creates/updates a snapshot for the given RDS instance or cluster using the snapshot identifier: identifier-snapshot and then executes the pipeline
  2. if task is set to db_restore, the state machine does a restore on the given RDS instance, without updating a snapshot, assuming there is an existing snapshot with an identifier: identifier-snapshot
  3. if task is set to create_snapshot_only, the state machine creates/updates a snapshot for the given RDS instance or cluster using the snapshot identifier: identifier-snapshot and it would not execute the pipeline
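
The three behaviours can be summarised in code. This is a simplified sketch of the branching using the default identifier-snapshot naming convention, not the actual state machine definition:

```python
def plan_for_task(identifier: str, task: str) -> dict:
    """Return which phases run for a given task, using the default
    <identifier>-snapshot naming convention."""
    snapshot_id = f"{identifier}-snapshot"
    if task == "create_snapshot":
        return {"snapshot_id": snapshot_id, "update_snapshot": True, "run_restore": True}
    if task == "db_restore":
        return {"snapshot_id": snapshot_id, "update_snapshot": False, "run_restore": True}
    if task == "create_snapshot_only":
        return {"snapshot_id": snapshot_id, "update_snapshot": True, "run_restore": False}
    raise ValueError(f"unknown task: {task}")

print(plan_for_task("employees-db", "db_restore"))
# → {'snapshot_id': 'employees-db-snapshot', 'update_snapshot': False, 'run_restore': True}
```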

Cost considerations

When you are done developing or using the tool:

  1. if you don't need the RDS instance when not coding or using the tool (for instance, if it is a test RDS), consider stopping or deleting the database. You can always recreate it when you need it.
  2. if you don't need the past Cloud Formation templates, it is recommended you empty the CFN S3 bucket.

Tear down

To tear down your application and remove all resources associated with the Trapheus DB Restore state machine, follow these steps:

  1. Log into the Amazon CloudFormation Console and find the stack you created.
  2. Delete the stack. Note that stack deletion will fail if the rds-snapshots-<YOUR_ACCOUNT_NO> S3 bucket is not empty, so first delete the snapshot exports in the bucket.
  3. Delete the AWS resources from the Pre-Requisites. Removal of SES, the CFN S3 bucket (empty it if not deleting) and VPC is optional as you won't see charges, but can re-use them later for a quick start.

-----------------------------------------------------

How it Works

Complete Pipeline

DBRestore depiction

Modelled as a state machine, the different steps in the flow, such as snapshot creation/update, instance rename, restore and deletion, completion/failure status of each operation, and failure email alerts, are executed using individual lambdas for db instances and db clusters respectively. To track completion/failure of each operation, RDS waiters are used, with delays and maximum retry attempts configured based on the lambda timeout. For DB cluster availability and deletion, custom waiters have been defined. Lambda layers are used across all lambdas for common utility methods and custom exception handling.
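
The waiter pattern described above boils down to polling with a fixed delay and a retry cap. A simplified, library-free sketch of that pattern (the real implementation uses boto3/botocore waiter configurations):

```python
import time

class WaiterTimeout(Exception):
    pass

def wait_for_status(poll, desired: str, delay: float = 0.0, max_attempts: int = 5) -> int:
    """Call poll() until it returns `desired`; raise WaiterTimeout after max_attempts.

    Returns the attempt number on which the desired status was observed.
    """
    for attempt in range(1, max_attempts + 1):
        if poll() == desired:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay)  # back off between polls
    raise WaiterTimeout(f"status never reached {desired!r} after {max_attempts} attempts")

# Simulated poller: the "cluster" becomes available on the third check.
statuses = iter(["creating", "backing-up", "available"])
print(wait_for_status(lambda: next(statuses), "available"))  # → 3
```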

Based on the input provided to the DBRestoreStateMachine step function, the following steps/branches are executed:

  1. Using the isCluster value, a branching takes place in the state machine to execute the pipeline for a db cluster or for a db instance.

  2. If task is set to create_snapshot, the snapshot creation/update process takes place for a cluster or instance respectively. A snapshot is created using the unique identifier identifier-snapshot if it does not exist; if a snapshot already exists with that identifier, it is deleted and a new snapshot is created. After the new snapshot is created, the db restoration pipeline executes.

  3. If task is set to db_restore, the db restoration process starts without creating or updating a snapshot

  4. If task is set to create_snapshot_only, only the snapshot creation/update process takes place for a cluster or instance respectively. A snapshot is created using the unique identifier identifier-snapshot if it does not exist; if a snapshot already exists with that identifier, it is deleted and a new snapshot is created. In this scenario, the db restoration pipeline is not started.

  5. As part of the db restoration process, the first step is a Rename of the provided db instance or db cluster and its corresponding instances to a temporary name. Wait for successful completion of the rename step to be able to use the provided unique identifier in the restoration step.

  6. Once the rename step is complete, the next step is to Restore the db-instance or db-cluster using the identifier parameter and the snapshot id identifier-snapshot

  7. Once the restore is complete and the db-instance or db-cluster is available, the final step is to Delete the initially renamed instance or cluster (along with its instances) which was retained for failure handling purposes. Executed using lambdas created for deletion purposes, once the deletion is successful, the pipeline is complete.

  8. At any step, the retries with backoff and failure alerts are handled in every step of the state machine. If there is an occurrence of a failure, an SES email alert is sent as configured during the setup. Optionally, if SlackWebhookUrls was provided in the setup, failure notifications will also be sent to the appropriate channels.

  9. If the restore step fails, as part of failure handling, the instance/cluster rename (Step 4) is reverted to ensure that the original db-instance or db-cluster is available for use.

DBRestore failure handling depiction

-----------------------------------------------------

Contributing to Trapheus

Reference Code Structure

├── LICENSE.md                                        <-- The MIT license.
├── README.md                                         <-- The Readme file.
├── docs                                              <-- The Readme files
│   ├── README.fr.md
│   └── README.zh-CN.md
├── install.py
├── presentation
│   └── Trapheus.pptx
├── requirements.txt
├── screenshots                                       <-- Folder for screenshots of the state machine.
│   ├── Trapheus.gif
│   ├── Trapheus.png
│   ├── cluster_restore.png
│   ├── cluster_snapshot_branch.png
│   ├── failure_handling.png
│   ├── instance_restore.png
│   ├── instance_snapshot_branch.png
│   ├── isCluster_branching.png
│   └── restore_state_machine.png
├── setup.py
├── src
│   ├── checkstatus
│   │   ├── DBClusterStatusWaiter.py                  <-- Python Waiter(https://boto3.amazonaws.com/v1/documentation/api/latest/guide/clients.html#waiters) for checking the status of the cluster
│   │   ├── get_dbcluster_status_function.py          <-- Python Lambda code for polling the status of a clusterised database
│   │   ├── get_dbstatus_function.py                  <-- Python Lambda code for polling the status of a non clusterised RDS instance
│   │   └── waiter_acceptor_config.py                 <-- Config module for the waiters
│   ├── common                                        <-- Common modules across the state machine deployed as a AWS Lambda layer.
│   │   ├── common.zip
│   │   └── python
│   │       ├── constants.py                          <-- Common constants used across the state machine.
│   │       ├── custom_exceptions.py                  <-- Custom exceptions defined for the entire state machine.
│   │       └── utility.py                            <-- Utility module.
│   ├── delete
│   │   ├── cluster_delete_function.py                <-- Python Lambda code for deleting a clusterised database.
│   │   └── delete_function.py                        <-- Python Lambda code for deleting a non clusterised RDS instance.
│   ├── emailalert
│   │   └── email_function.py                         <-- Python Lambda code for sending out failure emails.
│   ├── export
│   │   ├── export_cluster_snapshot_s3_function.py    <-- Python Lambda code for exporting db cluster snapshot to S3.
│   │   └── export_snapshot_s3_function.py            <-- Python Lambda code for exporting db instance snapshot to S3.
│   ├── rename
│   │   ├── cluster_rename_function.py                <-- Python Lambda code for renaming a clusterised database.
│   │   └── rename_function.py                        <-- Python Lambda code for renaming a non-clusterised RDS instance.
│   ├── restore
│   │   ├── cluster_restore_function.py               <-- Python Lambda code for restoring a clusterised database.
│   │   └── restore_function.py                       <-- Python Lambda code for restoring a non-clusterised RDS instance
│   ├── slackNotification
│   │   └── slack_notification.py                     <-- Python Lambda code for sending out a failure alert to configured webhook(s) on Slack.
│   └── snapshot
│       ├── cluster_snapshot_function.py              <-- Python Lambda code for creating a snapshot of a clusterised database.
│       └── snapshot_function.py                      <-- Python Lambda code for creating a snapshot of a non-clusterised RDS instance.
├── template.yaml                                     <-- SAM template definition for the entire state machine.
└── tests
    └── unit
        ├── checkstatus
        │   ├── test_get_dbcluster_status_function.py
        │   └── test_get_dbstatus_function.py
        ├── delete
        │   ├── test_cluster_delete_function.py
        │   └── test_delete_function.py
        ├── emailalert
        │   └── test_email_function.py
        ├── export
        │   ├── test_export_cluster_snapshot_s3_function.py
        │   └── test_export_snapshot_s3_function.py
        ├── rename
        │   ├── test_cluster_rename_function.py
        │   └── test_rename_function.py
        ├── restore
        │   ├── test_cluster_restore_function.py
        │   └── test_restore_function.py
        ├── slackNotification
        │   └── test_slack_notification.py
        └── snapshot
            ├── test_cluster_snapshot_function.py
            └── test_snapshot_function.py

Prepare your environment. Install tools as needed.

  • Git Bash used to run Git from Command Line.
  • Github Desktop Git desktop tool for managing pull requests, branches and repos.
  • Visual Studio Code Full visual editor. Extensions for GitHub can be added.
  • Or an editor of your choice.
  1. Fork Trapheus repo

  2. Create a working branch.

    git branch trapheus-change1
  3. Confirm working branch for changes.

     git checkout trapheus-change1

    You can combine both commands by typing git checkout -b trapheus-change1.

  4. Make changes locally using an editor and add unit tests as needed.

  5. Run the test suite in the repo to ensure existing flows are not breaking.

       cd Trapheus
       python -m pytest tests/ -v #to execute the complete test suite
       python -m pytest tests/unit/checkstatus/test_get_dbstatus_function.py -v #to execute any individual test
  6. Stage edited files.

       git add contentfile.md 

    Or use git add . for multiple files.

  7. Commit changes from staging.

       git commit -m "trapheus-change1"
  8. Push new changes to GitHub

       git push --set-upstream origin trapheus-change1
  9. Verify status of branch

    git status

    Review Output to confirm commit status.

  10. The output of the push will provide a link to create your Pull Request.

Blogs

Maintainers

  1. Namita Devadas (@namitad)

Contributors

agnik7, anukaal, arun-raghav-s, bazzyadb, cawrites, dependabot[bot], despot, github-actions[bot], hariingit, hipstersmoothie, jdfalko, kokikathir, kshitiz305, kundai10, madhura-saw, namitad, nprajilesh, riyajohn, rociomontes, sakthishanmugam, sarthakgupta072, sculley, stationeros, su7edja, zahid07


trapheus's Issues

[Core] Add support having the snapshot id as an input to the state machine

Presently in the state machine, by default the snapshot identifier is expected to be of the format db-instance-identifier-snapshot.

As an enhancement, enable users to provide the snapshot id as part of the inputs to the state machine, not having to adhere to the default format as above. Also modify the lambdas to support reading of input across the various steps in the state machine.

Update failure notifications to have details in a structured format

Currently the email and Slack notifications sent out on pipeline failure contain text content with the error message as a plain string.

Update both functions to send the failure notification in the below format:

database id: <identifier name>
snapshot id: <snapshot identifier name>
failed step: <task name>
cause of failure: <error message>

Fix broken readme links to be compatible with the AWS blog


As part of the "Getting started with Trapheus" section in the AWS blog published on Trapheus (https://aws.amazon.com/blogs/opensource/what-is-trapheus/), there are links to specific sections of the readme. These links are currently broken in our GitHub repo.

Example:
Screenshot 2023-09-26 at 4 09 48 PM

Currently, the link in the screenshot above is https://github.com/intuit/Trapheus#pre-requisites and should redirect to the prerequisites section of our readme but the link in github is https://github.com/intuit/Trapheus#-pre-requisites.

The request is to fix all the GitHub links to ensure the ones mentioned in the blog work as expected.

Automate further the feature for exporting the snapshot to s3 in a region that doesn't support it.

Depends on:

  • issue-4 "export snapshot to s3 for a region that doesn't support it" case is done (the PR-34 is accepted)
  • issue-44 is done

Subtasks:

  1. Implement a StackSet so that you can deploy 2 stacks:
    1.1. the already existing template.yaml/deploy.yaml in the region that doesn't support the export of snapshots to s3 (like Frankfurt, eu-central-1). (You might need to link the resources of the template-unsupportedExportOfSnapshotToS3.yaml into the template.yaml.)
    1.2. a new template-unsupportedExportOfSnapshotToS3.yaml, for the region that supports export of snapshot to s3 like Ireland eu-west-1, that will contain:
    1.2.1. the kmskeyid (check the README.md->Pre-requisites for the structure of the copy snapshot kmskey), and
    1.2.2. the bucket:

    SnapshotsBucket:
      Type: AWS::S3::Bucket
      Condition: HasExportSnapshotSupportedRegion
      Properties:
        BucketName: !Join [ "", [ !Sub "rds-snapshots-${AWS::AccountId}-", !Ref ExportSnapshotSupportedRegion ] ]
    1.2.3. and any other region resources that are required to.
    (You can't specify the other region resources in the same template.yaml, hence a new template-unsupportedExportOfSnapshotToS3.yaml is required)
  2. Modify the E2E test to run without the steps for manually setting the kmskeyid and bucket in the export-supporting region (in other words, you would not need to do point 2.3 from issue-44).

Goal: Reduce the time the user takes to provide manual inputs in order to have a snapshot exported to s3 for a region that doesn't support it.

[Testing] refactor test files to improve code quality and coverage

Refactor the test files to remove code duplication of the common src utility files in the test folder (e.g. mock_utility, mock_constants).
Make use of patch to efficiently access and mock the utility files within the tests as well.

Support custom snapshot identifier based snapshot creation

As part of this issue, we would like to add an enhancement where users can provide a unique snapshot-identifier as an additional input parameter to the state machine, and the snapshot id provided as the value of this parameter will be used as the id for the new snapshot being created (when task=create_snapshot or task=create_snapshot_only).

Refer https://github.com/intuit/Trapheus#-execution for more details on task types

This issue is a follow up enhancement to #132 where the input parameter support will be added

Potential security issue with a possible injection

We are using os.system inside install.py:

https://github.com/intuit/Trapheus/blob/master/install.py#L17
https://github.com/intuit/Trapheus/blob/master/install.py#L44

This type of invocation is dangerous as it is vulnerable to various shell injection attacks. Great care should be taken to sanitize all input in order to mitigate this risk. Calls of this type are identified by the use of certain commands which are known to use shells.

Remediation
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using it is preferable to using os.system.

Example of secure code:

import subprocess

# Pass arguments as a list; with shell=False no shell ever interprets the
# input, so shell-injection via crafted arguments is not possible.
subprocess.call(["/bin/echo", "suspicious", "code"], shell=False)

[Documentation] Add missing existing feature parts to sections and reduce time for adoption by providing minimal example configurations

Add missing existing feature parts to sections:

  • Pre-Requisites section is missing VPC, RDS, SES, CFN S3 (without setting these up a user can get stuck)
  • Parameters section is missing CFN s3
  • Tear Down section is missing Pre-Requisites AWS resources that are not part of the stack. (making the user's aws repo unclean)

Reduce time for adoption by:

  • including minimal simple configurations of VPC and RDS for a quick start. It took me a week to set things up properly as there was nothing available. Let's make it less tricky for users and contributors to adopt Trapheus.
  • Cost considerations section is missing. People will adopt the tool more if they know how to try it without or less cost.
  • links to resources (for faster adoption from new contributors)

[Core] Minimal and lower cost Lambda-RDS access configuration

Provide a way for a user to generate a minimal and lower cost Lambda-RDS access configuration.

Context:
Currently, lambda functions have a VPC, subnets and security group linked to them when generated. Lambdas are connected to a VPC in your account, and by default the function can't access the RDS (or other services) if VPC doesn't provide access (either by routing outbound traffic to a NAT gateway in a public subnet, or having a VPC endpoint, both of which incur cost or require more setup).

How:
Provide a sam deploy parameter that will disable the linkage between the VPC/sg and will let lambdas run in a default Lambda owned VPC that has access to RDS (and other AWS services).

Issue:
Typically, linking your own VPC to the lambdas without additional NAT gateway or VPC endpoint configuration will result in a "Connect timeout on endpoint URL: "https://rds.[region].amazonaws.com/"" or "Task timed out after 3.00 seconds" error.
Typical output from the create snapshot lambda in this case is:
Connect timeout on endpoint URL: "https://rds.eu-west-1.amazonaws.com/"
Traceback (most recent call last):
  File "/var/task/snapshot_function.py", line 22, in lambda_create_dbinstance_snapshot
    util.eval_snapshot_exception(error, instance_id, rds)
  File "/opt/python/utility.py", line 71, in eval_snapshot_exception
    raise custom_exceptions.SnapshotCreationException(error_message)

Support custom snapshot identifier based db restoration

Currently Trapheus supports a limited set of input parameters to the state machine, as defined here: https://github.com/intuit/Trapheus#-execution. The identifier parameter refers to the rds instance/cluster id.

The state machine works on the assumption/prerequisite that the restoration will be done using a snapshot with identifier in the format <identifier-snapshot> or snapshot creation will create/update snapshot with identifier as <identifier-snapshot>.

As part of this issue, we would like to add an enhancement where users can provide a unique snapshot-identifier as an additional input parameter to the state machine and the snapshot id provided as a value for this parameter be used for the db restoration process (when task=db_restore)

Refer https://github.com/intuit/Trapheus#-execution for more details on task types

Parallelise the alert and rollback steps in the state machine

Presently in the state machine, in any failure scenario, there is an email and optional slack alert configured.
Post the alert, a rollback is initiated.
Therefore in scenarios of failure in the alert mechanisms, the rollback is also blocked.

https://github.com/intuit/Trapheus/blob/master/screenshots/failure_handling.png

Enhance this sequential part of the state machine such that the alert and rollback happens in parallel.

Resources to start off: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html
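
A minimal sketch of what the parallel branch could look like in Amazon States Language. The state names and Lambda ARNs below are illustrative placeholders, not the actual Trapheus definitions:

```json
"AlertAndRollback": {
  "Type": "Parallel",
  "Branches": [
    {
      "StartAt": "SendFailureAlerts",
      "States": {
        "SendFailureAlerts": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:EmailAlert",
          "End": true
        }
      }
    },
    {
      "StartAt": "RollbackRename",
      "States": {
        "RollbackRename": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:RenameDBInstance",
          "End": true
        }
      }
    }
  ],
  "End": true
}
```

With a Parallel state, a failure in the alert branch no longer blocks the rollback branch, since the branches execute concurrently.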

Use DNS CNAME to reduce client side impact during recovery in high scaled use cases

Problem
Trapheus today incurs downtime while doing a restoration. The downtime arises primarily because AWS does not let us restore a database with the same endpoint. In order to keep parity, Trapheus renames the database to a temporary one and restores it from a snapshot. Trapheus assumes we do this restoration during a maintenance window, or that applications handle database connection failures gracefully with retries using exponential backoff. This enables an application to recover when a database connection becomes unavailable during a restore, without unexpectedly crashing.

For more nuanced use cases where the above assumption can't hold true, for instance in highly scaled scenarios, there are special considerations when restoring an RDS DB instance to replace a live one. For example, we must determine how to redirect existing database clients to the new instance with minimal interruption and modification. We must also ensure continuity and consistency of the data by considering the restored data time and the recovery time at which the new instance begins receiving writes.

Solution
We can create a separate DNS CNAME record that points to our DB instance endpoint and have our clients use this DNS name. Then we can update the CNAME to point to a new, restored endpoint without having to update our database clients.
Set the Time to Live (TTL) for the CNAME record to an appropriate value. The TTL determines how long the record is cached by DNS resolvers before another request is made. Note that some DNS resolvers or applications might not honor the TTL and might cache the record for longer. For Amazon Route 53, a longer value (for example, 172800 seconds, or two days) reduces the number of calls that DNS recursive resolvers must make to Route 53 to fetch the latest record, which reduces latency and the Route 53 bill; for a planned restore, however, a short TTL lets clients pick up the new endpoint quickly. For more information, see How Amazon Route 53 routes traffic for your domain.
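As a sketch of the repointing step, assuming boto3 and a hosted zone we control. The Route 53 change_resource_record_sets call and UPSERT action are real APIs; the function names, zone id, record name, and target below are placeholders:

```python
def cname_upsert_batch(record_name, target, ttl=60):
    """Build a Route 53 change batch that repoints record_name at target via UPSERT."""
    return {
        "Comment": "Repoint clients to the restored DB endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }],
    }

def repoint(zone_id, record_name, target, ttl=60):
    """Apply the change batch; requires AWS credentials with Route 53 access."""
    import boto3  # imported lazily so the pure helper above stays dependency-free
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch=cname_upsert_batch(record_name, target, ttl),
    )
```

Clients connect to the CNAME (e.g. db.internal.example.com) rather than the raw RDS endpoint, so a single UPSERT shifts all traffic after a restore.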

Literature
Applications and client operating systems might also cache DNS information that we have to flush, or restart the client, to trigger a new DNS resolution and retrieve the updated CNAME record. When we initiate a database restore and shift traffic to the restored instance, we need to verify that all our clients are writing to the restored instance instead of the prior one.
The data architecture might support restoring the database, updating DNS to shift traffic to the restored instance, and then remediating any data still being written to the prior instance, after which all access goes through the newly restored instance. This allows a gradual decommissioning approach: first limit access to the running prior database via its security group, then stop the instance when it is no longer needed.

Separate create_snapshot from db_restore so latter is not run automatically

Hi, I have an enhancement request for Trapheus.
We would like to use Trapheus to create and export a snapshot from one AWS environment, transfer the exported snapshot to a target environment (a different account), and then restore the snapshot there.

Currently the create_snapshot task always runs into the db_restore part automatically.
To achieve what we need, there needs to be a way to separate the create_snapshot task from db_restore, so that db_restore does not run after a successful create_snapshot.

Maybe there are various ways to do that, but the simplest one would be a small (one-liner) modification to the step function in template.yaml:

```json
"StartClusterSnapshotExportTask": {
    "Type": "Task",
    "Resource": "${ClusterSnapshotExportLambdaArn}",
    "Next": "ClusterRename",
```
by changing ClusterRename to RestorePipelineComplete
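With that one-liner applied, the state would read (everything unchanged except the Next value):

```json
"StartClusterSnapshotExportTask": {
    "Type": "Task",
    "Resource": "${ClusterSnapshotExportLambdaArn}",
    "Next": "RestorePipelineComplete",
```

This ends the execution after the export instead of proceeding to the cluster rename and restore steps.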

We would like to use the code as-is (not maintain a fork), so we would like to see this change in the project itself.
Or maybe introduce a create_snapshot_only task/action and extend the decision logic to stop the step function after having exported the RDS snapshot to S3?

Please give feedback on whether such a modification would be possible.
Regards, Peter

Move all lambda functions to use Graviton2 for cost optimisation

Our lambda function definition here https://github.com/intuit/Trapheus/blob/master/template.yaml#L178C24-L178C24 needs to be moved to Graviton.

Lambda offers two instruction set architectures in most AWS Regions: x86 and arm64. Lambda functions powered by Graviton2 are designed to deliver up to 19 percent better performance at 20 percent lower cost.

We should move our functions to explicitly use Graviton2 (see example below):

```yaml
MyFunction:
  Type: AWS::Serverless::Function
  Properties:
    Architectures:
      - arm64  # use Graviton2
```

[No-Code] Addition of new blogs

  • We will accept the PR as hacktoberfest-accepted for this issue if it adds new blogs to the documentation.
  • You can write a blog on Trapheus/TrapheusAI in markdown or on Medium. We will feature it as a Community Spotlight along with your name.

Create Definition of Done (DoD) document

Create a DoD document in .github folder, that will contain all the prerequisites for a feature to be considered Done by the implementer/reviewer.

Those would entail:

  1. Have a UT for the feature covering the happy path and the exception paths, with at least 85% line coverage.
  2. No duplicated or redundant code (even in UTs). If code is duplicated, extract and refactor it until it is non-redundant.
  3. If you touch a LoC, you are expected to fix or create a UT for it following the above rules (even if this was missed by a previous implementer/reviewer).
  4. Appropriate documentation for the feature in the README file.
  5. Other?

Allow the user an option to choose between models for completion API.

Have a look at the TrapheusAI demo https://github.com/intuit/Trapheus/tree/master/labs/TrapheusAI#-demo

We want the user to be given an option to choose between LLM models, enabling a type of accuracy and exactness beyond purely language-related tasks.

Also, the dataset might contain embedded contextual information (schema, abbreviations) that is not standardised across datasets for the same search query, and hence might not be learned equally well by all models.

Hence, allow users to choose a model and iterate on the search.

https://github.com/intuit/Trapheus/blob/master/labs/TrapheusAI/llm/model.py#L22

Create a jupyter notebook workshop

Create a Jupyter notebook for Trapheus that can demonstrate the execution commands in a sandbox environment, thus simplifying the current installation process.

Explore the pre version of a major upgrade from open ai

Have a look at the TrapheusAI demo https://github.com/intuit/Trapheus/tree/master/labs/TrapheusAI#-demo

OpenAI is preparing a major version upgrade https://github.com/openai/openai-python#beta-release . We have a dependency on OpenAI https://github.com/intuit/Trapheus/blob/master/labs/TrapheusAI/requirements.txt#L2 for LLMs.

We can install a pre-release version of the upgrade with `pip install --pre openai`

This issue is to explore whether anything breaks in the TrapheusAI app because of the upgrade, after the above installation.
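While testing the pre-release, one low-risk way to gate code paths is to branch on the installed major version. This is a generic sketch: installed_major_version is a hypothetical helper name, and wiring it into TrapheusAI is an assumption:

```python
import importlib.metadata

def installed_major_version(package):
    """Return the installed package's major version as an int, or None if not installed."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])

# e.g. branch TrapheusAI's client code on installed_major_version("openai"),
# since the beta changes the client interface between major versions
```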

Automate installer preferences for Trapheus

Currently the installer file for Trapheus https://github.com/intuit/Trapheus/blob/master/install.py asks user for the following

  1. region
  2. s3_bucket
  3. VPC
  4. Subnets
  5. notification preferences such as slack webhooks and SES email configs

The ask is to make the installer easier: for instance, when the user enters a region, use AWS CLI/SDK commands to fetch the S3 buckets, VPCs, subnets, etc., and present the user an option to pick any of them, or create a new one, from the command line itself.
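A sketch of the pick-or-create prompt. The boto3 discovery calls (list_buckets, describe_vpcs) are real APIs, but the function names and the wiring into install.py are illustrative assumptions:

```python
def choose(prompt, options):
    """Print numbered options; return the picked option, or free text as a new resource name."""
    for i, name in enumerate(options, 1):
        print(f"  {i}. {name}")
    raw = input(f"{prompt} [1-{len(options)}, or type a new name]: ").strip()
    if raw.isdigit() and 1 <= int(raw) <= len(options):
        return options[int(raw) - 1]
    return raw  # treat free text as the name of a resource to create

def discover_and_choose(region):
    """Discover existing resources in the region and prompt the user to pick them."""
    import boto3  # imported lazily; discovery requires AWS credentials
    s3 = boto3.client("s3", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    buckets = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    vpcs = [v["VpcId"] for v in ec2.describe_vpcs()["Vpcs"]]
    return choose("s3_bucket", buckets), choose("VPC", vpcs)
```

The same pattern would extend to subnets (describe_subnets) and the notification preferences.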

Trapheus exporting and restoring to other AWS accounts

Is it possible to use Trapheus to automate export + restore databases to different AWS accounts?

For example, I would like to run create_snapshot_only on Account A and run db_restore on Account B using Account A's snapshot.

Add Spanish translation of docs

Please ensure your issue adheres to the following guidelines:

  • Use the following format: * [owner/repo](link)
  • Link additions should be added to the bottom of the relevant category.
  • New categories or improvements to the existing categorization are welcome.
  • Search previous suggestions before making a new one, as yours may be a duplicate.
  • Sort by alphabetical order

Thanks for contributing!

Addition of Tags causing deployment failures for Trapheus CFT

Multiple failures due to the addition of tags for resources in template.yaml:

  1. A securityGroup tag was incorrectly added for AWS::Serverless::LayerVersion (tags are not supported for resources of this type).
  2. Indentation issues cause stack creation to be stuck at REVIEW_IN_PROGRESS:
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L542
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L569
    https://github.com/intuit/Trapheus/blob/master/template.yaml#L596
  3. The expected format for Tags is a list with Key and Value attributes; the current format results in deployment errors.
