duo-labs / cloudtrail-partitioner

License: BSD 3-Clause "New" or "Revised" License

JavaScript 30.77% Python 69.23%

cloudtrail-partitioner's Issues

Partitioner lambda function failing due to YAML import

My Partitioner function is failing with the error below:
[ERROR] Runtime.ImportModuleError: Unable to import module 'main': No module named 'yaml'

I commented out the import yaml line, as well as the lines that try to read the yaml config file, and it seems to have resolved the issue.
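A less invasive workaround than deleting the config code might be to make the dependency optional. The following is only a sketch under assumptions: `load_config` and the `config.yaml` file name are hypothetical and not taken from the project. Bundling PyYAML into the Lambda deployment package would be the other way to address the error.

```python
import os

try:
    import yaml  # PyYAML; the error above shows it is missing from the package
except ImportError:
    yaml = None


def load_config(path="config.yaml"):
    """Return the parsed config, or {} when PyYAML or the file is unavailable."""
    if yaml is None or not os.path.exists(path):
        return {}
    with open(path) as handle:
        # safe_load returns None for an empty file, so fall back to {}.
        return yaml.safe_load(handle) or {}
```

With this shape the function still imports and runs on defaults when the module is absent, instead of failing at import time.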

Partition projection: possible alternative to Cloudtrail Partitioner

It looks like AWS released partition projection for Athena a week or so ago. The announcement discusses the performance benefits (apparently Athena doesn't need to call the Glue API to list partitions), but I feel the real benefit is avoiding the need to create partitions at all.

For example, here's how I set it up for my CloudTrail logs:

CREATE EXTERNAL TABLE cloudtrail_logs_auto (
	eventversion STRING,
	-- ...trimmed for clarity
	sharedeventid STRING,
	vpcendpointid STRING
)
PARTITIONED BY (accountId string, region string, date string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/AWSLogs/o-orgid/'
TBLPROPERTIES(
	"projection.enabled" = "true",
	"projection.date.type" = "date",
	"projection.date.range" = "2020/01/01,NOW",
	"projection.date.format" = "yyyy/MM/dd",
	"projection.date.interval" = "1" ,
	"projection.accountid.type" = "enum",
	"projection.accountid.values" = "012345678901,210987654321,etc",
	"projection.region.type" = "enum",
	"projection.region.values" = "us-east-1,us-west-2,ap-southeast-2,etc",
	"storage.location.template" = "s3://bucketname/AWSLogs/o-orgid/${accountid}/CloudTrail/${region}/${date}"
);

This allows me to run queries like SELECT * FROM cloudtrail_logs_auto where date >= '2020/02/28' and date < '2020/06/01', with or without filters on the accountId and region columns. A few things worth noting:

- You have to enumerate every possible account ID. Account IDs that aren't listed in the table properties get ignored, so you'd want to update the table properties when new accounts are added.
- Likewise, you have to enumerate all regions of interest.
- The NOW in projection.date.range means "now" at the time the query is executed, so new projected partitions automatically appear each day.
- SHOW PARTITIONS cloudtrail_logs_auto will yield no results.

I haven't yet found a gotcha or a reason not to use this new functionality. As long as there aren't any show-stopper issues, it feels like this changes the need for cloudtrail-partitioner: it could instead keep the projection.accountid.values and projection.region.values properties up to date. Thoughts?
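Keeping those two properties current could be as small as regenerating one ALTER TABLE statement. A minimal sketch, assuming the table from the example above; `projection_update_sql` is a hypothetical helper, and submitting the statement to Athena is left out:

```python
def projection_update_sql(table, account_ids, regions):
    """Build an ALTER TABLE statement that refreshes the enum projection values."""
    # De-duplicate and sort so the generated statement is stable between runs.
    accounts = ",".join(sorted(set(account_ids)))
    region_list = ",".join(sorted(set(regions)))
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'projection.accountid.values' = '{accounts}', "
        f"'projection.region.values' = '{region_list}')"
    )
```

The account and region lists would have to come from somewhere else, for example the organization's account listing; that part is not shown here.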

Link to CloudTrail Lake as alternative?

AWS recently launched a service which wraps a lot of this up for you: CloudTrail Lake.

Does it make sense to point people there up front, so that they turn to this project only if they want to manage things themselves?

Check for the query results bucket

Ensure the query results bucket exists and is owned by this account (to detect bucket sniping attacks).
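One possible shape for this check is a HeadBucket call with the ExpectedBucketOwner parameter, which S3 rejects when the bucket belongs to a different account. This is a sketch, not the project's code: `results_bucket_is_ours` is a hypothetical name, and `s3_client` is assumed to be a boto3 S3 client.

```python
def results_bucket_is_ours(s3_client, bucket, account_id):
    """Return True only if the bucket exists and is owned by account_id."""
    try:
        # S3 fails the request if the bucket owner differs from account_id.
        s3_client.head_bucket(Bucket=bucket, ExpectedBucketOwner=account_id)
        return True
    except Exception as error:
        # botocore's ClientError exposes the HTTP status under response["Error"]["Code"].
        response = getattr(error, "response", None) or {}
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("403", "404"):
            return False
        # Anything else (throttling, network errors) is not an ownership answer.
        raise
```

A False result covers both cases the issue cares about: the bucket is missing, or it exists but is not accessible as a bucket owned by this account.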

Perform ETL job to merge files

This would be a big change for this project. Athena falls over when it tries to read too many small files (apparently it crashes due to rate limiting). CloudTrail log files are often only a few KB in size in less active accounts, while Athena reportedly works best when it reads files of about 64 MB. A nightly ETL job could take the previous day's log files and concatenate them into 64 MB files, possibly in a separate S3 bucket.

I'm unsure about doing this. It was part of the feedback I received from the Athena team for problems I was running into with a client. I'm more in the camp that Athena should be fixed, and not that I need to build an ETL job to work around its limitations.
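If such a job were built, its planning step could look roughly like the sketch below, which only groups a day's objects into batches near the 64 MB target. `plan_merge_batches` is a hypothetical helper; reading, merging, and rewriting the log files themselves is not shown.

```python
TARGET_BYTES = 64 * 1024 * 1024  # the 64 MB figure mentioned above


def plan_merge_batches(objects, target_bytes=TARGET_BYTES):
    """Group (key, size) pairs into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for key, size in objects:
        # Close the current batch when adding this object would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

An object larger than the target ends up in a batch of its own, so nothing is dropped.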

The org master account does not seem to be queryable

In an Organization trail, if you have account 000000000000 as your Org master, and say 111111111111 as another account, then your S3 bucket will contain:

- AWSLogs
  - 000000000000
  - o-123
    - 000000000000
    - 111111111111

Note that the 000000000000 account has two prefixes dedicated to it. The first prefix (/AWSLogs/000000000000) is empty. The real logs are at /AWSLogs/o-123/000000000000. It looks like the code identifies the first prefix as the one to make queries against, when it should use the one that is a child of the org key.
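The intended behaviour might be expressed as follows. This is a sketch under assumptions: `account_log_prefix` is a hypothetical helper, and `top_level_prefixes` is assumed to hold the names found directly under AWSLogs/ in the bucket.

```python
def account_log_prefix(account_id, top_level_prefixes):
    """Pick the prefix that holds an account's logs, preferring the org path."""
    org_ids = [name for name in top_level_prefixes if name.startswith("o-")]
    if org_ids:
        # Organization trail: the real logs sit under AWSLogs/<org id>/<account id>/.
        return f"AWSLogs/{org_ids[0]}/{account_id}/"
    return f"AWSLogs/{account_id}/"
```

For the layout shown above, the master account would then resolve to the org child path rather than the empty top-level one.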

Document privileges needed

Document privileges needed by the users and for setting this up.

Allow null log_path_prefix

The recent PR that was merged enforces a trailing / on log_path_prefix. When that value is empty, that creates a problem.
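A normalization along these lines would let an empty value through unchanged. It is a sketch, assuming the enforced character is a trailing slash; `normalize_log_path_prefix` is a hypothetical name, not a function from the project.

```python
def normalize_log_path_prefix(log_path_prefix):
    """Return "" for an empty prefix, otherwise the prefix with one trailing slash."""
    if not log_path_prefix:
        # Covers both None and the empty string: no prefix means no slash.
        return ""
    return log_path_prefix.rstrip("/") + "/"
```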

Using AWS Organisation

Hi, when using AWS Organizations the log file path needs to be changed. I added the organisation ID after AWSLogs and redeployed, but I think it would be best to let the config handle it.

209c209
<     log_path_prefix = log_path_prefix + "AWSLogs/"
---
>     log_path_prefix = log_path_prefix + "AWSLogs/[AWSORGANISATIONID]/"
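Letting the config handle it could mean an optional setting that is appended only when present. A minimal sketch, assuming a hypothetical `organization_id` config value and a hypothetical helper name `awslogs_root`:

```python
def awslogs_root(log_path_prefix, organization_id=None):
    """Build the AWSLogs path, adding the organization ID only when configured."""
    root = log_path_prefix + "AWSLogs/"
    if organization_id:
        # Organization trails write under AWSLogs/<organization id>/.
        root += organization_id + "/"
    return root
```

Accounts without an organization trail would leave the setting unset and keep the current path.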

Set a default region

If you have a large environment with many AWS accounts, you won't want to run this from your laptop, as the initial setup will take hours. If you run this from an EC2 instance, no default region is set. We should look up the region of the S3 bucket and set the default region to that.
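The lookup could use GetBucketLocation, which reports no LocationConstraint for buckets in us-east-1 and may report the legacy value EU for eu-west-1. A sketch, assuming `s3_client` is a boto3 S3 client; `bucket_region` is a hypothetical helper name.

```python
def bucket_region(s3_client, bucket):
    """Return the region name of an S3 bucket using GetBucketLocation."""
    location = s3_client.get_bucket_location(Bucket=bucket).get("LocationConstraint")
    if not location:
        # S3 returns a null LocationConstraint for us-east-1 buckets.
        return "us-east-1"
    if location == "EU":
        # Legacy value that S3 may still return for eu-west-1 buckets.
        return "eu-west-1"
    return location
```

The returned name could then be passed as `region_name` when creating the other clients.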
