
cloudtrail-partitioner's People

Contributors

0xdabbad00, andresriancho, dependabot[bot], jordan-wright, klaus993, kylelady


cloudtrail-partitioner's Issues

Partitioner lambda function failing due to YAML import

My Partitioner function is failing with the error below:
[ERROR] Runtime.ImportModuleError: Unable to import module 'main': No module named 'yaml'

I commented out the import yaml line, as well as the lines that try to read the yaml config file, and it seems to have resolved the issue.
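A less invasive workaround than deleting the import might be to treat PyYAML as optional. The sketch below is not the project's actual code: `load_config`, its `path` argument, and the `defaults` dictionary are hypothetical names, and it assumes the function can run with default settings when the YAML file cannot be read.

```python
try:
    import yaml  # PyYAML; must be bundled in the deployment package to be importable
except ImportError:
    yaml = None  # fall back to defaults instead of failing at import time


def load_config(path, defaults=None):
    """Return settings from a YAML file, or the defaults if PyYAML or the file is missing."""
    config = dict(defaults or {})
    if yaml is None:
        return config
    try:
        with open(path) as f:
            loaded = yaml.safe_load(f) or {}
    except FileNotFoundError:
        return config
    config.update(loaded)
    return config
```

The longer-term fix is presumably to package PyYAML with the function so the import succeeds.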

Partition projection: possible alternative to Cloudtrail Partitioner

It looks like AWS released partition projection for Athena a week or so ago. AWS discusses the performance benefits (apparently Athena doesn't need to call the Glue API to list partitions), but I feel the real benefit is avoiding the need to create partitions.

For example, here's how I set it up for my CloudTrail logs:

CREATE EXTERNAL TABLE cloudtrail_logs_auto (
	eventversion STRING,
	-- ...trimmed for clarity
	sharedeventid STRING,
	vpcendpointid STRING
)
PARTITIONED BY (accountId string, region string, date string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/AWSLogs/o-orgid/'
TBLPROPERTIES(
	"projection.enabled" = "true",
	"projection.date.type" = "date",
	"projection.date.range" = "2020/01/01,NOW",
	"projection.date.format" = "yyyy/MM/dd",
	"projection.date.interval" = "1" ,
	"projection.accountid.type" = "enum",
	"projection.accountid.values" = "0123456789012,210987654321,etc",
	"projection.region.type" = "enum",
	"projection.region.values" = "us-east-1,us-west-2,ap-southeast-2,etc",
	"storage.location.template" = "s3://bucketname/AWSLogs/o-orgid/${accountid}/CloudTrail/${region}/${date}"
);

This allows me to run queries like SELECT * FROM cloudtrail_logs_auto where date >= '2020/02/28' and date < '2020/06/01', with or without accountId and region columns, etc. A few things worth noting:

  • You have to enumerate every possible account ID. Account IDs that aren't listed in the table properties get ignored, so you'd want to update the table properties when new accounts are added.
  • Likewise, you have to enumerate all regions of interest.
  • The NOW in projection.date.range means "now" in terms of when the query is executed. So new projected partitions automatically appear each day.
  • SHOW PARTITIONS cloudtrail_logs_auto will yield no results.

I haven't yet found a gotcha or a reason not to use this new functionality. As long as there aren't any show-stopper issues, it feels like this changes the need for cloudtrail-partitioner: its job could instead become keeping the projection.accountid.values and projection.region.values properties up to date. Thoughts?
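As a rough illustration of what keeping those two properties up to date could look like, here is a minimal sketch that only builds the Athena statement. The function name `projection_update_statement` is hypothetical, and how the account and region lists are gathered (for example from AWS Organizations) and how the statement is submitted to Athena are left out.

```python
def projection_update_statement(table, account_ids, regions):
    """Build an ALTER TABLE statement that refreshes the enum projection values."""
    # De-duplicate and sort so the generated statement is stable between runs.
    accounts = ",".join(sorted(set(account_ids)))
    region_list = ",".join(sorted(set(regions)))
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'projection.accountid.values' = '{accounts}', "
        f"'projection.region.values' = '{region_list}')"
    )


if __name__ == "__main__":
    print(projection_update_statement(
        "cloudtrail_logs_auto",
        ["111111111111", "222222222222"],
        ["us-east-1", "us-west-2"],
    ))
```

Running the statement whenever an account or region is added would replace the per-day partition creation the tool does today.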

Perform ETL job to merge files

This would be a big change for this project. Athena falls over when it tries to read too many small files (apparently it crashes due to rate limiting). CloudTrail log files are often only a few KB in size in less active accounts, whereas Athena reportedly works best when it reads files of about 64MB. A nightly ETL job could take the previous day's log files and concatenate them into 64MB files, possibly in a separate S3 bucket.

I'm unsure about doing this. It was part of the feedback I received from the Athena team for problems I was running into with a client. I'm more in the camp that Athena should be fixed, and not that I need to build an ETL to work around its limitations.
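If such a job were built, its planning step could look something like the sketch below. It only groups objects into batches of roughly the 64MB mentioned above; `plan_batches` is a hypothetical name, the input is assumed to be a list of (key, size in bytes) pairs from an S3 listing, and actually reading, merging, and writing the log files is not shown.

```python
TARGET_BYTES = 64 * 1024 * 1024  # the 64MB figure mentioned above


def plan_batches(objects, target_bytes=TARGET_BYTES):
    """Group (key, size) pairs into batches whose total size stays at or under target_bytes.

    An object that is larger than target_bytes on its own gets a batch to itself.
    """
    batches, current, current_size = [], [], 0
    for key, size in objects:
        # Start a new batch when adding this object would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch is a list of keys that would be merged into one output file.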

The org master account does not seem to be queryable

In an Organization trail, if you have account 000000000000 as your Org master, and say 111111111111 as another account, then your S3 bucket will contain:

- AWSLogs
  - 000000000000
  - o-123
    - 000000000000
    - 111111111111

Note that the 000000000000 account has two prefixes dedicated to it. The first (/AWSLogs/000000000000) is empty. The real logs are at /AWSLogs/o-123/000000000000. It looks like the code identifies the first prefix as the one to query, whereas it should use the one that is a child of the org key.
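One way to express the intended behaviour is sketched below: when an account appears both directly under AWSLogs and under the org key, the org path wins. This is an illustration rather than the project's code; `pick_account_prefixes` is a hypothetical name, and the input is assumed to be a flat list of key prefixes already retrieved from S3.

```python
def pick_account_prefixes(prefixes):
    """Map each account ID to its log prefix, preferring the one under the org key."""
    chosen = {}
    for prefix in prefixes:
        parts = prefix.strip("/").split("/")
        account_id = parts[-1]
        if not (account_id.isdigit() and len(account_id) == 12):
            continue  # skip entries that are not account IDs, such as the org key itself
        under_org = len(parts) >= 2 and parts[-2].startswith("o-")
        # An org-level prefix always wins; a top-level one is only a fallback.
        if under_org or account_id not in chosen:
            chosen[account_id] = prefix
    return chosen
```

With the layout shown above, both accounts would resolve to their paths under o-123.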

Allow null log_path_prefix

The recent PR that was merged enforced a \ on log_path_prefix. When that value is empty (``), that creates a problem.
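A possible shape for the fix is sketched below. It assumes the enforced separator is a trailing `/` (the S3 key separator) and that an empty or missing value should stay empty; `normalize_prefix` is a hypothetical helper, not a function from the project.

```python
def normalize_prefix(log_path_prefix):
    """Return the prefix with exactly one trailing slash, or "" when it is empty or None."""
    if not log_path_prefix:
        return ""  # an empty prefix means the logs sit at the bucket root
    return log_path_prefix.rstrip("/") + "/"
```

With this, `normalize_prefix("") + "AWSLogs/"` yields `AWSLogs/` rather than a path with a leading separator.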

Using AWS Organisation

Hi, when using AWS Organisations the log file path needs to be changed. I added the organisation ID after AWSLogs and redeployed, but I think it would be best to let the config handle it.

209c209
<     log_path_prefix = log_path_prefix + "AWSLogs/"
---
>     log_path_prefix = log_path_prefix + "AWSLogs/[AWSORGANISATIONID]/"
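Letting the config handle it could look roughly like this. The `organization_id` parameter stands in for a hypothetical config setting that does not exist in the project today; when it is unset, the path matches the original line in the diff above.

```python
def build_log_path_prefix(log_path_prefix, organization_id=None):
    """Append AWSLogs/ and, when configured, the organisation ID to the prefix."""
    path = log_path_prefix + "AWSLogs/"
    if organization_id:
        # Organization trails write under AWSLogs/<org id>/<account id>/...
        path += organization_id + "/"
    return path
```

For example, `build_log_path_prefix("", "o-123")` gives `AWSLogs/o-123/`, so no code change or redeploy with a hand-edited path would be needed.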

Set a default region

If you have a large environment with many AWS accounts, you won't want to run this from your laptop, as the initial setup will take hours. If you run this from an EC2 instance, no default region is set. We should get the location of the S3 bucket, and then set the region to that.
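A minimal sketch of the lookup, assuming a boto3 S3 client is passed in. The helper names are hypothetical; the one S3 detail relied on is that `get_bucket_location` reports no `LocationConstraint` for buckets in us-east-1 and the legacy value `EU` for some eu-west-1 buckets.

```python
def region_from_location_constraint(constraint):
    """Translate an S3 LocationConstraint value into a region name."""
    if not constraint:
        return "us-east-1"  # S3 reports no constraint for us-east-1 buckets
    if constraint == "EU":
        return "eu-west-1"  # legacy value still returned for some older buckets
    return constraint


def bucket_region(s3_client, bucket):
    """Look up the bucket's region using a boto3-style S3 client."""
    response = s3_client.get_bucket_location(Bucket=bucket)
    return region_from_location_constraint(response.get("LocationConstraint"))
```

The result could then be used as the default region, for instance by passing it as `region_name` when creating the other clients.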
