Serverless Datalake Example/Framework: Best Practices for Serverless Data Lake Engineering on AWS Glue/Athena/MWAA (Airflow)/S3


1. Build

To build this project, you need the JDK and Maven installed locally, and you should also have an AWS account with an Admin role.

  1. Check out the project.
  2. Update src/main/profiles/prd.properties, replacing all "<...>" values with ones for your environment.
  3. Run the Maven command under the project root directory:
mvn clean package
  4. Pick up the serverless-datalake-example-1.0.zip file under the target folder (a consolidated sketch of these steps follows below).
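
A consolidated sketch of the build flow, assuming you check the project out from its GitHub repository (adjust the placeholder edits to your environment):

# 1. Check out the project
git clone https://github.com/bluishglc/serverless-datalake-example.git
cd serverless-datalake-example

# 2. Replace every "<...>" placeholder in the production profile
vi src/main/profiles/prd.properties

# 3. Build the installer package from the project root directory
mvn clean package

# 4. The installer zip is produced under the target folder
ls target/serverless-datalake-example-1.0.zip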

2. Install

There are two ways to get the installer package: build it from source as in the step above, or download it directly:

wget --tries=10 --timeout=10 https://github.com/bluishglc/serverless-datalake-example/releases/download/v1.0/serverless-datalake-example-1.0.zip
unzip serverless-datalake-example-1.0.zip

Then run the install command:

./serverless-datalake-example-1.0/bin/install.sh \
    --region <your-aws-region> \
    --app-bucket <your-app-bucket-name> \
    --data-bucket <your-data-bucket-name> \
    --airflow-dags-home s3://<your-airflow-dags-path> \
    --access-key-id '<your-access-key-id>' \
    --secret-access-key '<your-secret-access-key>' \
    --nyc-tlc-access-key-id '<your-global-account-access-key-id>' \
    --nyc-tlc-secret-access-key '<your-global-account-secret-access-key>'

Note: the CLI parameters will override the values in the prd/dev properties files.

3. Create Data Repo (China Region Only)

Because the nyc-tlc data sets are hosted on global S3, they are unreachable with a China-region account's AK/SK, so we need to download some of the CSV files to local disk first and then upload them to an S3 bucket in the China region. A sketch of this flow is shown below.
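
A minimal sketch of that download-then-upload flow, assuming the global account's AK/SK are supplied as environment variables for the download step, your default credentials point to the China-region account, and the trip category, months, and target prefix are only illustrative:

# Download a few months of one trip category from the global backup bucket,
# then upload each file to the China-region data bucket
category=yellow
for month in 01 02 03; do
    AWS_ACCESS_KEY_ID='<your-global-account-access-key-id>' \
    AWS_SECRET_ACCESS_KEY='<your-global-account-secret-access-key>' \
    aws s3 cp "s3://nyc-tlc/csv_backup/${category}_tripdata_2020-${month}.csv" "/tmp/nyc-tlc/" --region us-east-1

    aws s3 cp "/tmp/nyc-tlc/${category}_tripdata_2020-${month}.csv" "s3://<your-data-bucket-name>/nyc-tlc/"
done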

4. Init

This step creates the Glue crawlers and jobs, plus the Data Catalog databases and tables.

sdl.sh init
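
After the init step completes, a quick way to confirm the Glue resources exist is with the standard AWS CLI; a minimal check that makes no assumption about the resource names:

# List the Glue crawlers, jobs, and Data Catalog databases created by the init step
aws glue get-crawlers --query 'Crawlers[].Name' --region <your-aws-region>
aws glue get-jobs --query 'Jobs[].Name' --region <your-aws-region>
aws glue get-databases --query 'DatabaseList[].Name' --region <your-aws-region>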

5. Run

There are two ways to run this project: via Airflow or via the CLI. For Airflow, you must have a running Airflow environment with a configured SSH connection named ssh_to_client that can reach the current node over SSH. Then copy wfl/sdl_monthly_build.py to Airflow's DAG folder, or pass that path via --airflow-dags-home in the install command (see the sketch at the end of this section). Once everything is in place, a DAG named sdl-monthly-build will appear, and you can start it from the Airflow console. Alternatively, you can run this project immediately via the CLI as follows:

./serverless-datalake-example-1.0/bin/sdl.sh build --year 2020 --month 01

This command runs a full batch over the 2020/01 data.
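
For the Airflow (MWAA) path, copying the DAG file to the DAGs location could look like the following sketch; the S3 path is the one you passed as --airflow-dags-home, so adjust it to your environment:

# Upload the monthly-build DAG to the Airflow DAGs folder on S3;
# once MWAA syncs the folder, the sdl-monthly-build DAG appears in the console
aws s3 cp wfl/sdl_monthly_build.py s3://<your-airflow-dags-path>/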

6. Known Issues

When version 1.0 was released, it worked well in both the cn and us regions. Recently, however, running jobs in the us-east-1 region produces the following error:

Exception in User Class: java.io.IOException : com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied ....

It is not caused by IAM or S3 policies, and the VPC S3 endpoint also checks out, so the root cause has not been found yet.
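
For reference, the policy and endpoint checks mentioned above can be reproduced with the standard AWS CLI; a sketch, with the bucket name as a placeholder:

# Review the data bucket policy and the VPC S3 endpoint configuration for us-east-1
aws s3api get-bucket-policy --bucket <your-data-bucket-name> --region us-east-1
aws ec2 describe-vpc-endpoints --region us-east-1 \
    --filters Name=service-name,Values=com.amazonaws.us-east-1.s3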

7. Updates

7.1 @2022-11-04

The CSV HTTPS download links became unavailable on May 13, 2022, when the data sets were switched to Parquet files, so the following CLI no longer works:

wget "https://nyc-tlc.s3.amazonaws.com/trip data/${category}_tripdata_${YEAR}-${MONTH}.csv" -P "/tmp/nyc-tlc/"

CSV backup files are provided in a public bucket, and the following CLI works in AWS global regions:

aws s3 cp "s3://nyc-tlc/csv_backup/${category}_tripdata_${YEAR}-${MONTH}.csv" "/tmp/nyc-tlc/"

However, this public bucket does NOT support anonymous access, so it is still inaccessible to a China-region account. At first, we created a GitHub repo to store the CSV files and downloaded them as follows:

wget --tries=10 --timeout=10 "https://github.com/bluishglc/nyc-tlc-data/releases/download/v1.0/${category}_tripdata_${YEAR}-${MONTH}.csv.gz" -P "/tmp/nyc-tlc/"
gzip -d "/tmp/nyc-tlc/${category}_tripdata_${YEAR}-${MONTH}.csv.gz"

Unfortunately, the network from China to GitHub is very unstable and downloads often failed, so we had to add two parameters:

  • --nyc-tlc-access-key-id
  • --nyc-tlc-secret-access-key

These are actually the AK/SK of an AWS global account. With them, the CLI can download the CSV files from the US region to local disk and then upload them to the China-region S3 bucket.
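
For reference, a minimal sketch of doing the same transfer manually with a named AWS CLI profile for the global account; the profile name and target prefix are illustrative, not part of the project:

# Store the global account's AK/SK under a dedicated profile
aws configure set aws_access_key_id '<your-global-account-access-key-id>' --profile nyc-tlc
aws configure set aws_secret_access_key '<your-global-account-secret-access-key>' --profile nyc-tlc

# Download a CSV backup from the global bucket with that profile ...
aws s3 cp "s3://nyc-tlc/csv_backup/yellow_tripdata_2020-01.csv" "/tmp/nyc-tlc/" --profile nyc-tlc --region us-east-1

# ... then upload it to the China-region data bucket with the default (China) credentials
aws s3 cp "/tmp/nyc-tlc/yellow_tripdata_2020-01.csv" "s3://<your-data-bucket-name>/nyc-tlc/"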
