villasv / aws-airflow-stack Goto Github PK

Turbine: the bare metals that gets you Airflow

Home Page: https://victor.villas/aws-airflow-stack/

License: MIT License

Makefile 5.08% Python 54.26% Shell 40.66%

airflow airflow-cluster airflow-cookbook aws aws-cloudformation

aws-airflow-stack's Introduction

⚠️ This project is no longer receiving updates. We have moved on from using CloudFormation to manage our infrastructure and recommend others doing the same. As of 2021/2022, we think CDK and Terraform are now the best in class for IaC. Also, Airflow for Kubernetes has gotten more traction and we have moved into EKS for our self-managed Airflow. If you're looking for Open Source Airflow deployment options, we recommend Astronomer.

Turbine

Turbine is the set of bare metals behind a simple yet complete and efficient Airflow setup.

The project is intended to be easily deployed, making it great for testing, demos and showcasing Airflow solutions. It is also expected to be easily tinkered with, allowing it to be used in real production environments with little extra effort. Deploy in a few clicks, personalize in a few fields, configure in a few commands.

Overview

The stack is composed mainly of three services: the Airflow web server, the Airflow scheduler, and the Airflow worker. Supporting resources include an RDS to host the Airflow metadata database, an SQS to be used as broker backend, S3 buckets for logs and deployment bundles, an EFS to serve as shared directory, and a custom CloudWatch metric measured by a timed AWS Lambda. All other resources are the usual boilerplate to keep the wind blowing.

Deployment and File Sharing

The deployment process through CodeDeploy is very flexible and can be tailored for each project structure, the only invariant being the Airflow home directory at /airflow. It ensures that every Airflow process has the same files and can upgraded gracefully, but most importantly makes deployments really fast and easy to begin with.

There's also an EFS shared directory mounted at at /mnt/efs, which can be useful for staging files potentially used by workers on different machines and other synchronization scenarios commonly found in ETL/Big Data applications. It facilitates migrating legacy workloads not ready for running on distributed workers.

Workers and Auto Scaling

The stack includes an estimate of the cluster load average made by analyzing the amount of failed attempts to retrieve a task from the queue. The metric objective is to measure if the cluster is correctly sized for the influx of tasks. Worker instances have lifecycle hooks promoting a graceful shutdown, waiting for tasks completion when terminating.

The goal of the auto scaling feature is to respond to changes in queue load, which could mean an idle cluster becoming active or a busy cluster becoming idle, the start/end of a backfill, many DAGs with similar schedules hitting their due time, DAGs that branch to many parallel operators. Scaling in response to machine resources like facing CPU intensive tasks is not the goal; the latter is a very advanced scenario and would be best handled by Celery's own scaling mechanism or offloading the computation to another system (like Spark or Kubernetes) and use Airflow only for orchestration.

Get It Working

0. Prerequisites

Configured AWS CLI for deploying your own files (Guide)

1. Deploy the stack

Create a new stack using the latest template definition at templates/turbine-master.template. The following button will deploy the stack available in this project's master branch (defaults to your last used region):

The stack resources take around 15 minutes to create, while the airflow installation and bootstrap another 3 to 5 minutes. After that you can already access the Airflow UI and deploy your own Airflow DAGs.

2. Upstream your files

The only requirement is that you configure the deployment to copy your Airflow home directory to /airflow. After crafting your appspec.yml, you can use the AWS CLI to deploy your project.

For convenience, you can use this Makefile to handle the packaging, upload and deployment commands. A minimal working example of an Airflow project to deploy can be found at examples/project/airflow.

If you follow this blueprint, a deployment is as simple as:

make deploy stack-name=yourcoolstackname

Maintenance and Operation

Sometimes the cluster operators will want to perform some additional setup, debug or just inspect the Airflow services and database. The stack is designed to minimize this need, but just in case it also offers decent internal tooling for those scenarios.

Using Systems Manager Sessions

Instead of the usual SSH procedure, this stack encourages the use of AWS Systems Manager Sessions for increased security and auditing capabilities. You can still use the CLI after a bit more configuration and not having to expose your instances or creating bastion instances is worth the effort. You can read more about it in the Session Manager docs.

Running Airflow commands

The environment variables used by the Airflow service are not immediately available in the shell. Before running Airflow commands, you need to load the Airflow configuration:

$ export $(xargs </etc/sysconfig/airflow.env)
$ airflow list_dags

Inspecting service logs

The Airflow service runs under systemd, so logs are available through journalctl. Most often used arguments include the --follow to keep the logs coming, or the --no-pager to directly dump the text lines, but it offers much more.

$ sudo journalctl -u airflow -n 50

FAQ

Why does auto scaling takes so long to kick in?

AWS doesn't provide minute-level granularity on SQS metrics, only 5 minute aggregates. Also, CloudWatch stamps aggregate metrics with their initial timestamp, meaning that the latest stable SQS metrics are from 10 minutes in the past. This is why the load metric is always 5~10 minutes delayed. To avoid oscillating allocations, the alarm action has a 10 minutes cooldown.
Why can't I stop running tasks by terminating all workers?

Workers have lifecycle hooks that make sure to wait for Celery to finish its tasks before allowing EC2 to terminate that instance (except maybe for Spot Instances going out of capacity). If you want to kill running tasks, you will need to SSH into worker instances and stop the airflow service forcefully.
Is there any documentation around the architectural decisions?

Yes, most of them should be available in the project's GitHub Wiki. It doesn't mean those decisions are final, but reading them beforehand will help formulating new proposals.

Contributing

This project aims to be constantly evolving with up to date tooling and newer AWS features, as well as improving its design qualities and maintainability. Requests for Enhancement should be abundant and anyone is welcome to pick them up.

Stacks can get quite opinionated. If you have a divergent fork, you may open a Request for Comments and we will index it. Hopefully this will help to build a diverse set of possible deployment models for various production needs.

See the contribution guidelines for details.

You may also want to take a look at the Citizen Code of Conduct.

Did this project help you? Consider buying me a cup of coffee ;-)

Licensing

MIT License

Copyright (c) 2017 Victor Villas

See the license file for details.

aws-airflow-stack's People

Contributors

Stargazers

Watchers

aws-airflow-stack's Issues

Environment variables are not correclty injected to Supervisor

Airflow has problems with the Celery user key defined inside the Supervisor environment, but works normally with the bash environment.

Airflow database should not be publicly available

The default behavior is to be assigned a public endpoint.

Upgrade to Airflow 1.10.0

Released almost a month ago with a few breaking changes with configuration variables.

Running DAG's don't have their tasks scheduled

Something quite wrong is happening with manually triggered dags. Tasks are being assigned task_state=None and are not seen by the scheduler, so they're never getting queued.

Running a backfill from the CLI works, so I'm inclined to think this is a problem with the scheduler.

Parameters for AutoScaling alarm thresholds

This is something very dependent on the stack, must be parametrized.

Intermittent problems with EPEL 6 repository

Once in a while yum makecache fails, apparently because EPEL screws up (epel: Check uncompressed DB failed) and expected the wrong payload size ([Errno 14] HTTP Error 416 - Requested Range Not Satisfiable for all its mirrors).

Perhaps I can make this problem go from "uncommon" to "rare" if I configure yum to ignore unavailable repositories (yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true), therefore encouraging it to ignore the cache.

Shouldn't be very impactful, the only use of EPEL6 at the moment is the supervisor package.

Adhere to the maximum line length for YAML adopted by CFN

Uploading the code to the designer and then downloading it will convert a vew string blocks from standard line continuation | to >, introducing a bunch of blank lines. This is a minor inconvenience, the changes are rapidly undone.

A very minor issue, but making the upload/download idempotent would be pretty convenient.

Regions and AMIs should be parameterized

Feedback from the AWS team. Really a must for more diverse deployment scenarios.

Have the airflow worker process run under supervisord

Is it really necessary? Probably not, but having airflow attached to a service will make it easier to control it and it might help solving #7.

Merge CloudFormation::Init sections in a single resource

A very welcome trick is described here to share initialization configuration for EC2 instances.

Basically it comes from the fact that calling the initialization routine will accept the resource ID instead of (what I assumed) always fetching from the caller resource. This means that one resource can access the CFN::Init metadata from the other.

Using a dummy resource we can concentrate all init configs in a single place for once.

Turbine CFN stack deploy diagnostic tool

Having a diagnostic tool for airflow deployments could be really useful, although this task might seem too general and complex. Testing the UI and if tasks are getting scheduled/executed would be a good start, as well as connections.

Zero networking resources (parametrized) fork

At @gupy-io, the deployment must happen inside the appropriate VPC and Subnets which have VPN configured. Probably most production environments will be similar, and already will contain multi AZ subnets required for RDS.

Do not name SQS queues manually

Queue names are unique because they're used by CloudWatch metrics. Just let AWS pick a name so multiple stacks can be deployed simultaneously in the same region.

The /efs mount doesn’t survive reboots

Shouldn't be a issue most of the time, but rebooting shouldn't be so destructive.

SQS as Celery broker requires region configuration

Now that the stack targets any region, this configuration should be passed to Celery.

http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html#region

Relevant section: https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/default_airflow.cfg#L416

Use pre-baked AMIs with installed packages and scripts

This would improve two aspects: initialization times and code reuse (worker, scheduler and interface can be identical images).

Set Airflow's webserver hostname dynamically

Setting the hostname is not mandatory but very practical, specially when trying to consult logs.

Remove hardcoded Cidr blocks from SecurityGroups

Enhancement for the default setup, but actually a bug in the RFC/#24 with custom networking.

Don't name security groups manually

Again, another resource that will lead to name crashes if deploying multiple stacks.

Have the airflow scheduler process run under supervisord

This is important not only for reliability reasons, but it will serve as model for correctly starting processes at init time for workers and after continuous integration.

Use remote logging with S3 instead of EFS

EFS I/O can be quite expensive, S3 is reasonably well supported by Airflow at this point.

Have the airflow webserver process run under supervisord

This should follow the same model from #8, with minor changes and unfortunately lots of template duplication.

Celery role doesn't need permission to create queues

Airflow installation doesn’t work in the latest AMIs

Reported by @tomfaulhaber, didn't reproduce yet.

Still need to decide on #34, because providing baked AMIs adds a lot more complexity to the template, although it greatly simplifies the actual resource creation during autoscaling.

But again, autoscaling is not the main feature of the stack anymore. The installation is not that expensive and as long as it consistently works on the newer official Amazon Linux AMIs it should be fine.

Parameters for AutoScaling min/max group sizes

This is something very dependent on the stack, must be parametrized.

Use package managed supervisord service configuration

Currently including huge initscripts (init.d definitions) inside the template. That wasn't necessary using Ubuntu.

Adding a database suited for a feature store

Also very likely to be wanted by most people, although VERY much opinionated. Most likely to be schemaless and commonly used by the big data folk.

Don't assign a specific private IP address to the EFS MountTarget

Assigning fixed IP addresses makes replacing resources harder if necessary. It also makes updating CIDR blocks more complicated.

Celery with SQS

I was reading a blog post about SQS that was not working well with Airflow .

https://medium.com/@Twistacz/whats-wrong-with-sqs-and-why-we-gave-it-up-22780d4f48ef

Is this the case now ???
If so what is the best broker to make it work in AWS environment .

Optionally allow Spot Pricing in the AutoScaling Group

This was one of the initial goals and it was postponed until we had a working stack. Now it's the time.

Add a presentation-like document for workshops and lightning talks

Not only for me, of course. Almost anyone that is going to want to use this will have to pitch the idea and justify the infrastructure and investment associated. Make data engineers life easier.

Use the stack name as tag prefix on AWS resources

The stack was already configured in a way that multiple stacks can be created without conflicts - by using tags and not unique names to identify them.

But still, those tags are generally turbine- or some prefix alike. Ideally we should provide support for additional tagging information so users can easily distinguish between a production deployment and a testing deployment.

Use the cgroup task runner to limit resource usage

This contrib task runner allows us to configure a maximum resource usage for tasks which could offer some great flexibility and control.

Trust the installation of celery[sqs] to handle its dependencies

Currently installing boto3 to solve #14, but I should get rid of it as soon as they fix the dependency listing.

Worker machines created by ASG won't pick up tasks

Not sure exactly what's happening. It seems that Celery started but died, maybe the UserData script is not supposed to hang on it.

Make sure to install boto3 along with celery[sqs]

Just a temporary fix, until they update the dependencies.
Watch celery/celery#4195

Readme should advise to restdb and restart webserver appropriately

Simply following the given instructions does no good and will leave the webserver and scheduler temporarily with inconsistent views. Also the resetdb remark comes too late, as it is certain that it will be needed if deploying each own DAGs.

Make sure AWS_SECRET_ACCESS_KEY is urlencoded

It may contain characters that require urlencoding. Easiest way out is to inline some python.

supervisorctl fails to get back from its socket

Makes using the cli very painful

Set a name tag for EFS file systems

Add Continuous Integration to the Airflow files inside EFS

This is going to be a little complicated, but I believe it's doable with CodeDeploy and CodePipeline.

Finishing this will completely eliminate the need to setup things with SSH.

Scaling back instances might kill running tasks

The stack autoscaling model is not flexible at all: scale in response to the queue length. But except for DAGs consisting solely of quick operations, this is problematic.

If the downscaling in response to 0 queued tasks is N seconds, any tasks that take N+1 seconds might get killed. That's just an obvious example, you don't need to get execution times close to N to get problems if you have many tasks because the average queue length can still be close to zero.

It's very unlikely that there can be a one-size-fits-all autoscaling strategy, this one was implemented because it was easy and useful. The problem is, we should strive for a transparent infrastructure and should not have things like an Operator that allocates 5 machines, for example. DAGs should be all about data.

Reinstall pycurl with proper SLL backend

Celery was erroring with
ImportError: The curl client requires the pycurl library.
even though PyCurl is installed.

Turns out it was erroring to be imported because
ImportError: pycurl: libcurl link-time ssl backend (openssl) is different from compile-time ssl backend (none/other).

Solution is already documented:
pip-3.5 install --no-cache-dir --compile --ignore-installed --install-option="--with-nss" pycurl