villasv / aws-airflow-stack

Turbine: the bare metals that gets you Airflow
Home Page: https://victor.villas/aws-airflow-stack/
License: MIT License
Simply following the given instructions does no good and will leave the webserver and scheduler temporarily with inconsistent views. Also, the resetdb remark comes too late, as it is certain to be needed when deploying one's own DAGs.
The stack was already configured so that multiple stacks can be created without conflicts, by using tags rather than unique names to identify resources. Still, those tags generally use `turbine-` or a similar prefix. Ideally we should support additional tagging information so users can easily distinguish between a production deployment and a testing deployment.
Now that the stack targets any region, this configuration should be passed to Celery.
http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html#region
Relevant section: https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/default_airflow.cfg#L416
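A minimal sketch of what passing the region to Celery could look like, per the SQS transport docs linked above. The `region` key is documented by Celery; the region value and variable name here are illustrative, as in practice the value would come from the stack's region:

```python
# Sketch: point Celery's SQS transport at the stack's region.
# "region" and "visibility_timeout" are documented SQS transport options;
# the value "us-east-1" is only an illustration.
aws_region = "us-east-1"

broker_transport_options = {
    "region": aws_region,
    "visibility_timeout": 3600,  # give long tasks time before redelivery
}

print(broker_transport_options["region"])
```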
Uploading the template to the designer and then downloading it converts a few string blocks from standard line continuation (`|`) to folded style (`>`), introducing a bunch of blank lines. This is a minor inconvenience, since the changes are quickly undone. A very minor issue, but making the upload/download round trip idempotent would be pretty convenient.
Something quite wrong is happening with manually triggered DAGs. Tasks are being assigned `task_state=None` and are not seen by the scheduler, so they're never getting queued.
Running a backfill from the CLI works, so I'm inclined to think this is a problem with the scheduler.
The default behavior is to be assigned a public endpoint.
Just a temporary fix, until they update the dependencies.
Watch celery/celery#4195
Setting the hostname is not mandatory but very practical, especially when trying to consult logs.
The stack's autoscaling model is not flexible at all: it scales in response to the queue length. Except for DAGs consisting solely of quick operations, this is problematic.
If scale-in triggers after N seconds of zero queued tasks, any task that takes N+1 seconds might get killed. That's just an obvious example; you don't need execution times close to N to hit problems, because with many tasks the average queue length can still be close to zero while work is in flight.
It's very unlikely that there is a one-size-fits-all autoscaling strategy; this one was implemented because it was easy and useful. The problem is that we should strive for a transparent infrastructure, and should not have things like an Operator that allocates 5 machines, for example. DAGs should be all about data.
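To make the hazard concrete, here is a tiny sketch (not the stack's actual scaling code; the function name and sampling model are assumptions) of a queue-length scale-in rule firing while a long task is still running:

```python
def should_scale_in(queue_length_samples, cooldown_samples):
    # Hypothetical model of the current strategy: scale in once the
    # queue has read empty for `cooldown_samples` consecutive samples.
    recent = queue_length_samples[-cooldown_samples:]
    return len(recent) == cooldown_samples and all(q == 0 for q in recent)

# A worker already picked up the last task, so the queue reads empty
# even though the task is still running -- scale-in fires mid-task.
samples = [3, 1, 0, 0, 0]
print(should_scale_in(samples, cooldown_samples=3))  # True
```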
Also very likely to be wanted by most people, although VERY much opinionated. Most likely something schemaless and commonly used by the big data folk.
Released almost a month ago with a few breaking changes with configuration variables.
This was one of the initial goals and it was postponed until we had a working stack. Now it's the time.
Enhancement for the default setup, but actually a bug in the RFC/#24 with custom networking.
Reported by @tomfaulhaber, didn't reproduce yet.
Still need to decide on #34, because providing baked AMIs adds a lot more complexity to the template, although it greatly simplifies the actual resource creation during autoscaling.
But again, autoscaling is not the main feature of the stack anymore. The installation is not that expensive and as long as it consistently works on the newer official Amazon Linux AMIs it should be fine.
I was reading a blog post about SQS not working well with Airflow:
https://medium.com/@Twistacz/whats-wrong-with-sqs-and-why-we-gave-it-up-22780d4f48ef
Is this still the case? If so, what is the best broker to use in an AWS environment?
Especially the airflow celery worker, which has to be forced to accept running as root. This can be done by configuring supervisor and perhaps granting a few folder permissions. It makes using the CLI very painful.
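A hedged sketch of the supervisor side of this. The program name and paths are assumptions; `C_FORCE_ROOT` is the environment variable Celery checks before refusing to run as root:

```ini
; Sketch of a supervisor program entry -- paths and program name are
; illustrative, not the stack's actual configuration.
[program:airflow-worker]
command=/usr/local/bin/airflow worker
environment=C_FORCE_ROOT="true",AIRFLOW_HOME="/airflow"
autorestart=true
```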
Celery was erroring with `ImportError: The curl client requires the pycurl library.` even though PyCurl is installed. Turns out pycurl itself was failing to import because of `ImportError: pycurl: libcurl link-time ssl backend (openssl) is different from compile-time ssl backend (none/other)`.
The solution is already documented:
pip-3.5 install --no-cache-dir --compile --ignore-installed --install-option="--with-nss" pycurl
Not sure exactly what's happening. It seems that Celery started but died, maybe the UserData
script is not supposed to hang on it.
Any CI provider would do, but Scrutinizer doesn't require me to add another YAML file in the project so it has that going for me which is nice.
This is not strictly necessary, but will make integration with CloudFormation utilities easier.
It will make setting up the environment harder, but this is generally easier to find answers around.
Necessary step before #8.
ATM testing requires me to open the SecurityGroup TCP from 10.0.0.0/16 to 0.0.0.0/0.
This is supposed to be a testing setup, so just leave it open.
Of course, it would be best to add the auth layer in the Airflow admin.
Once in a while `yum makecache` fails, apparently because EPEL screws up (`epel: Check uncompressed DB failed`) and expects the wrong payload size (`[Errno 14] HTTP Error 416 - Requested Range Not Satisfiable` for all its mirrors).
Perhaps I can make this problem go from "uncommon" to "rare" by configuring yum to ignore unavailable repositories (`yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true`), therefore encouraging it to ignore the cache.
Shouldn't be very impactful; the only use of EPEL6 at the moment is the `supervisor` package.
Airflow has problems with the Celery user key defined inside the Supervisor environment, but works normally with the bash environment.
Currently celery creates the queue on its own, perhaps requiring more privileges than it should.
Shouldn't be an issue most of the time, but rebooting shouldn't be so destructive.
Should visualization become part of the default stack? Probably not if we think in separation of concerns, but it's pretty useful and possibly wanted by many people (me included).
This is very dependent on the stack and must be parametrized.
A very welcome trick is described here for sharing initialization configuration between EC2 instances.
Basically, the initialization routine accepts a resource logical ID instead of (what I assumed) always fetching metadata from the calling resource. This means that one resource can access the CFN::Init metadata of another.
Using a dummy resource, we can concentrate all init configs in a single place for once.
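A rough sketch of the trick, with illustrative logical IDs and package contents rather than the stack's actual resources:

```yaml
# A no-op resource that exists only to carry the shared cfn-init metadata.
SharedInit:
  Type: AWS::CloudFormation::WaitConditionHandle
  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            supervisor: []
```

Each instance's UserData then targets the dummy resource explicitly instead of its own logical ID: `/opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource SharedInit --region ${AWS::Region}`.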
This is important not only for reliability reasons, but it will serve as model for correctly starting processes at init time for workers and after continuous integration.
This is going to be a little complicated, but I believe it's doable with CodeDeploy and CodePipeline.
Finishing this will completely eliminate the need to setup things with SSH.
This would improve two aspects: initialization times and code reuse (worker, scheduler and interface can be identical images).
EFS I/O can be quite expensive, S3 is reasonably well supported by Airflow at this point.
Not only for me, of course. Almost anyone who wants to use this will have to pitch the idea and justify the infrastructure and the associated investment. Make data engineers' lives easier.
Having a diagnostic tool for airflow deployments could be really useful, although this task might seem too general and complex. Testing the UI and if tasks are getting scheduled/executed would be a good start, as well as connections.
This contrib task runner allows us to configure a maximum resource usage for tasks which could offer some great flexibility and control.
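Assuming this refers to the contrib CgroupTaskRunner (an assumption, since the runner isn't named here), enabling it would be a small config change; the exact section and key should be checked against the version's default_airflow.cfg:

```ini
# Hedged sketch: switch the task runner in airflow.cfg.
# CgroupTaskRunner enforces per-task cgroup resource limits.
[core]
task_runner = CgroupTaskRunner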
This is very dependent on the stack and must be parametrized.
Assigning fixed IP addresses makes replacing resources harder if necessary. It also makes updating CIDR blocks more complicated.
Again, another resource that will lead to name clashes when deploying multiple stacks.
Queue names are unique because they're used by CloudWatch metrics. Just let AWS pick a name so multiple stacks can be deployed simultaneously in the same region.
Feedback from the AWS team. Really a must for more diverse deployment scenarios.
This should follow the same model from #8, with minor changes and unfortunately lots of template duplication.
Currently installing boto3 to solve #14, but I should get rid of it as soon as they fix the dependency listing.
Is it really necessary? Probably not, but having airflow attached to a service will make it easier to control it and it might help solving #7.
It may contain characters that require urlencoding. The easiest way out is to inline some Python.
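A sketch of the inline-Python workaround, using only the standard library; the sample value is purely illustrative, not an actual secret:

```python
# Percent-encode a value so it can be embedded safely in a URI.
# quote_plus is in the Python standard library.
from urllib.parse import quote_plus

password = "p@ss/w:rd"  # illustrative value only
print(quote_plus(password))  # p%40ss%2Fw%3Ard
```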
Currently we're including huge initscripts (`init.d` definitions) inside the template. That wasn't necessary when using Ubuntu.