villasv / aws-airflow-stack

Turbine: the bare metals that gets you Airflow
Home Page: https://victor.villas/aws-airflow-stack/
License: MIT License
Simply following the given instructions does no good and will leave the webserver and scheduler temporarily with inconsistent views. Also, the resetdb remark comes too late, as it is certain to be needed when deploying one's own DAGs.
The stack was already configured so that multiple stacks can be created without conflicts, by using tags rather than unique names to identify resources. Still, those tags generally use `turbine-` or a similar prefix. Ideally we should support additional tagging information so users can easily distinguish between a production deployment and a testing deployment.
Now that the stack targets any region, this configuration should be passed to Celery.
http://docs.celeryproject.org/en/latest/getting-started/brokers/sqs.html#region
Relevant section: https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/default_airflow.cfg#L416
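A minimal sketch of what passing the region to Celery could look like, per the SQS transport docs linked above. The `region` key is documented by Celery; the region value and variable name here are illustrative, as in practice the value would come from the stack's region:

```python
# Sketch: point Celery's SQS transport at the stack's region.
# "region" and "visibility_timeout" are documented SQS transport options;
# the value "us-east-1" is only an illustration.
aws_region = "us-east-1"

broker_transport_options = {
    "region": aws_region,
    "visibility_timeout": 3600,  # give long tasks time before redelivery
}

print(broker_transport_options["region"])
```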
Uploading the template to the designer and then downloading it converts a few string blocks from standard line continuation (`|`) to folded style (`>`), introducing a bunch of blank lines. This is a minor inconvenience, since the changes are quickly undone. A very minor issue, but making the upload/download round trip idempotent would be pretty convenient.
Something quite wrong is happening with manually triggered DAGs. Tasks are being assigned `task_state=None` and are not seen by the scheduler, so they're never getting queued.
Running a backfill from the CLI works, so I'm inclined to think this is a problem with the scheduler.
The default behavior is to be assigned a public endpoint.
Just a temporary fix, until they update the dependencies.
Watch celery/celery#4195
Setting the hostname is not mandatory but very practical, especially when trying to consult logs.
The stack's autoscaling model is not flexible at all: it scales in response to the queue length. Except for DAGs consisting solely of quick operations, this is problematic.
If scale-in triggers after N seconds of zero queued tasks, any task that takes N+1 seconds might get killed. That's just an obvious example; you don't need execution times close to N to hit problems, because with many tasks the average queue length can still be close to zero while work is in flight.
It's very unlikely that there is a one-size-fits-all autoscaling strategy; this one was implemented because it was easy and useful. The problem is that we should strive for a transparent infrastructure, and should not have things like an Operator that allocates 5 machines, for example. DAGs should be all about data.
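To make the hazard concrete, here is a tiny sketch (not the stack's actual scaling code; the function name and sampling model are assumptions) of a queue-length scale-in rule firing while a long task is still running:

```python
def should_scale_in(queue_length_samples, cooldown_samples):
    # Hypothetical model of the current strategy: scale in once the
    # queue has read empty for `cooldown_samples` consecutive samples.
    recent = queue_length_samples[-cooldown_samples:]
    return len(recent) == cooldown_samples and all(q == 0 for q in recent)

# A worker already picked up the last task, so the queue reads empty
# even though the task is still running -- scale-in fires mid-task.
samples = [3, 1, 0, 0, 0]
print(should_scale_in(samples, cooldown_samples=3))  # True
```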
Also very likely to be wanted by most people, although VERY much opinionated. Most likely something schemaless and commonly used by the big data folk.
Released almost a month ago with a few breaking changes with configuration variables.
This was one of the initial goals and it was postponed until we had a working stack. Now it's the time.
Enhancement for the default setup, but actually a bug in the RFC/#24 with custom networking.
Reported by @tomfaulhaber, didn't reproduce yet.
Still need to decide on #34, because providing baked AMIs adds a lot more complexity to the template, although it greatly simplifies the actual resource creation during autoscaling.
But again, autoscaling is not the main feature of the stack anymore. The installation is not that expensive and as long as it consistently works on the newer official Amazon Linux AMIs it should be fine.
I was reading a blog post about SQS not working well with Airflow:
https://medium.com/@Twistacz/whats-wrong-with-sqs-and-why-we-gave-it-up-22780d4f48ef
Is this still the case? If so, what is the best broker to use in an AWS environment?
Especially the airflow celery worker, which has to be forced to accept running as root. This can be done by configuring supervisor and perhaps granting a few folder permissions. It makes using the CLI very painful.
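A hedged sketch of the supervisor side of this. The program name and paths are assumptions; `C_FORCE_ROOT` is the environment variable Celery checks before refusing to run as root:

```ini
; Sketch of a supervisor program entry -- paths and program name are
; illustrative, not the stack's actual configuration.
[program:airflow-worker]
command=/usr/local/bin/airflow worker
environment=C_FORCE_ROOT="true",AIRFLOW_HOME="/airflow"
autorestart=true
```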
Celery was erroring with `ImportError: The curl client requires the pycurl library.` even though PyCurl is installed. Turns out pycurl itself was failing to import because of `ImportError: pycurl: libcurl link-time ssl backend (openssl) is different from compile-time ssl backend (none/other)`.
The solution is already documented:
pip-3.5 install --no-cache-dir --compile --ignore-installed --install-option="--with-nss" pycurl
Not sure exactly what's happening. It seems that Celery started but died, maybe the UserData
script is not supposed to hang on it.
Any CI provider would do, but Scrutinizer doesn't require me to add another YAML file in the project so it has that going for me which is nice.
This is not strictly necessary, but will make integration with CloudFormation utilities easier.
It will make setting up the environment harder, but this is generally easier to find answers around.
Necessary step before #8.
ATM testing requires me to open the SecurityGroup TCP from 10.0.0.0/16 to 0.0.0.0/0.
This is supposed to be a testing setup, so just leave it open.
Of course, it would be best to add the auth layer in the Airflow admin.
Once in a while `yum makecache` fails, apparently because EPEL screws up (`epel: Check uncompressed DB failed`) and expects the wrong payload size (`[Errno 14] HTTP Error 416 - Requested Range Not Satisfiable` for all its mirrors).
Perhaps I can make this problem go from "uncommon" to "rare" by configuring yum to ignore unavailable repositories (`yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true`), therefore encouraging it to ignore the cache.
Shouldn't be very impactful; the only use of EPEL6 at the moment is the `supervisor` package.
Airflow has problems with the Celery user key defined inside the Supervisor environment, but works normally with the bash environment.
Currently celery creates the queue on its own, perhaps requiring more privileges than it should.
Shouldn't be an issue most of the time, but rebooting shouldn't be so destructive.
Should visualization become part of the default stack? Probably not if we think in separation of concerns, but it's pretty useful and possibly wanted by many people (me included).
This is very dependent on the stack and must be parametrized.
A very welcome trick is described here for sharing initialization configuration between EC2 instances.
Basically, the initialization routine accepts a resource logical ID instead of (what I assumed) always fetching metadata from the calling resource. This means that one resource can access the CFN::Init metadata of another.
Using a dummy resource, we can concentrate all init configs in a single place for once.
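A rough sketch of the trick, with illustrative logical IDs and package contents rather than the stack's actual resources:

```yaml
# A no-op resource that exists only to carry the shared cfn-init metadata.
SharedInit:
  Type: AWS::CloudFormation::WaitConditionHandle
  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            supervisor: []
```

Each instance's UserData then targets the dummy resource explicitly instead of its own logical ID: `/opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource SharedInit --region ${AWS::Region}`.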
This is important not only for reliability reasons, but it will serve as model for correctly starting processes at init time for workers and after continuous integration.
This is going to be a little complicated, but I believe it's doable with CodeDeploy and CodePipeline.
Finishing this will completely eliminate the need to setup things with SSH.
This would improve two aspects: initialization times and code reuse (worker, scheduler and interface can be identical images).
EFS I/O can be quite expensive, S3 is reasonably well supported by Airflow at this point.
Not only for me, of course. Almost anyone who wants to use this will have to pitch the idea and justify the infrastructure and the associated investment. Make data engineers' lives easier.
Having a diagnostic tool for airflow deployments could be really useful, although this task might seem too general and complex. Testing the UI and if tasks are getting scheduled/executed would be a good start, as well as connections.
This contrib task runner allows us to configure a maximum resource usage for tasks which could offer some great flexibility and control.
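Assuming this refers to the contrib CgroupTaskRunner (an assumption, since the runner isn't named here), enabling it would be a small config change; the exact section and key should be checked against the version's default_airflow.cfg:

```ini
# Hedged sketch: switch the task runner in airflow.cfg.
# CgroupTaskRunner enforces per-task cgroup resource limits.
[core]
task_runner = CgroupTaskRunner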
This is very dependent on the stack and must be parametrized.
Assigning fixed IP addresses makes replacing resources harder if necessary. It also makes updating CIDR blocks more complicated.
Again, another resource that will lead to name clashes when deploying multiple stacks.
Queue names are unique because they're used by CloudWatch metrics. Just let AWS pick a name so multiple stacks can be deployed simultaneously in the same region.
Feedback from the AWS team. Really a must for more diverse deployment scenarios.
This should follow the same model from #8, with minor changes and unfortunately lots of template duplication.
Currently installing boto3 to solve #14, but I should get rid of it as soon as they fix the dependency listing.
Is it really necessary? Probably not, but having airflow attached to a service will make it easier to control it and it might help solving #7.
It may contain characters that require urlencoding. The easiest way out is to inline some Python.
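A sketch of the inline-Python workaround, using only the standard library; the sample value is purely illustrative, not an actual secret:

```python
# Percent-encode a value so it can be embedded safely in a URI.
# quote_plus is in the Python standard library.
from urllib.parse import quote_plus

password = "p@ss/w:rd"  # illustrative value only
print(quote_plus(password))  # p%40ss%2Fw%3Ard
```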
Currently we're including huge initscripts (`init.d` definitions) inside the template. That wasn't necessary when using Ubuntu.