Exploring and creating workflows in Apache Airflow
Apache Airflow is a Python-based workflow management system originally developed at Airbnb.
Workflows can automate data pipelines and ETL processes, and are defined as Directed Acyclic Graphs (DAGs).
A brief overview of Airflow DAGs:
Note:
- The Airflow scheduler runs each DAG at its given/scheduled time; once a DAG run succeeds for a given timestamp, it cannot be triggered again for that same timestamp.
- Airflow SubDAGs are recommended against, because the SubDagOperator and its tasks run independently of the parent DAG.
Further reading:
[1]: https://airflow.apache.org/docs/stable/index.html
[2]: Airflow use case from Lyft
[3]: Airflow Operators and Hooks
[4]: Snowflake connector
[5]: Connecting to Snowflake using Airflow
[6]: Airflow SubDAGs
[7]: Slack integration
Airflow can send metrics to StatsD.
StatsD can forward the data to a backend service for further visualisation and analysis (e.g. Datadog). StatsD is composed of three components: client, server and backend.
It sends metrics in UDP packets; if a metric is critical, use the TCP connection/client (recently added to StatsD) to send it instead.
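The StatsD wire format is just `<metric>:<value>|<type>` sent as a UDP datagram, so a client needs nothing beyond a socket. A minimal sketch (the metric name and port are illustrative; real Airflow setups would use the `statsd` client library instead):

```python
import socket


def statsd_packet(name, value, metric_type="c"):
    # StatsD wire format: <metric>:<value>|<type>
    # where type is c (counter), g (gauge) or ms (timer).
    return f"{name}:{value}|{metric_type}".encode()


def send_metric(name, value, metric_type="c", host="localhost", port=8125):
    # Fire-and-forget over UDP: no connection and no delivery guarantee,
    # which is exactly why critical metrics may warrant the TCP client.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))


send_metric("airflow.demo.task_duration", 42, "ms")
```

Because UDP is connectionless, the send succeeds even when no StatsD server is listening; the metric is simply lost.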
Useful command: to listen for StatsD traffic (UDP) on port 8125:
while true; do nc -l -u localhost 8125; done
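An equivalent self-contained loopback sketch in Python, useful for checking the wire format without netcat (binding port 0 lets the OS pick a free port, so it will not clash with a running StatsD daemon; the metric name is made up):

```python
import socket

# Listener: plays the role of the StatsD server (nc in the command above).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0 -> OS picks a free port
server.settimeout(2.0)
port = server.getsockname()[1]

# Client: send one counter metric in StatsD wire format.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"airflow.demo:1|c", ("127.0.0.1", port))

data, _ = server.recvfrom(1024)
print(data.decode())  # airflow.demo:1|c

client.close()
server.close()
```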
Integrating Datadog with Airflow:
Datadog is a monitoring service. The Datadog Agent's DogStatsD daemon receives StatsD metrics from Airflow and forwards them to the Datadog cloud host.
We can then use Datadog to view/visualise the metric data and run richer queries over it.
Setup -
Config and mapping files:
- Check the Airflow configuration file - airflow.cfg
- Also check the Datadog and StatsD mapping file - datadog.yml
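For reference, the StatsD-related settings in airflow.cfg look roughly like this (a sketch: these keys lived under `[scheduler]` in Airflow 1.10 and moved to a `[metrics]` section in Airflow 2; the host, port and prefix values are assumptions for a local setup):

```ini
[scheduler]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```

With these set, the scheduler emits metrics such as timers and counters for DAG and task runs, which the StatsD server (or DogStatsD) receives on port 8125.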
Further reading on StatsD -
[1]: Setup Metrics for Airflow using StatsD
[2]: https://thenewstack.io/collecting-metrics-using-statsd-a-standard-for-real-time-monitoring/
[3]: Python StatsD documentation
[4]: https://sysdig.com/blog/monitoring-statsd-metrics/
[5]: https://www.scalyr.com/blog/statsd-measure-anything-in-your-system/
[6]: Datadog