- build docker image
- build final result
- have git tags to checkout
- instructions as a Github pages repo
This repository contains the files and data from a workshop at PARISOMA as well as resources around data engineering.
I would love your feedback on the materials in the GitHub issues. And/or please do not hesitate to reach out to me directly via email at [email protected] or on Twitter @clearspandex.
The presentation can be found on Slideshare here or in this repository (presentation.pdf).
Throughout this workshop, you will learn how to make a scalable and sustainable data pipeline in Python with Luigi. Specifically, you will learn how to (a minimal single-stage example is sketched right after this list):
- Run a simple 1 stage Luigi flow reading/writing to local files
- Write a Luigi flow containing stages with multiple dependencies
- Visualize the progress of the flow using the centralized scheduler
- Parameterize the flow from the command line
- Output parameter specific output files
- Manage serialization to/from a Postgres database
- Integrate a Hadoop Map/Reduce task into an existing flow
- Parallelize non-dependent stages of a multi-stage Luigi flow
- Schedule a local Luigi job to run once every day
- Run any arbitrary shell command in a repeatable way
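To ground these goals, here is a minimal sketch of a single-stage Luigi flow that reads from and writes to local files. The task and file names are illustrative, not taken from the workshop code:

```python
import luigi


class WordCount(luigi.Task):
    """Count the words in a local input file and write the total to an output file."""
    input_path = luigi.Parameter(default="data/input.txt")

    def output(self):
        # Luigi uses the existence of this target to decide whether the task needs to run.
        return luigi.LocalTarget("data/word_count.txt")

    def run(self):
        with open(self.input_path) as f:
            count = len(f.read().split())
        with self.output().open("w") as out:
            out.write(str(count))


if __name__ == "__main__":
    # Equivalent to: python word_count.py WordCount --local-scheduler
    luigi.run()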
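```

Running `python word_count.py WordCount --local-scheduler` executes the task once; on subsequent runs Luigi skips it because the output target already exists.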
- Install Python; I recommend Anaconda (Mac OSX or Windows): http://continuum.io/downloads
- Get the files: download the ZIP or clone this repository (git tutorial): `git clone https://github.com/Jay-Oh-eN/data-engineering-101`
- Text Editor: I recommend [Sublime Text][sublime]
- A (modern) Web Browser: I recommend [Google Chrome][chrome]
- Docker: download Kitematic
Time | Activity |
---|---|
1:00-1:10 | Components of Data Pipelines (Lecture) |
1:10-1:20 | What and Why Luigi (Lecture) |
1:20-1:40 | The Smallest (1 stage) Pipeline (Live Code) |
1:25-1:40 | The Smallest (1 stage) Pipeline (Lab) |
1:25-1:40 | The Smallest (1 stage) Pipeline (Solution) |
 | Managing dependencies in a pipeline (10min) |
 | Lab: Multi-stage pipeline and introduction to the Luigi Visualizer (15min) |
 | Serialization in a Data Pipeline (10min) |
 | Lab: Integrating your pipeline with HDFS and Postgres (20min) |
 | Scheduling (10min) |
 | Lab: Parallelism and recurring jobs with Luigi (20min) |
 | Wrap up and next steps (5min) |
- Install Python; I recommend Anaconda (Mac OSX or Windows): http://continuum.io/downloads
- Get the files: download the ZIP or clone this repository (git tutorial): `git clone https://github.com/Jay-Oh-eN/data-engineering-101`
- Hadoop Docker (with script `upload-data.sh` to transfer files)
- Luigi Client Docker
- Install libraries and dependencies: `pip install -r requirements.txt`
- Start the UI server: `luigid --background --logdir logs`
- Navigate with a web browser to `http://localhost:[port]` where `[port]` is the port the `luigid` server has started on (`luigid` defaults to port 8082)
- Run the final pipeline: `python ml-pipeline.py BuildModels --input-dir text --num-topics 10 --lam 0.8` (the sketch after this list shows how these flags map onto Luigi parameters)
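The `--input-dir`, `--num-topics`, and `--lam` flags in the final command correspond to parameters declared on the `BuildModels` task. Below is a hedged sketch of how such parameters are typically declared; the actual `BuildModels` class in `ml-pipeline.py` may look different:

```python
import luigi


class BuildModels(luigi.Task):
    # Command-line flags map onto these parameters:
    #   --input-dir text --num-topics 10 --lam 0.8
    input_dir = luigi.Parameter()
    num_topics = luigi.IntParameter(default=10)
    lam = luigi.FloatParameter(default=0.8)

    def output(self):
        # Parameter-specific output file, so different runs do not clobber each other.
        return luigi.LocalTarget(
            "models/topics-{}-lam-{}.model".format(self.num_topics, self.lam)
        )

    def run(self):
        # Placeholder body; the real task would fit and serialize the topic models here.
        with self.output().open("w") as out:
            out.write("model placeholder")
```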
- Start Hadoop cluster: `bin/start-dfs.sh; sbin/start-yarn.sh`
- Setup directory structure: `hadoop fs -mkdir /tmp/text`
- Get files on cluster: `hadoop fs -put ./data/text /tmp/text`
- Retrieve results: `hadoop fs -getmerge /tmp/text-count/2012-06-01 ./counts.txt`
- View results: `head ./counts.txt`
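For context, Luigi's Hadoop contrib module lets a map/reduce job slot into a flow as an ordinary task, which is what `hadoop_word_count.py` demonstrates. The sketch below is illustrative rather than the exact contents of that file; the HDFS paths simply mirror the commands above:

```python
import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs


class InputText(luigi.ExternalTask):
    """The text files already uploaded to HDFS (see the `hadoop fs -put` step above)."""
    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("/tmp/text")


class HadoopWordCount(luigi.contrib.hadoop.JobTask):
    def requires(self):
        return InputText()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget("/tmp/text-count/2012-06-01")

    def mapper(self, line):
        # Emit (word, 1) for every word on the line.
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        # Sum the counts for each word.
        yield key, sum(values)
```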
File | Description |
---|---|
`text/` | 20newsgroups text files |
`example_luigi.py` | example scaffold of a luigi pipeline |
`hadoop_word_count.py` | example luigi pipeline using Hadoop |
`ml-pipeline.py` | luigi pipeline covered in workshop |
`LICENSE` | details of rights of use and distribution |
`presentation.pdf` | lecture slides from presentation |
`readme.md` | this file! |
The data (in the `text/` folder) is from the 20 newsgroups dataset, a standard benchmarking dataset for machine learning and NLP. Each file in `text/` corresponds to a single 'document' (or post) from one of two selected newsgroups (`comp.sys.ibm.pc.hardware` or `alt.atheism`). The first line indicates which group the document is from and everything thereafter is the body of the post. For example:
    comp.sys.ibm.pc.hardware
    I'm looking for a better method to back up files. Currently using a MaynStream
    250Q that uses DC 6250 tapes. I will need to have a capacity of 600 Mb to 1Gb
    for future backups. Only DOS files.
    I would be VERY appreciative of information about backup devices or
    manufacturers of these products. Flopticals, DAT, tape, anything.
    If possible, please include price, backup speed, manufacturer (phone #?),
    and opinions about the quality/reliability.
    Please E-Mail, I'll send summaries to those interested.
    Thanx in advance,
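Since the first line of each file is the newsgroup label and the rest is the body, splitting a document takes only a few lines of Python (a hedged helper for illustration, not part of the workshop code):

```python
def read_document(path):
    """Return (newsgroup_label, body_text) for one file in the text/ folder."""
    with open(path) as f:
        label = f.readline().strip()  # e.g. 'comp.sys.ibm.pc.hardware'
        body = f.read()
    return label, body
```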
- Questioning the Lambda Architecture
- Luigi: NYC Data Science Meetup
- The Log: What every software engineer should know about real-time data's unifying abstraction
- I (heart) Log
- Why Loggly Loves Apache Kafka
- Buffer's New Data Architecture
- Putting Apache Kafka to Use
- Metric Driven Development
- The Unified Logging Infrastructure for Data Analytics at Twitter
- Stream Processing and Mining just got more interesting
- How to Beat the CAP Theorem
- Beating the CAP Theorem Checklist
Copyright 2015 Jonathan Dinu.
All files and content licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License