Data Engineering

How would you bootstrap a Data Team? Below is a "laundry list" of tasks, resources, job profiles, and blueprints for building a dream data team. Mission: manage all the data, learn from it, and deliver concrete, tangible business results to the rest of the organization.

Team

All profiles share both development and production workloads.

  • 1x Infra (Physical/bare metal, Metal as a Service, IaaS)
  • 1x DevOps (Stack automation, Containers, and Platform as a Service)
  • 2x Data Engineer (Data Pipelines, Data Automation, Data as a Service, Data Ingestion)
  • 2x Analytics (1x Batch Analytics, 1x Real-Time Analytics and Predictive APIs)
  • 1x AI and ML (Machine Learning and AI algorithms)
  • 1x Front-End Dev (Web and Js developer, web and mobile apps)

Extended team: for small-scale projects, the profiles above can cover the extended team's tasks:

  • 1x Network Architect
  • 1x Security Engineer
  • 1x Community Writer
  • 1x Data Viz Developer

Getting the team started

Read, learn, memorize, and practice the following, starting with the PySpark Python documentation.

Principles

  • Python is the default language for the data stack
  • Engineers over Scientists
  • Convention over Configuration over Coding
  • More thinking, less typing
  • Keep it simple
  • Re-use over Integrate over Build
  • Service Oriented (CLI, Web HTTP APIs, Python libraries)
  • Be kind, be curious

Tool-Specific Principles

Spark

Spark is a library with many coexisting layers. Mostly for backward compatibility, some of these older APIs are still around, both in the tool itself and on the web, where many Q&As about them are still circulating. When learning Spark, please follow these principles:

  • pyspark only (https://spark.apache.org/docs/latest/api/python/)
  • Learn ONLY the modules pyspark.sql and pyspark.ml (see the sketch below)
  • Skip anything related to Scala, Java, or R (per the general principles above)
  • Categorically skip anything about RDDs, MapReduce, and MLlib
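
As a minimal, hedged sketch of these principles in practice, the example below uses only pyspark.sql and pyspark.ml; the file path, column names, and model choice are hypothetical placeholders, not part of this repo.

```python
# Sketch under the above assumptions: DataFrames via pyspark.sql,
# models via pyspark.ml Pipelines. No RDDs, no MLlib.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("getting-started").getOrCreate()

# pyspark.sql: load and split the data as a DataFrame
df = spark.read.parquet("data/events.parquet")  # hypothetical path
train, test = df.randomSplit([0.8, 0.2], seed=42)

# pyspark.ml: assemble features and fit a simple classifier in a Pipeline
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(test).select("label", "prediction").show()
```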

Git

Git is meant for people to experiment, develop, and merge working code onto a common master branch.
Please follow these principles:

  • No working branches on the shared remote (gitlab/github) repo
  • Pull requests over commits
  • Diff and test the code before committing

Best way of working:

  • Fork the common remote repository on GitLab/GitHub
  • Add multiple remotes to keep track of changes, and rebase/merge if necessary
  • Commit to your own remote and open a pull request against the main repo

Hardware

Storage 2-sockets
($22.5k, 16C, 256GB, 144TB, 2x10Gb):

  • PowerEdge R740xd (2 sockets, 2U, 12 x 3.5 bays)
  • (2x) Intel® Xeon® Silver 4110 2.1G, 8C/16T, 9.6GT/s, 11M Cache, Turbo, HT (85W) DDR4-2400
  • (8x) 32GB RDIMM, 2666MT/s, Dual Rank
  • (12x) 12TB 7.2K RPM SATA 6Gbps 512e 3.5in Hot-plug Hard Drive
  • Broadcom 57416 2 Port 10Gb Base-T + 5720 2 Port 1Gb Base-T, rNDC

Vanilla 2-sockets
($23k, 28C, 512GB, 64TB, 2x10Gb):

  • PowerEdge R740xd (2 sockets, 2U, 12 x 3.5 bays)
  • (2x) Intel® Xeon® Gold 5120 2.2G, 14C/28T, 10.4GT/s, 19M Cache, Turbo, HT (105W) DDR4-2400
  • (16x) 32GB RDIMM, 2666MT/s, Dual Rank
  • (8x) 8TB 7.2K RPM NLSAS 12Gbps 512e 3.5in Hot-plug Hard Drive
  • Broadcom 57416 2 Port 10Gb Base-T + 5720 2 Port 1Gb Base-T, rNDC

Compute 4 sockets
($25k, 40C, 512GB, 9.6TB, 2x10Gb):

  • PowerEdge R830 (4 sockets, 2U, 16 x 2.5”)
  • (4x) Intel® Xeon® E5-4620 v4 2.1GHz,25M Cache,8.0 GT/s,Turbo,HT,10C/20T (105W) DDR4 2133 MHz
  • (16x) 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width
  • (4x) 2.4TB 10K RPM SAS 12Gbps 512e 2.5in Hot-plug Hard Drive
  • QLogic 57800 2x10Gb BT + 2x1Gb BT Network Daughter Card

Resources

Use cases

Refer to usecases.md for a more detailed description of the following e-retail and e-commerce use cases.

  • Recommendation engines
  • Market basket analysis
  • Warranty analytics
  • Price optimization
  • Inventory management
  • Customer sentiment analysis
  • Merchandising
  • Lifetime value prediction
  • Fraud detection
  • Option Selection (A/B testing)
  • UX automation
  • Data mining
  • Chatbots

Architecture

  • Management

    • Versioning: gitlab
    • CI-CD Framework: gitlab-ci
    • Resource Management: Kubernetes
  • Storage

    • Storage (Landing Storage): HDFS
    • Storage (Object Store): Minio
    • Storage (Block Storage): Ceph
  • Data Transport

    • Pub/Sub Infrastructure: Kafka
  • Data Analytics

    • Indexed data: Elasticsearch
    • Ingestion and ETL Framework: Spark
    • Data Science: Pandas/Scikit-Learn
    • BI Visualization: Kibana
  • Data Formats (see the sketch below)

    • ETL: Parquet
    • Indexes: Elasticsearch
    • DS/ML: HDF5
    • Cache: Redis
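
As a rough sketch of how these format conventions could look in code (assuming pandas, PyTables, and redis-py; the paths, key, and sample DataFrame are hypothetical):

```python
# Hedged sketch of the format conventions above; not prescriptive.
import pandas as pd
import redis

df = pd.DataFrame({"user_id": [1, 2], "amount": [9.99, 42.0]})

# ETL: columnar Parquet files for batch pipelines
df.to_parquet("etl/orders.parquet")

# DS/ML: HDF5 for feature matrices and model inputs (requires PyTables)
df.to_hdf("ml/features.h5", key="orders", mode="w")

# Cache: small, hot lookups in Redis
r = redis.Redis(host="localhost", port=6379)
r.set("orders:1:amount", 9.99)
```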

Data Architecture

Ingestion Framework

Sources

  • DBs
  • Files
  • Streams

Targets

  • Change Logs
  • Table Snapshots
  • Data Views

Principles and Solutions:

  • Mutable Tables to Immutable Change Logs
  • Idempotency / Repeatability (see the sketch after this list)
  • Record Schema Changes
  • Implement validation strategies
  • Automated Data pipelines
  • Implement a Data Access Strategy
  • Define Merge Commit behavior
  • Create Commit Robots/Agents
  • Define Promotion Strategy
  • Define a Data Sampling strategy
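
To make the idempotency/repeatability and immutable change-log principles concrete, here is a hedged PySpark sketch; the source snapshot, paths, and the `ingest_date` partition column are hypothetical:

```python
# Sketch: each run writes an immutable change-log partition, and re-running
# the same day overwrites only that partition, so the job is repeatable.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-orders").getOrCreate()

# Overwrite only the partitions touched by this run, not the whole dataset.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Hypothetical source: a snapshot of a mutable table, landed earlier.
src = spark.read.parquet("landing/orders_snapshot.parquet")

change_log = (src
              .withColumn("ingest_date", F.current_date())
              .withColumn("op", F.lit("upsert")))

(change_log.write
    .mode("overwrite")
    .partitionBy("ingest_date")
    .parquet("datalake/change_logs/orders"))
```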

Monitoring and CI-CD Pipeline:

  • Dashboard of ingested data
  • Automation of Ingestion pipeline
  • Monitor Resources (Compute, Storage, Bandwidth, Time)

Data Science / ML:

  • Auto-Config: frequency of ingestion, extraction parameters
  • Extract columns (datetime, indexes)
  • Detect outliers in the quality of ingested data
  • Detect duplicated / derivative columns (see the sketch below)
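
A rough pandas sketch of two of these checks, duplicated-column detection and a simple outlier flag; the DataFrame, column names, and the 3-sigma cutoff are illustrative assumptions:

```python
# Hedged sketch of basic data-quality checks on a freshly ingested batch.
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 12.0, 11.5, 9.8],
    "amount_copy": [10.0, 12.0, 11.5, 9.8],
    "qty": [1, 2, 1, 3],
})

# Duplicated / derivative columns: transpose and look for identical rows.
dupes = df.T.duplicated()
print("duplicated columns:", dupes[dupes].index.tolist())

# Simple outlier flag: values more than 3 standard deviations from the mean.
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
print("outlier rows:", df.index[z.abs() > 3].tolist())
```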

Watch out for:

  • Poor Monitoring
  • Data Loss/Overwrites
  • Misconfigurations
  • Poor ACL

ETL Framework

Sources

  • Change Logs
  • Data Tables
  • Data Objects

Target

  • Stars (Facts and Dimension Tables)
  • Facts Tables (De-normalized)
  • Cubes (Aggregated Facts)

Principles and Solutions

  • Reduce heavy joins
  • Reproducible reporting
  • Data Historical validation
  • Fast dicing/slicing of data (see the sketch after this list)
  • Streaming Analytics
  • Curated Glossary/Data Dictionary
  • Curated Data Access
  • Curated Data security/privacy
  • Rendering of Reports
  • Solving the Ad-Hoc Reports
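
As a hedged illustration of the star-schema targets and the "fast dicing/slicing" principle referenced above, here is a rough PySpark sketch; the fact/dimension tables, paths, and columns are hypothetical:

```python
# Sketch: denormalize a fact table against a dimension once, then pre-aggregate
# a cube so analysts can slice/dice without heavy joins at query time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-star").getOrCreate()

facts = spark.read.parquet("datalake/facts/sales")        # hypothetical fact table
products = spark.read.parquet("datalake/dims/products")   # hypothetical dimension

# De-normalized fact table: one broadcast join now instead of repeated joins later.
wide = facts.join(F.broadcast(products), on="product_id", how="left")
wide.write.mode("overwrite").parquet("datalake/marts/sales_wide")

# Cube: aggregated facts across the dimensions most often sliced.
cube = (wide.cube("category", "country")
            .agg(F.sum("amount").alias("revenue"),
                 F.count(F.lit(1)).alias("orders")))
cube.write.mode("overwrite").parquet("datalake/marts/sales_cube")
```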

Data Science / ML:

  • Attention filtering reports
  • Smart table joins
  • ETL pipeline tuning

Monitoring and CI-CD Pipeline:

  • Dashboard of ETL Data
  • Automate Rendering
  • Automate ETL Generation
  • Monitor Resources (Compute, Storage, Bandwidth, Time)

Watch out for:

  • No Curated Data
  • Poor understanding
  • Data Pollution
  • Conceptual errors
  • Poor data monitoring

Data Structures

From lower to higher abstraction levels:

  • Raw sources
  • Logs
  • Stars and Snowflakes
  • Cubes
  • Longitudinal
  • Domain Collections


Data Science

Data science must be conducted with a predictable and well-defined process.
All experiments should be conducted according to the following steps:

  1. problem statement
  2. hypotheses and assumptions
  3. expected results
  4. sketched solution
  5. validate assumptions
  6. collect results
  7. analyze results
  8. story telling
  9. lessons learned

The main focus of data science is to understand the whys behind the numbers, so it is very important to:

  • be critical of both results and assumptions
  • rely on evidence based on maths, in particular statistics, rather than on opinions (see the sketch below)
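
As a small sketch of what "evidence over opinions" can look like in practice, assume a hypothetical A/B experiment (see the Option Selection use case above) evaluated with a two-sample t-test; the data and the 5% threshold are illustrative:

```python
# Hedged sketch: back the claim "variant B converts better" with a test, not an opinion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.binomial(1, 0.10, size=5000)  # hypothetical conversion flags, variant A
variant = rng.binomial(1, 0.12, size=5000)  # hypothetical conversion flags, variant B

t_stat, p_value = stats.ttest_ind(variant, control)
print(f"t={t_stat:.2f}, p={p_value:.4f}")

if p_value < 0.05:
    print("Evidence of a difference at the 5% significance level.")
else:
    print("No significant difference; do not claim a winner yet.")
```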
