Kathmandu University Department of Computer Science and Engineering
Subject: Data Acquisition Management System
Course Code: AIAC 557
Level: MTech in AI, Year 1, Semester II
Credit Hours: 3
Type: Elective [Theory + Practical]
After completion of the course, students should be able to:
- acquire data from various sources and ingest it into a data store (data lake, EDW, data mart, Delta Lake)
- work in a cloud ecosystem and complete a certification from one cloud provider (AWS, Google Cloud, Azure)
- create ETL and ELT pipelines using various data processing tools (SQL, dbt, Dataform (part of BigQuery), Spark) for different applications, including data visualization and reporting (understand BI concepts)
- demonstrate understanding of data management issues, data quality and data governance
- perform the basics of ML operations: deploying a model in production (batch/) and monitoring the performance of the model (data drift / model drift)
- orchestrate and create pipelines (Airflow), and hand over a pipeline to the operations team while taking care of data management aspects such as incident management
- apply data governance: creating a data catalogue, tracing data lineage, identifying personal information in data, and using standard data models and open APIs
- gather requirements from business stakeholders and follow the design flow: the enterprise architect builds the high-level design, the solution architect produces the mid-level design, and the data engineer builds what the solution architect designed
- Python Programming: NumPy, Pandas, Matplotlib, REST API, Web Scraping
- Linux
- Git and GitHub
- Basic Data Science
In-Semester Evaluation - 60 marks
End-Semester Evaluation - 40 marks
Chapter 1: Introduction to Data Science, Engineering and Management [4 Hr]
- Introduction to Data Science, Data Engineering and Data Management
- DIKW Pyramid and its issues
- Big Data and Big Data Ecosystem
- Data Lifecycle
- Data Management Principles and Challenges
- Data Management Strategy and Frameworks
- Data Engineering in Data Science (or ML) Lifecycle
- Important Python Libraries for EDA: NumPy, Pandas, Matplotlib, Seaborn
- REST API, Requests and Web Scraping
- Text processing, extraction and classification
- Data Science
- Git and GitHub
- Linux and Shell Scripting
- Docker
- Distributed Systems
Chapter 2: Data Handling [12 Hr]
- Data Acquisition and Ingestion
- Data Formats
- Web Scraping: Scrapy and Beautiful Soup
- Data Quality
- Data Wrangling and Cleaning
- Data Handling Ethics
- Data Governance
- Data Processing
- Hadoop and MapReduce
- Apache Spark: RDDs, DataFrames, SQL, MLlib, Streaming, GraphX
- Data Streams
- Apache Kafka: Topics, Partitions, Producers, Consumers, Kafka Connect
- Apache Flume
- YARN and Zookeeper
- Cloud Services provided by AWS, GCP, Azure
- AWS: EC2, AMI, EBS, S3, RDS, Athena, Redshift, Lambda, CloudWatch, Glue (for ETL jobs) and EMR
- GCP: Cloud Storage, Data Fusion, BigQuery, Dataproc and Dataflow
- Azure: Data Factory, SQL Database, Blob Storage, HDInsight, Databricks
Chapter 3: Data Modelling, Design and Storage [15 Hr]
- Data Storage
- Distributed Storage: GFS & HDFS
- Database Schema and Notations
- Relational Database Management System (RDBMS)
- Relationship and Entity Relationship Diagram
- Object-oriented DB and UML Notations
- Document Database: MongoDB
- Columnar (Wide-Column) Database: Cassandra, HBase
- Key-Value Pair DB: Redis
- Graph Database: Neo4j
- Multi-support, multi-paradigm database:
- Search DB: Elasticsearch
- Time-series DB
- Cloud Data Warehouse
- Data Lake and Data Mesh
- Snowflake: Star Schema and Snowflake Schema
- ETL and ELT pipeline
Chapter 4: Data Architecture and Orchestration
- Importance of Data Architecture
- Lambda and Kappa Architectures
- Data Architecture concepts and practices
- Data Engineering Architectures and Pipelines
- Orchestration
- Streaming Pipelines
- Model Deployment
- Model Monitoring
Chapter 5: Data Management Concepts
- Data Governance
- Data Security
- Data Integration and Interoperability
- Context Management
- Metadata Management
- Data Management Maturity
- Organizational Change Management
Optional: System Design
- System Design: System Components
- Scaling Data Systems
- Distributed System Design
- System Design Patterns for distributed systems
- Case Studies
References:
- DAMA-DMBOK2: Data Management Body of Knowledge
- Fundamentals of Data Engineering by Joe Reis and Matt Housley
- Data Pipelines Pocket Reference by James Densmore
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak and Reuven Lax
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann