DS200 (Architecture for Management of Large Datasets)

This is the course web page for Architecture for Management of Large Datasets, taught at IIT Bhilai, India, in the Monsoon Semester of 2021.


Course Instructor: Dr. Gagan Raj Gupta

Other Instructors: Dr. Soumajit Pramanik, Dr. Subhajit Sidhanta

Teaching Assistants: Muttareddygari Sreechakra, Anirban Haldar

Canvas Link: https://canvas.instructure.com/courses/3570804

Motivation

Over the past few years, we have seen the emergence of "big data": disruptive technologies that have transformed commerce, science, and many aspects of society. These developments are enabled by infrastructure that allows us to distribute computations across hundreds or even thousands of commodity servers.

  • Getting data is becoming easier day by day, but we have too much of it to analyze (e.g. web, transactional data, text)
  • Data has errors of various types (missing values, incorrect values, etc.), is incomplete, and is hard to clean (e.g. user reviews/ratings, distorted images)
  • Data is usually high-dimensional, involving many columns or features (e.g. text, images, videos, graphs)
  • Data usually has complex correlations, so i.i.d. assumptions do not always hold (e.g. graph data, time-series data)
  • Data is being generated at great speed, and it is too expensive to store all of it (e.g. user or machine transactions, queries)

In this course, we want to learn how large datasets are maintained and analyzed. If a single computer is not enough, how do we use multiple computers (even datacenters) to analyze large datasets? How do we make programming easy for data analysis and ML?

One key breakthrough that makes this all possible is the development of abstractions for data-intensive computing that allow programmers to reason about computations at a massive scale, hiding low-level details such as synchronization, data movement, and fault tolerance.

This course provides an introduction to big data infrastructure, starting with MapReduce, the first of these datacenter-scale programming abstractions. The Hadoop implementation of MapReduce lies at the core of an application stack that has gained widespread adoption in both industry and academia. A major focus of this course is algorithm design and "thinking at scale", applied to a variety of domains: text, graphs, relational data, etc. We will also cover a few next generation systems that are vying to replace MapReduce as the de facto big data processing platform of tomorrow.
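
To make the MapReduce abstraction concrete, here is a minimal single-machine sketch of word count in plain Python. It only imitates the three stages (map, shuffle, reduce) in memory and is not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data at scale"]))
```

In a real cluster the mapper and reducer would run on different machines and the framework would handle the grouping, data movement, and fault tolerance; the programmer would still write only the two small functions.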

Course Objectives

  • Motivate the need for managing large datasets
  • Develop the architectural requirements for a data store (lake)
  • Introduce various distributed programming models and abstractions
  • Explain the new paradigm of algorithm design with MapReduce for handling large datasets
  • Introduce streaming algorithms for processing streaming data
  • Provide hands-on experience to students in analyzing datasets in diverse fields (Industry 4.0, NLP, Graphs, Networks, Bio-informatics, Time-series)
  • Understand the software architecture

Pre-requisites

  • Basic knowledge of Python (most assignments will be based on Python)
  • Knowledge of basic computer science principles and skills

Tentative Course Outline

| Lec # | Date | Topics covered in class | Text Book Reference, readings |
|---|---|---|---|
| 1 | Sep 28 | Large Datasets Examples; Data-center Architecture; Importance of Analysis; Requirements on Architecture for Managing Large Datasets; How would you analyze a large dataset?: Sequential vs. Parallel Programming; Higher levels of Abstraction for Parallel Programming: Datacenter is the new computer; Data Intensive and Data Parallel Computing; MapReduce introduction | DTP |
| 2 | Sep 30 | Von Neumann Model and current computers; Memory Hierarchy; Storage Technologies; Parallel Reads and Writes; Reliability and Cost Tradeoff with distributed file systems, latency and throughput; External Memory Algorithms: External Merge Sort | References |
| 3 | Oct 1 | Reliability Cost Tradeoff review; Streaming Model Introduction; Parallel Computing Models: synchronization; PRAM model: Computing Minimum with N/2 Processors via Tournament Method, Correctness, Solving with P<N/2 processors; BSP model; MapReduce Explained; Demo of scalability via MapReduce on word-count problems | References |
| 4 | Oct 5 | Hash Functions; Aggregation of Data using MapReduce; Optimizing MapReduce by using Combine and Partition; Conditions on using Combine; Example of Spotify and MailTrust; Word Count Optimized | MRDP, MMDS 2.1 |
| 5 | Oct 7 | Examples of Aggregation: Count, Min, Max, Average, Sum, Median, Percentiles; Relational Algebra, SQL and Pandas Examples; Optimizing Percentile Calculations | MRDP, MMDS 2.2 |
| 6 | Oct 8 | Tutorials on Input and Output of MapReduce; Regular Expressions, Shell Scripting; Map and Reduce in Python; Notion of efficiency of a parallel program; Compute Min (Comparison) efficiency; Amdahl's law | References |
| 7 | Oct 12 | Filtering patterns: Data cleaning, Bloom Filters for Set Membership | MRDP |
| 8 | Oct 14 | Filtering patterns: Top 10, Distinct Items; Operations on Multiple Relations (Tables, Datasets): Union, Intersection, Difference, Joins; Matrix Multiplication; Reservoir Sampling | MRDP |
| 9 | Oct 21 | Pipelining, Chaining, Bag operations, More patterns, Workflow Systems, Exam 1 Review/Prep | HDG, MMDS Ch2 |
| 10-12 | Oct 26, 28, 29 | Indexing with MapReduce, TF-IDF scoring with MapReduce, and language models with MapReduce | DTP |
| 13, 14 | Nov 2, 5 | BFS/DFS, PageRank, random walk | DTP |
| 15, 16, 17 | Nov 8-12 | Spark Architecture and Programming | MMDS, SDG |
| 18 | Nov 16 | Intro to Spark ML | SDG |
| | | HDFS, YARN, Hadoop I/O | HDG |
| | | Anatomy of MapReduce Job Run | HDG |
| | | Pig, Hive, ZooKeeper | HDG |
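
As a small illustration of the "Conditions on using Combine" topic from Lectures 4 and 5, the sketch below shows why a combiner must be associative and commutative. Averaging per-partition averages gives the wrong answer when partitions differ in size, while combining (sum, count) pairs is safe. This is a plain-Python sketch, not Hadoop code, and the names `combine`, `mean_with_combiner`, and `naive_mean_of_means` are hypothetical.

```python
def combine(pairs):
    """Fold (sum, count) pairs into one pair; usable as both combiner and reducer."""
    total, count = 0.0, 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total, count


def mean_with_combiner(partitions):
    """Correct mean: each mapper emits (value, 1), combined once per partition."""
    partials = [combine((x, 1) for x in part) for part in partitions]
    total, count = combine(partials)
    return total / count


def naive_mean_of_means(partitions):
    """Incorrect in general: the mean is not associative across partitions."""
    means = [sum(part) / len(part) for part in partitions]
    return sum(means) / len(means)


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [10]]  # two partitions of unequal size
    print(mean_with_combiner(data))   # true mean of all five values
    print(naive_mean_of_means(data))  # skewed toward the small partition
```

Carrying the count alongside the sum is what lets the partial results be merged in any order and any grouping, which is the property a framework needs in order to apply a combiner zero, one, or many times.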

Meeting Times

Books/References/Practice materials

Similar Courses
