The dat5_bos from wittedhaddock

DAT5 Course Repository

Course materials for General Assembly's Data Science course in Boston, MA (20 January 2015 - 07 April 2015). View student work in the student repository.

Instructor: Bryan Balin. Teaching Assistant: Harish Krishnamurthy.

Office hours: Wednesday 5-6pm; Friday 5-6:30pm, @Boston Public Library; as needed Tuesdays and Thursdays @GA.

Course Project information

Tuesday	Thursday
1/20: Introduction	1/22: Python & Pandas
1/27: Git and GitHub	1/29: SQL
2/3: Advanced Pandas (Milestone: Question and Data Set)	2/5: Numpy, Machine Learning, KNN
2/10: scikit-learn, Model Evaluation Procedures	2/12: Linear Regression
2/17: Logistic Regression, Preview of Other Models	2/19: Model Evaluation Metrics (Milestone: Data Exploration and Analysis Plan)
2/24: Working a Data Problem	2/26: Clustering and Visualization (Milestone: Deadline for Topic Changes)
3/3: Naive Bayes	3/5: Natural Language Processing
3/10: Decision Trees and Ensembles (Milestone: First Draft)	3/12: Advanced scikit-learn
3/17: No Class	3/19: Databases and MapReduce
3/24: Recommenders	3/26: Course Review, Companion Tools (Milestone: Second Draft, Optional)
3/31: TBD	4/2: Project Presentations
4/7: Project Presentations

Installation and Setup

Install the Anaconda distribution of Python 2.7.x.
Install Git and create a GitHub account.
Once you receive an email invitation from Slack, join our "datbos05 team" and add your photo!

Class 1: Introduction

Introduction to General Assembly
Course overview: our philosophy and expectations (slides)
Data science overview (slides)
Tools: check for proper setup of Anaconda, overview of Slack
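
One way to check the Anaconda setup is a short script that tries to import the main packages and reports their versions. This is only a sketch: the package list is an assumption based on the topics in the schedule, and `check_setup` is a hypothetical helper, not part of the course materials.

```python
from __future__ import print_function  # keeps print() working on Python 2.7

import importlib
import sys

# Packages the course is likely to need; this list is an assumption.
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_setup(packages=PACKAGES):
    """Map each package name to its version string, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, version in sorted(check_setup().items()):
        print(name, version if version else "NOT INSTALLED")
```

Any package reported as NOT INSTALLED is worth resolving before the next class.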

Homework:

Resolve any installation issues before next class.

Optional:

Review the code for a recap of some Python basics.
Read Analyzing the Analyzers for a useful look at the different types of data scientists.
Check out the PyData Boston Meetup page to become acquainted with the local data community.

Class 2: Python & Pandas

slides. Python refresher code. Python code. Pandas code.

Brief overview of Python
Brief overview of Python environments: Python scripting, IPython interpreter, Spyder
Working with data in Pandas
- Loading and viewing data
- Indexing and selecting data
- Assigning, reassigning, and splitting data
- Describing and summarizing data
- Plotting data
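
The Pandas steps listed above can be sketched on a small made-up table. The column names and values are hypothetical and are not taken from the class data set. The code is written for a current pandas release, so details may differ from the 2015-era version used in class.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for a real CSV file.
CSV_TEXT = """player,team,year,at_bats,hits
Ann,BOS,2013,500,150
Bob,NYY,2013,420,105
Cat,BOS,2014,610,183
Dan,NYY,2014,300,60
"""

# Loading and viewing data
df = pd.read_csv(io.StringIO(CSV_TEXT))
first_rows = df.head(2)

# Indexing and selecting data
boston = df[df["team"] == "BOS"]                # rows matching a condition
names = df.loc[df["at_bats"] > 400, "player"]   # matching rows, one column

# Assigning a new column derived from existing ones
df["average"] = df["hits"] / df["at_bats"]

# Describing and summarizing data
summary = df["at_bats"].describe()
mean_by_team = df.groupby("team")["at_bats"].mean()

# Plotting data (requires matplotlib, so left commented out here)
# df.plot(kind="scatter", x="at_bats", y="hits")
```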

Homework:

Do the class homework by Tuesday.
Read through the project page in detail.
Review a few projects from past Data Science courses to get a sense of the variety and scope of student projects.

Optional:

If you need more practice with Python, review the "Python Overview" section of A Crash Course in Python, work through some of Codecademy's Python course, or work through Google's Python Class and its exercises.
For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford.

Resources:

Online Python Tutor is useful for visualizing (and debugging) your code.

Class 3: Git and GitHub

Homework:

Check for proper setup of Git by forking the data science project examples and cloning your fork to your local hard drive.
Download the following for Class 4:
- SQLite. Please make sure to download the precompiled binaries for your OS, NOT the source code.
- Sublime Text Editor.
- DB Visualizer. Please download the free version.
- Baseball archive for SQLite.

Class 4: SQL

slides

Overview of the baseball archive

Installation of SQLite, Sublime, DB Visualizer, and our dataset
The SELECT statement
The WHERE clause
ORDER BY
LEFT JOIN and INNER JOIN
GROUP BY
DISTINCT
CASE statements
Subqueries and IS NOT NULL
CREATE TABLE
Using Pandas and SQL Seamlessly
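
Several of the SQL topics above can be tried without any setup by using Python's built-in `sqlite3` module and an in-memory database. The `players` and `batting` tables below are hypothetical and much simpler than the baseball archive, so this is a sketch of the syntax rather than a query against the class data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE batting (player_id INTEGER, year INTEGER, team TEXT, at_bats INTEGER);
INSERT INTO players VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat');
INSERT INTO batting VALUES
    (1, 2013, 'BOS', 500),
    (1, 2014, 'BOS', 550),
    (2, 2014, 'NYY', 300);
""")

# SELECT with WHERE and ORDER BY
rows_2014 = conn.execute("""
    SELECT player_id, year, at_bats
    FROM batting
    WHERE year = 2014
    ORDER BY at_bats DESC
""").fetchall()

# LEFT JOIN keeps players with no batting rows; GROUP BY and CASE label each one
career = conn.execute("""
    SELECT p.name,
           COALESCE(SUM(b.at_bats), 0) AS total_at_bats,
           CASE WHEN SUM(b.at_bats) >= 1000 THEN 'regular'
                WHEN SUM(b.at_bats) IS NULL THEN 'no record'
                ELSE 'part-time' END AS label
    FROM players p
    LEFT JOIN batting b ON b.player_id = p.player_id
    GROUP BY p.player_id, p.name
    ORDER BY total_at_bats DESC
""").fetchall()

# DISTINCT with a subquery: teams that have an above-average season
busy_teams = conn.execute("""
    SELECT DISTINCT team
    FROM batting
    WHERE at_bats > (SELECT AVG(at_bats) FROM batting)
""").fetchall()
```

Replacing LEFT JOIN with INNER JOIN in the second query would drop Cat, who has no rows in `batting`.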

Homework:

Complete the in-class exercises, if you haven't already:
- Find the player with the most at-bats in a single season.
- Find the name of the player with the most at-bats in baseball history.
- Find the average number of at-bats of players in their rookie season.
- Find the average number of at-bats of players in their final season for all players born after 1980.
- Find the average number of at-bats of Yankees players who began their second season in or after 1980.
- Pass the SQL in the previous bullet into a pandas DataFrame and write it back to SQLite.
Create full, working queries to answer at least four novel questions you have about the dataset using the following concepts:
- The WHERE clause
- ORDER BY
- LEFT JOIN and INNER JOIN
- GROUP BY
- SELECT DISTINCT
- CASE statements
- Subqueries and IS NOT NULL
Using Pandas, (1) query the Baseball dataset, (2) transform the data in some way, and (3) write a new table back to the database.
Commit and Sync your SQL and Pandas files to your GitHub fork and issue a pull request.
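
The query, transform, and write-back steps could look roughly like the sketch below. It uses an in-memory database with a hypothetical `batting` table and a hypothetical output table named `career_totals`. The real baseball archive has a different schema, so the SQL would need to be adapted.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the SQLite file used in class.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batting (player TEXT, year INTEGER, at_bats INTEGER, hits INTEGER);
INSERT INTO batting VALUES
    ('Ann', 2013, 500, 150),
    ('Ann', 2014, 550, 165),
    ('Bob', 2014, 300, 60);
""")

# (1) Query the database into a DataFrame
df = pd.read_sql_query("SELECT player, year, at_bats, hits FROM batting", conn)

# (2) Transform: career totals per player, plus a derived column
totals = df.groupby("player", as_index=False)[["at_bats", "hits"]].sum()
totals["average"] = totals["hits"] / totals["at_bats"]

# (3) Write the result back as a new table
totals.to_sql("career_totals", conn, index=False, if_exists="replace")

# Read the new table back with plain SQL to confirm it was written
check = conn.execute(
    "SELECT player, at_bats, hits FROM career_totals ORDER BY player"
).fetchall()
```

Passing `if_exists="replace"` makes the script safe to rerun, at the cost of overwriting any existing table with the same name.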

Resources:

SQLite homepage
SQLite Syntax

SQL Tutorials:

Note: These tutorials are for all flavors of SQL, not just SQLite, so some of the functions may behave differently in SQLite.
SQL tutorial
SQLZoo


