Course materials for General Assembly's Data Science course in Boston, MA (20 January 2015 - 07 April 2015). View student work in the student repository.
Instructor: Bryan Balin. Teaching Assistant: Harish Krishnamurthy.
Office hours: Wednesday 5-6pm; Friday 5-6:30pm, @Boston Public Library; as needed Tuesdays and Thursdays @GA.
Tuesday | Thursday |
---|---|
1/20: Introduction | 1/22: Python & Pandas |
1/27: Git and GitHub | 1/29: SQL |
2/3: Advanced Pandas; Milestone: Question and Data Set | 2/5: Numpy, Machine Learning, KNN |
2/10: scikit-learn, Model Evaluation Procedures | 2/12: Linear Regression |
2/17: Logistic Regression, Preview of Other Models | 2/19: Model Evaluation Metrics; Milestone: Data Exploration and Analysis Plan |
2/24: Working a Data Problem | 2/26: Clustering and Visualization; Milestone: Deadline for Topic Changes |
3/3: Naive Bayes | 3/5: Natural Language Processing |
3/10: Decision Trees and Ensembles; Milestone: First Draft | 3/12: Advanced scikit-learn |
3/17: No Class | 3/19: Databases and MapReduce |
3/24: Recommenders | 3/26: Course Review, Companion Tools; Milestone: Second Draft (Optional) |
3/31: TBD | 4/2: Project Presentations |
4/7: Project Presentations | |
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "datbos05 team" and add your photo!
- Introduction to General Assembly
- Course overview: our philosophy and expectations (slides)
- Data science overview (slides)
- Tools: check for proper setup of Anaconda, overview of Slack
Homework:
- Resolve any installation issues before next class.
Optional:
- Review the code for a recap of some Python basics.
- Read Analyzing the Analyzers for a useful look at the different types of data scientists.
- Check out the PyData Boston Meetup page to become acquainted with the local data community.
slides. Python refresher code. Python code. Pandas code.
- Brief overview of Python
- Brief overview of Python environments: Python scripting, IPython interpreter, Spyder
- Working with data in Pandas
  - Loading and viewing data
  - Indexing and selecting data
  - Assigning, reassigning, and splitting data
  - Describing and summarizing data
  - Plotting data
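As a rough illustration of the Pandas topics above (plotting aside), here is a minimal sketch. The sample data and column names (`player`, `team`, `at_bats`, `hits`) are made up for this example and are not the class dataset. It is written for a current Python 3 and pandas, so small details may differ on the course's Python 2.7 setup.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """player,team,year,at_bats,hits
A,BOS,2013,500,150
B,BOS,2013,300,60
C,NYA,2013,450,135
D,NYA,2014,200,40
"""

# Loading and viewing data
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Indexing and selecting data: a boolean filter and a single cell
bos = df[df["team"] == "BOS"]
first_row_hits = df.loc[0, "hits"]

# Assigning: add a derived column
df["avg"] = df["hits"] / df["at_bats"]

# Splitting: two subsets based on a threshold
regulars = df[df["at_bats"] >= 400]
part_time = df[df["at_bats"] < 400]

# Describing and summarizing data
summary = df["at_bats"].describe()
mean_by_team = df.groupby("team")["at_bats"].mean()
print(summary)
print(mean_by_team)
```

With matplotlib installed, `df["at_bats"].plot(kind="hist")` would be one way to cover the plotting item.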
Homework:
- Do the class homework by Tuesday.
- Read through the project page in detail.
- Review a few projects from past Data Science courses to get a sense of the variety and scope of student projects.
Optional:
- If you need more practice with Python, review the "Python Overview" section of A Crash Course in Python, work through some of Codecademy's Python course, or work through Google's Python Class and its exercises.
- For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford.
Resources:
- Online Python Tutor is useful for visualizing (and debugging) your code.
Homework:
- Check for proper setup of Git by forking the data science project examples and pulling the fork to your local hard drive.
- Download the following for Class 4:
  - SQLite. Please make sure to download the precompiled binaries for your OS, NOT the source code.
  - Sublime Text Editor.
  - DB Visualizer. Please download the free version.
  - Baseball archive for SQLite.
- Overview of the baseball archive
- Installation of SQLite, Sublime, DB Visualizer, and our dataset
- The SELECT statement
- The WHERE clause
- ORDER BY
- LEFT JOIN and INNER JOIN
- GROUP BY
- DISTINCT
- CASE statements
- Subqueries and IS NOT NULL
- CREATE TABLE
- Using Pandas and SQL Seamlessly
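The SQL concepts listed above can be tried without any setup by running them against an in-memory SQLite database from Python. This is only a sketch: the `players` and `batting` tables, their columns, and the rows are hypothetical stand-ins, not the schema of the baseball archive.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (player_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE batting (player_id TEXT, year INTEGER, team TEXT, at_bats INTEGER);
INSERT INTO players VALUES ('p1', 'Ann');
INSERT INTO players VALUES ('p2', 'Bob');
INSERT INTO players VALUES ('p3', 'Cy');
INSERT INTO batting VALUES ('p1', 2013, 'BOS', 500);
INSERT INTO batting VALUES ('p1', 2014, 'BOS', 450);
INSERT INTO batting VALUES ('p2', 2013, 'NYA', 300);
""")

# LEFT JOIN keeps players with no batting rows (Cy); GROUP BY aggregates
# per player; CASE labels each player; ORDER BY fixes the row order.
summary_sql = """
SELECT p.name,
       COUNT(b.year)  AS seasons,
       SUM(b.at_bats) AS total_at_bats,
       CASE
           WHEN SUM(b.at_bats) IS NULL THEN 'no seasons'
           WHEN SUM(b.at_bats) >= 500  THEN 'regular'
           ELSE 'part-time'
       END AS role
FROM players AS p
LEFT JOIN batting AS b ON b.player_id = p.player_id
GROUP BY p.player_id, p.name
ORDER BY p.name
"""
rows = conn.execute(summary_sql).fetchall()

# A subquery with DISTINCT, WHERE, and IS NOT NULL:
# players who have at least one recorded season.
active_sql = """
SELECT name
FROM players
WHERE player_id IN (
    SELECT DISTINCT player_id FROM batting WHERE at_bats IS NOT NULL
)
ORDER BY name
"""
active = [r[0] for r in conn.execute(active_sql)]

for row in rows:
    print(row)
print(active)
```

Swapping `LEFT JOIN` for `INNER JOIN` in the first query would drop the player with no batting rows, which is a quick way to see the difference between the two joins.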
Homework:
- Complete the in-class exercises, if you haven't already:
  - Find the player with the most at-bats in a single season.
  - Find the name of the player with the most at-bats in baseball history.
  - Find the average number of at_bats of players in their rookie season.
  - Find the average number of at_bats of players in their final season for all players born after 1980.
  - Find the average number of at_bats of Yankees players who began their second season at or after 1980.
  - Pass the SQL in the previous bullet into a pandas DataFrame and write it back to SQLite.
- Create full, working queries to answer at least four novel questions you have about the dataset using the following concepts:
  - The WHERE clause
  - ORDER BY
  - LEFT JOIN and INNER JOIN
  - GROUP BY
  - SELECT DISTINCT
  - CASE statements
  - Subqueries and IS NOT NULL
- Using Pandas, (1) query the Baseball dataset, (2) transform the data in some way, and (3) write a new table back to the database.
- Commit and sync your SQL and Pandas files to your GitHub fork and issue a pull request.
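For the Pandas and SQLite round trip in the homework, one possible shape is sketched below. It uses an in-memory database with a made-up `batting` table in place of the baseball archive, and `career_avg` is a hypothetical name for the new table. With the real archive you would connect to the downloaded file and use its actual table and column names.

```python
import sqlite3

import pandas as pd

# Stand-in for the baseball archive: an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting "
    "(player_id TEXT, year INTEGER, at_bats INTEGER, hits INTEGER)"
)
conn.executemany(
    "INSERT INTO batting VALUES (?, ?, ?, ?)",
    [("p1", 2013, 500, 150), ("p1", 2014, 500, 100), ("p2", 2013, 200, 40)],
)
conn.commit()

# (1) Query the database into a DataFrame.
df = pd.read_sql_query(
    "SELECT player_id, SUM(at_bats) AS at_bats, SUM(hits) AS hits "
    "FROM batting GROUP BY player_id ORDER BY player_id",
    conn,
)

# (2) Transform the data: compute a career batting average.
df["batting_avg"] = df["hits"] / df["at_bats"]

# (3) Write the result back as a new table.
df.to_sql("career_avg", conn, index=False, if_exists="replace")

# Read the new table back to confirm it was written.
check = conn.execute(
    "SELECT player_id, batting_avg FROM career_avg ORDER BY player_id"
).fetchall()
print(check)
```

Passing `if_exists="replace"` lets the script be rerun without failing on an existing table; use `"append"` or `"fail"` if you would rather not overwrite earlier results.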
Resources:
- SQLite homepage
- SQLite Syntax

SQL Tutorials:
- Note: These tutorials are for all flavors of SQL, not just SQLite, so some of the functions may behave differently in SQLite.
- SQL tutorial
- SQLZoo