
Applied-Data-Science-in-Python

Content for the class "Applied Data Science in Python"

Module 1: Applied Plotting, Charting & Data Representation in Python

In Week one you will be introduced to the principles of data visualization. This week’s assignment asks that you carefully read Alberto Cairo's work, Graphics Lies, Misleading Visuals. You will locate and identify a visual that displays misleading information. You will interpret the features of the visual in order to identify the mechanism(s) that is/are used by the "encoder" to mislead the "decoder." For each mechanism that you identify, you will explain how it was used to mislead.

In Week two you will delve into basic charting. For this week’s assignment, you will work with real-world CSV weather data. You will manipulate the data to display the minimum and maximum temperature for a range of dates and demonstrate that you know how to create a line graph using matplotlib. Additionally, you will demonstrate how to build composite charts by overlaying a scatter plot of record-breaking data for a given year.
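
A minimal sketch of this kind of composite chart, assuming a hypothetical `weather.csv` file with `Date`, `TMIN`, and `TMAX` columns (the actual assignment data and record-breaking criterion differ):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV with Date, TMIN, TMAX columns; the assignment data differs.
df = pd.read_csv("weather.csv", parse_dates=["Date"]).sort_values("Date")

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(df["Date"], df["TMIN"], color="tab:blue", label="Daily minimum")
ax.plot(df["Date"], df["TMAX"], color="tab:red", label="Daily maximum")
ax.fill_between(df["Date"], df["TMIN"], df["TMAX"], color="gray", alpha=0.2)

# Overlay "record-breaking" days as a scatter plot on top of the line chart.
records = df[df["TMAX"] > df["TMAX"].quantile(0.99)]  # placeholder criterion
ax.scatter(records["Date"], records["TMAX"], color="black", zorder=3,
           label="Record highs")

ax.set_xlabel("Date")
ax.set_ylabel("Temperature (°C)")
ax.legend(frameon=False)
plt.tight_layout()
plt.show()
```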

In Week three you will explore charting fundamentals. For this week’s assignment you will implement a new visualization technique based on academic research. This assignment is flexible, and you can address it at a range of difficulty levels, from a simple static image to an interactive chart where users can set the ranges of values to be used.

In Week four, everything comes together. Your final assignment is entitled “Becoming a Data Scientist.” This assignment requires that you identify at least two publicly accessible datasets from the same region that are consistent across a meaningful dimension. You will state a research question that can be answered using these datasets and then create a visual using matplotlib that addresses your stated research question. You will then be asked to justify how your visual addresses your research question.
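
One possible shape for such an analysis, assuming two hypothetical CSV files (`region_gdp.csv` and `region_unemployment.csv`) that share a `Year` column; the datasets, columns, and research question here are placeholders:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical datasets for the same region, both keyed by year.
gdp = pd.read_csv("region_gdp.csv")             # columns: Year, GDP
unemp = pd.read_csv("region_unemployment.csv")  # columns: Year, Rate

# Align the two datasets on their shared dimension (Year).
merged = gdp.merge(unemp, on="Year").sort_values("Year")

fig, ax1 = plt.subplots(figsize=(8, 4))
ax1.plot(merged["Year"], merged["GDP"], color="tab:blue")
ax1.set_xlabel("Year")
ax1.set_ylabel("GDP", color="tab:blue")

# Second axis so both series can be read against the same years.
ax2 = ax1.twinx()
ax2.plot(merged["Year"], merged["Rate"], color="tab:red")
ax2.set_ylabel("Unemployment rate (%)", color="tab:red")

ax1.set_title("Placeholder research question: how do GDP and unemployment co-vary?")
plt.tight_layout()
plt.show()
```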

Module 2: Applied Machine Learning in Python

In Week One, you will be introduced to basic machine learning concepts, tasks, and workflow using an example classification problem based on the K-nearest neighbors method, and implemented using the scikit-learn library. This week’s assignment has you work through the process of loading and examining a dataset, training a k-nearest neighbors classifier on the dataset, and then evaluating the accuracy of the classifier and using it to classify new data.
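
A minimal sketch of that workflow with scikit-learn, using the built-in iris dataset as a stand-in for the assignment's data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and examine a small example dataset (the assignment uses a different one).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a k-nearest neighbors classifier and evaluate its accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))

# Use the trained classifier on new data.
print("Prediction:", knn.predict([[5.0, 3.5, 1.5, 0.2]]))
```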

In Week Two, you will delve into a wider variety of supervised learning methods for both classification and regression, learning about the connection between model complexity and generalization performance, the importance of proper feature scaling, and how to control model complexity by applying techniques like regularization to avoid overfitting. In addition to k-nearest neighbors, this week covers linear regression (least-squares, ridge, lasso, and polynomial regression), logistic regression, support vector machines, decision trees, and the use of cross-validation for model evaluation. For this week’s assignment, you’ll explore the relationship between model complexity and generalization performance, by looking at the effect of key parameters on the accuracy of different classification and regression models.
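
A short sketch of the complexity-vs-generalization idea, assuming an RBF support vector machine whose regularization parameter C is swept with scikit-learn's validation_curve (the assignment's models and data differ):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature scaling lives inside the pipeline so it is fit only on training folds.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Sweep the regularization parameter C and compare train vs. cross-validated accuracy.
param_range = np.logspace(-3, 3, 7)
train_scores, test_scores = validation_curve(
    model, X, y, param_name="svc__C", param_range=param_range, cv=5)

for C, tr, te in zip(param_range,
                     train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"C={C:>8.3f}  train acc={tr:.3f}  cv acc={te:.3f}")
```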

In Week Three, you will cover evaluation and model selection methods that you can use to help understand and optimize the performance of your machine learning models. For this week’s assignment, you will train a classifier to detect fraudulent financial transactions, analyze its performance with different evaluation metrics, and then optimize the classifier’s performance based on different evaluation metrics, depending on the goals of the detection task (e.g. to minimize false positives vs false negatives).
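
A hedged sketch of the evaluation side, using a synthetic imbalanced dataset in place of the real transaction data and a logistic regression classifier as a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve)
from sklearn.model_selection import train_test_split

# Imbalanced toy data standing in for fraudulent vs. legitimate transactions.
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(confusion_matrix(y_test, clf.predict(X_test)))
print(classification_report(y_test, clf.predict(X_test), digits=3))

# Trading off false positives vs. false negatives by moving the decision threshold.
scores = clf.decision_function(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, scores)
mask = recall[:-1] >= 0.90
if mask.any():
    print("Highest threshold still achieving recall >= 0.90:", thresholds[mask][-1])
```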

In Week Four, you will cover more advanced supervised learning methods that include ensembles of trees (random forests, gradient boosted trees), and neural networks (with an optional summary on deep learning). You will also learn about the critical problem of data leakage in machine learning and how to detect and avoid it. The final assignment brings everything together: you will design features for, and build your own classifier on, a prediction problem on a complex real-world dataset.
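
A small sketch of the ensemble methods and of keeping preprocessing inside a pipeline, which is one common way to avoid leaking test-fold information during cross-validation (the dataset here is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Tree ensembles: a random forest and gradient boosted trees.
for clf in (RandomForestClassifier(n_estimators=200, random_state=0),
            GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, "CV accuracy:", scores.mean().round(3))

# Preprocessing inside the pipeline is fit only on the training folds,
# which guards against one common form of data leakage.
leak_free = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))
print("Pipeline CV accuracy:", cross_val_score(leak_free, X, y, cv=5).mean().round(3))
```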

Module 3: Applied Text Mining in Python

In Week One, you will be introduced to basic text mining tasks and learn to interpret text in terms of its building blocks, i.e. words and sentences. You will practice reading in text files, processing text, and addressing common issues with unstructured text. You will also learn how to write regular expressions to find and extract words and concepts that follow specific textual patterns, and you will be introduced to UTF-8 encoding and how multi-byte characters are handled in Python. This week’s assignment will focus on identifying dates using regular expressions and normalizing them.
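
A minimal sketch of that idea with a made-up sentence; the patterns below cover only a few formats, whereas the assignment's date formats are broader:

```python
import re
import pandas as pd

text = ("The patient was seen on 04/20/2009, again on 6 May 2010, "
        "and discharged 2011-07-15.")

# A few common date patterns; the assignment requires handling many more.
patterns = [
    r"\d{1,2}/\d{1,2}/\d{4}",      # 04/20/2009
    r"\d{1,2} [A-Z][a-z]+ \d{4}",  # 6 May 2010
    r"\d{4}-\d{2}-\d{2}",          # 2011-07-15
]

for pat in patterns:
    for match in re.findall(pat, text):
        # Normalize every extracted date to an ISO-style string.
        print(match, "->", pd.to_datetime(match).strftime("%Y-%m-%d"))
```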

In Week Two, you will delve into NLTK, a very popular toolkit for processing text in Python. Through NLTK, you will be introduced to common natural language processing tasks and how to extract semantic meaning from text. For this week’s assignment, you’ll get hands-on experience using NLTK to process and derive meaningful features and statistics from text.
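
A small sketch of basic NLTK processing on a made-up string; the exact tokenizer data packages needed depend on your NLTK version:

```python
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer models; newer NLTK releases use "punkt_tab", older ones "punkt".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = ("Natural language processing with NLTK makes it easy to tokenize text, "
        "split it into sentences, and compute simple statistics.")

sentences = sent_tokenize(text)
tokens = word_tokenize(text)

print("Sentences:", len(sentences))
print("Tokens:", len(tokens))
print("Unique tokens:", len(set(tokens)))
print("Most common:", FreqDist(tokens).most_common(3))
```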

In Week Three, you will engage with two of the most standard text classification approaches, viz. naïve Bayes and support vector machine classification. Building on some of the topics you might have encountered in Course 3 of this specialization, you will learn about deriving features from text and using the NLTK and scikit-learn toolkits for supervised text classification. You will also be introduced to another natural language challenge: analyzing sentiment from text reviews. For this week’s assignment, you will train a classifier to distinguish spam messages from non-spam (“ham”) messages. Through this assignment, you will also get hands-on experience with cross-validation and the training and testing phases of supervised classification tasks.
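
A toy sketch of both approaches on a handful of made-up messages (the assignment uses a real SMS spam dataset and a different feature set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; 1 = spam, 0 = ham.
texts = ["WINNER!! Claim your free prize now",
         "Are we still on for lunch today?",
         "URGENT: your account has been selected for a cash reward",
         "Can you send me the notes from class?",
         "Free entry in a weekly competition, text WIN to enter",
         "See you at the gym tonight"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=2, random_state=0, stratify=labels)

# Compare naive Bayes and a linear SVM over the same TF-IDF features.
for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    print(type(clf).__name__, "test accuracy:", model.score(X_test, y_test))
```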

In Week Four, you will be introduced to more advanced text mining approaches: topic modeling and semantic text similarity. You will also explore advanced information extraction topics, such as named entity recognition, building on concepts you have seen through Module One and Module Three of this course. The final assignment lets you explore the semantic similarity of text snippets and build topic models using the gensim package. You will also experience the practical challenge of making sense of topic models in real life.
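
A minimal topic-modeling sketch with gensim on a toy corpus; real topic models need far more documents (and better preprocessing) to produce interpretable topics:

```python
from gensim import corpora, models

# Tiny toy corpus: two documents about text classification, two about football.
docs = [
    "machine learning models classify text documents",
    "deep learning improves text classification accuracy",
    "the game ended with a late goal and a penalty",
    "fans celebrated the goal after the final whistle",
]
tokenized = [doc.lower().split() for doc in docs]

# Build the dictionary and bag-of-words corpus, then fit an LDA topic model.
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```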

Module 4: Applied Social Network Analysis in Python

Week One introduces you to different types of networks in the real world and why we study them. You will cover the basic elements of networks, such as nodes, edges, and attributes, and different types of networks, such as directed, undirected, weighted, signed, and bipartite. You will also learn how to represent and manipulate networked data using the NetworkX library. The assignment will give you an opportunity to use NetworkX to analyze a networked dataset of employees in a small company, their relationships, and their movie preferences for an upcoming movie night.
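
A minimal sketch of these building blocks in NetworkX, using made-up employees and movies rather than the assignment's dataset:

```python
import networkx as nx

# A small undirected graph of employees with node attributes and weighted edges.
G = nx.Graph()
G.add_node("Alice", role="engineer")
G.add_node("Bob", role="designer")
G.add_node("Carol", role="manager")
G.add_edge("Alice", "Bob", weight=3)   # e.g. number of shared projects
G.add_edge("Bob", "Carol", weight=1)

print("Nodes:", G.nodes(data=True))
print("Edges:", G.edges(data=True))
print("Degree of Bob:", G.degree("Bob"))

# A bipartite graph relating employees to the movies they want to watch.
B = nx.Graph()
B.add_nodes_from(["Alice", "Bob"], bipartite=0)
B.add_nodes_from(["The Matrix", "Up"], bipartite=1)
B.add_edges_from([("Alice", "The Matrix"), ("Bob", "Up")])
print("Bipartite:", nx.is_bipartite(B))
```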

In Week Two you will learn how to analyze the connectivity of a network based on measures of distance, reachability, and redundancy of paths between nodes. This type of analysis will allow you to explore the robustness of a network when it is exposed to random or targeted attacks, such as the removal of nodes and edges. In the assignment, you will practice using NetworkX to compute measures of connectivity of a network of email communication among the employees of a mid-size manufacturing company.
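
A short sketch of these connectivity measures, using the karate club graph bundled with NetworkX in place of the assignment's email network:

```python
import networkx as nx

G = nx.karate_club_graph()  # a classic small social network bundled with NetworkX

# Distance and reachability.
print("Connected:", nx.is_connected(G))
print("Average shortest path length:",
      round(nx.average_shortest_path_length(G), 3))
print("Diameter:", nx.diameter(G))

# Redundancy of paths between nodes.
print("Node connectivity:", nx.node_connectivity(G))
print("Edge connectivity:", nx.edge_connectivity(G))

# A simple targeted attack: remove the highest-degree node, then re-check.
hub = max(G.degree, key=lambda pair: pair[1])[0]
G.remove_node(hub)
print("Still connected after removing the top hub:", nx.is_connected(G))
```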

In Week Three you will explore ways of measuring the importance or centrality of a node in a network. You will cover several different centrality measures, including Degree, Closeness, and Betweenness centrality, PageRank, and Hubs and Authorities. You will learn about the assumptions each measure makes, the algorithms we can use to compute them, and the different functions available in NetworkX to measure centrality. You will also compare the rankings of nodes by centrality produced by the different measures. In the assignment, you will practice choosing the most appropriate centrality measure in a real-world setting, where you are tasked with choosing a person from a social network who should be given a promotional voucher in order to maximize the impact of the promotion on the network.
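
A small sketch comparing how these measures rank nodes, again using the karate club graph as a stand-in for the assignment's social network:

```python
import networkx as nx

G = nx.karate_club_graph()

measures = {
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "pagerank": nx.pagerank(G),
}
hubs, authorities = nx.hits(G)
measures["authorities"] = authorities

# Compare the top-ranked nodes under each measure.
for name, scores in measures.items():
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(f"{name:>12}: top nodes {top}")
```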

In Week Four you will explore the evolution of networks over time. You will learn about different models that generate networks with realistic features such as the Preferential Attachment Model and Small World Networks. You will also explore the link prediction problem, where you will learn useful features that can predict whether a pair of disconnected nodes will be connected in the future. In the assignment, you will be challenged to identify which model generated a given network. Additionally, you will have the opportunity to combine different concepts of the course by predicting the salary, position, and future connections of the employees of a company using their logs of email exchanges.
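
A brief sketch of both ideas, generating a Preferential Attachment (Barabási–Albert) network and a small-world (Watts–Strogatz) network and computing two standard link-prediction features for an arbitrary node pair:

```python
import networkx as nx

# Preferential Attachment model: degrees tend to be heavy-tailed.
G = nx.barabasi_albert_graph(n=1000, m=3, seed=0)
print("Max degree in BA graph:", max(d for _, d in G.degree()))

# Small-world model for comparison: much higher clustering.
W = nx.watts_strogatz_graph(n=1000, k=6, p=0.05, seed=0)
print("Average clustering: BA =", round(nx.average_clustering(G), 3),
      " WS =", round(nx.average_clustering(W), 3))

# Link-prediction features for a candidate pair of nodes.
pair = [(0, 50)]
print("Jaccard coefficient:", list(nx.jaccard_coefficient(G, pair)))
print("Preferential attachment score:", list(nx.preferential_attachment(G, pair)))
```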
