
scikit-learn-videos's Introduction

Introduction to Machine Learning with scikit-learn

This video series will teach you how to solve Machine Learning problems using Python's popular scikit-learn library. There are 10 video tutorials totaling 4.5 hours, each with a corresponding Jupyter notebook.

You can watch the entire series on YouTube and view all of the notebooks using nbviewer.

The series is also available as a free online course that includes updated content, quizzes, and a certificate of completion.

Watch the first tutorial video

Note: The notebooks in this repository have been updated to use Python 3.9.1 and scikit-learn 0.23.2. The original notebooks (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the archive branch. You can read about how I updated the code in this blog post.

Table of Contents

  1. What is Machine Learning, and how does it work? (video, notebook)

    • What is Machine Learning?
    • What are the two main categories of Machine Learning?
    • What are some examples of Machine Learning?
    • How does Machine Learning "work"?
  2. Setting up Python for Machine Learning: scikit-learn and Jupyter Notebook (video, notebook)

    • What are the benefits and drawbacks of scikit-learn?
    • How do I install scikit-learn?
    • How do I use the Jupyter Notebook?
    • What are some good resources for learning Python?
  3. Getting started in scikit-learn with the famous iris dataset (video, notebook)

    • What is the famous iris dataset, and how does it relate to Machine Learning?
    • How do we load the iris dataset into scikit-learn?
    • How do we describe a dataset using Machine Learning terminology?
    • What are scikit-learn's four key requirements for working with data?
  4. Training a Machine Learning model with scikit-learn (video, notebook)

    • What is the K-nearest neighbors classification model?
    • What are the four steps for model training and prediction in scikit-learn? (see the sketch after this table of contents)
    • How can I apply this pattern to other Machine Learning models?
  5. Comparing Machine Learning models in scikit-learn (video, notebook)

    • How do I choose which model to use for my supervised learning task?
    • How do I choose the best tuning parameters for that model?
    • How do I estimate the likely performance of my model on out-of-sample data?
  6. Data science pipeline: pandas, seaborn, scikit-learn (video, notebook)

    • How do I use the pandas library to read data into Python?
    • How do I use the seaborn library to visualize data?
    • What is linear regression, and how does it work?
    • How do I train and interpret a linear regression model in scikit-learn?
    • What are some evaluation metrics for regression problems?
    • How do I choose which features to include in my model?
  7. Cross-validation for parameter tuning, model selection, and feature selection (video, notebook)

    • What is the drawback of using the train/test split procedure for model evaluation?
    • How does K-fold cross-validation overcome this limitation?
    • How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
    • What are some possible improvements to cross-validation?
  8. Efficiently searching for optimal tuning parameters (video, notebook)

    • How can K-fold cross-validation be used to search for an optimal tuning parameter?
    • How can this process be made more efficient?
    • How do you search for multiple tuning parameters at once?
    • What do you do with those tuning parameters before making real predictions?
    • How can the computational expense of this process be reduced?
  9. Evaluating a classification model (video, notebook)

    • What is the purpose of model evaluation, and what are some common evaluation procedures?
    • How is classification accuracy used, and what are its limitations?
    • How does a confusion matrix describe the performance of a classifier?
    • What metrics can be computed from a confusion matrix?
    • How can you adjust classifier performance by changing the classification threshold?
    • What is the purpose of an ROC curve?
    • How does Area Under the Curve (AUC) differ from classification accuracy?
  10. Building a Machine Learning workflow (video, notebook)

    • Why should you use a Pipeline?
    • How do you encode categorical features with OneHotEncoder?
    • How do you apply OneHotEncoder to selected columns with ColumnTransformer?
    • How do you build and cross-validate a Pipeline?
    • How do you make predictions on new data using a Pipeline?
    • Why should you use scikit-learn (rather than pandas) for preprocessing?
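
For a quick taste of what tutorial 4 covers, here is a minimal sketch of the four-step model training and prediction pattern in scikit-learn, using the iris dataset and KNeighborsClassifier; the n_neighbors value and the sample observation are illustrative, not necessarily the ones used in the video:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the iris data as a feature matrix X and a response vector y
X, y = load_iris(return_X_y=True)

# Step 1: import the class you plan to use (done above)
# Step 2: instantiate the estimator (n_neighbors=5 is an illustrative choice)
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the model with the data
knn.fit(X, y)

# Step 4: predict the response for a new observation (four feature values)
print(knn.predict([[3, 5, 4, 2]]))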

Bonus Video

At the PyCon 2016 conference, I taught a 3-hour tutorial that builds upon this video series and focuses on text-based data. You can watch the tutorial video on YouTube.

Here are the topics I covered:

  1. Model building in scikit-learn (refresher)
  2. Representing text as numerical data
  3. Reading a text-based dataset into pandas
  4. Vectorizing our dataset (see the sketch below)
  5. Building and evaluating a model
  6. Comparing models
  7. Examining a model for further insight
  8. Practicing this workflow on another dataset
  9. Tuning the vectorizer (discussion)

Visit this GitHub repository to access the tutorial notebooks and many other recommended resources.
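
As a small taste of topics 2-4 above, here is a minimal sketch of vectorizing text with scikit-learn's CountVectorizer; the example sentences are made up for illustration and are not necessarily the tutorial's dataset:

from sklearn.feature_extraction.text import CountVectorizer

# a tiny illustrative corpus
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# learn the vocabulary and build the document-term matrix
vect = CountVectorizer()
simple_train_dtm = vect.fit_transform(simple_train)

# rows are documents, columns are counts of the learned vocabulary tokens
print(sorted(vect.vocabulary_))
print(simple_train_dtm.toarray())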

scikit-learn-videos's People

Contributors

justmarkham, prashantkhurana


scikit-learn-videos's Issues

Error: 'unable to contact kernel' on Jupyter 3.2.x +

When running these notebooks on Jupyter 3.2.x or 4.2.x, I get the following error:

Failed to start the kernel

The 'None' kernel is not available. Please pick another suitable kernel instead, or install that kernel.

Note: the kernel runs fine when I create a new notebook, so the problem seems to be an incompatibility between these notebooks and Jupyter.

Environment

localhost

Here is my local environment:

The version of the notebook server is 4.2.1 and is running on:

Python 3.5.1 |Anaconda 4.1.0 (64-bit)| (default, Jun 15 2016, 15:32:45)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

Current Kernel Information:

unable to contact kernel

mybinder.org

The version of the notebook server is 3.2.0-8b0eef4 and is running on:

Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

TypeError in a formatted print statement

The line

print('{:^9} {} {:^25}'.format(iteration, data[0], data[1]))

raises a TypeError. Changing it to

print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))

solves the problem.
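
For context, this error typically occurs when data[1] is an object such as a list, which does not support the {:^25} alignment spec; the values below are hypothetical stand-ins, since the issue does not show what data actually contains:

# hypothetical stand-ins for iteration, data[0], and data[1]
iteration, data = 1, ('abc', [0.1, 0.2])

# a list rejects the {:^25} alignment spec, raising a TypeError like
# "unsupported format string passed to list.__format__"
try:
    print('{:^9} {} {:^25}'.format(iteration, data[0], data[1]))
except TypeError as e:
    print(e)

# converting the value to a string first avoids the error
print('{:^9} {} {:^25}'.format(iteration, data[0], str(data[1])))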

Notebook links broken?

Hi! I just noticed that all the links to IPython notebooks show a 400 error when clicked on.

Confusion Matrix setup in scikit-learn

Hi!
I wanted to start off by saying that your tutorials and videos are really great! so clear and simple!

I've been working on a binary classification problem for my school with scikit-learn, and I have been scratching my head over how it displays the confusion matrix. For instance, I have this as output:

[[30  5]
 [ 2 42]]

I noticed by looking at the classification report that scikit-learn by default outputs the negative class first. This leads me to understand that the first row is the negative class and the second is the positive class. However, I don't understand how to interpret what each number stands for in terms of TP, FP, TN, and FN.

TN (30)  FN (5)
FP (2)   TP (42)

Is this a correct representation of the matrix above?

Thanks a bunch!
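
For reference, scikit-learn's confusion_matrix lists the negative class first, so for a binary problem the layout is [[TN, FP], [FN, TP]]; a minimal sketch with illustrative labels:

from sklearn.metrics import confusion_matrix

# illustrative labels for a binary problem (0 = negative, 1 = positive)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# rows are true classes, columns are predicted classes, negative class first:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# ravel() flattens the matrix so the four counts can be unpacked directly
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

Applied to the matrix in the question, that layout gives TN=30, FP=5, FN=2, and TP=42.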

In 09_classification_metrics

Running this:

# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()

I get this:

AttributeError Traceback (most recent call last)
in ()
1 # examine the class distribution of the testing set (using a Pandas Series method)
----> 2 y_test.value_counts()

AttributeError: 'numpy.ndarray' object has no attribute 'value_counts'
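
A likely cause is that y_test is a NumPy array rather than a pandas Series, since value_counts() only exists on Series objects; a minimal workaround sketch (the array below is just a stand-in for the notebook's y_test):

import numpy as np
import pandas as pd

# stand-in for the y_test array from the notebook
y_test = np.array([0, 1, 0, 0, 1, 1, 0])

# value_counts() is a pandas Series method, so wrap the NumPy array first
print(pd.Series(y_test).value_counts())

# or count the classes directly in NumPy
values, counts = np.unique(y_test, return_counts=True)
print(dict(zip(values, counts)))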

Meaning of cross_val_score output

Hi.

I have a question about scikit-learn-videos/07_cross_validation.ipynb. The reported classification accuracy usually has several digits after the decimal point, e.g. 0.966666666667. If I multiply this value by the total number of observations, i.e. 25, I get 24.1666666667. What does this mean? That 24.1666666667 observations were classified correctly? Shouldn't it give me a whole number, such as 24?
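
If the value comes from averaging the cross-validation scores, it is the mean of the per-fold accuracies, so multiplying it by the total number of observations need not give a whole number: only within a single fold is accuracy times fold size a whole count of correct predictions. A minimal sketch (using the iris data rather than the notebook's exact setup):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 10-fold cross-validation returns one accuracy score per fold
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print(scores)         # each score is (correct predictions in that fold) / (fold size)
print(scores.mean())  # the single reported number is the mean across the 10 folds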

Issue with skvideo.io

Can someone let me know what the issue is here? I installed it with Anaconda and tried using it to detect cars in a video with a Haar Cascade classifier.
Cheers.

/Users/mohan/anaconda3/lib/python3.5/site-packages/skvideo/__init__.py:356: UserWarning: avconv/avprobe not found in path:
warnings.warn("avconv/avprobe not found in path: " + str(path), UserWarning)

09_classification_metrics.ipynb data URL broken

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)

Temporary solution (N.B.: comment='#' in read_csv is important):

url = 'https://gist.githubusercontent.com/ktisha/c21e73a1bd1700294ef790c56c8aec1f/raw/819b69b5736821ccee93d05b51de0510bea00294/pima-indians-diabetes.csv'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names, comment='#')

Self-Organizing Maps in Scikit Learn

Hello!

First of all, thank you so much for this series and all the resources you have mentioned along with it. I started out with machine learning a few months ago, and after reading and searching online I was still not able to grasp the core of machine learning. Your videos made it really simple and easy to understand! Most of my confusion cleared up! I hope you keep making these videos.

Anyway, can you please make, or point me to, a video tutorial or other great resource on Self-Organizing Maps in scikit-learn? It would be a great help!

Thank you again!

Show accuracy

Hi, I'm using KNN for my research, but I don't know how to display the accuracy of a result.
This is my code:

# In[38]:
from sklearn import neighbors, metrics
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd
import sys
import json
import math

data    = pd.read_excel('dataset.xlsx')
data    = np.array(data.as_matrix())
# In[40]:
knn=neighbors.KNeighborsClassifier(n_neighbors=5)
# In[41]:
X = data[:,:-3]
Y = data[:,-1:]
Y = np.zeros(len(Y))
for i in range(0,len(Y)):
    if data[i,5] == 0:
        Y[i] = 0
    elif data[i,5] == 1:
        Y[i] = 1
    elif data[i,5] == 2:
        Y[i] = 2
    elif data[i,5] == 3:
        Y[i] = 3
    elif data[i,5] == 4:
        Y[i] = 4
# In[42]:
knn.fit(X, Y)
# In[49]:
result = knn.predict([[220.4, 6.39,1855]])
print(result)
# result = knn.predict(X)
# print(metrics.accuracy_score(Y[2000], result))

This is my dataset:
(screenshot of the dataset omitted)
Thank you!
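
In case it helps, accuracy is usually measured on held-out data with train_test_split and metrics.accuracy_score; here is a minimal sketch using the iris data as a stand-in for the arrays built from dataset.xlsx:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# stand-in data; in the question, X and Y come from dataset.xlsx
X, y = load_iris(return_X_y=True)

# hold out 30% of the rows so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# compare predictions on the held-out rows to the true labels
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

Note that calling knn.predict on the same rows used for fitting and scoring those predictions measures training accuracy, which is usually overly optimistic.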
