
This is a first attempt at MNIST classification for the Kaggle competition.

It illustrates the use of PCA together with several machine learning algorithms (e.g., RandomForest, SVM, and k-nearest neighbors).

By N. Satsawat

Date: 24-Apr-2017

# Import necessary libraries

import numpy as np
import pandas as pd
import datetime 
import seaborn as sb
sb.set_style("dark")

import matplotlib.pyplot as plt
%pylab inline

import os, sys
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
from sklearn import svm
def now():
    """Return the current timestamp as a formatted string."""
    return datetime.datetime.now().strftime("%m/%d/%y %H:%M:%S")
# Read in the training dataset
train = pd.read_csv('train.csv')
print("The size of the training data: {}".format(train.shape))
print("First 5 records: \n")
print(train.head(5))

# This will list all the column names
# print(list(train))

target = train["label"]
train = train.drop("label", axis=1)
The size of the training data: (42000, 785)
First 5 records: 

   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  \
0      1       0       0       0       0       0       0       0       0   
1      0       0       0       0       0       0       0       0       0   
2      1       0       0       0       0       0       0       0       0   
3      4       0       0       0       0       0       0       0       0   
4      0       0       0       0       0       0       0       0       0   

   pixel8    ...     pixel774  pixel775  pixel776  pixel777  pixel778  \
0       0    ...            0         0         0         0         0   
1       0    ...            0         0         0         0         0   
2       0    ...            0         0         0         0         0   
3       0    ...            0         0         0         0         0   
4       0    ...            0         0         0         0         0   

   pixel779  pixel780  pixel781  pixel782  pixel783  
0         0         0         0         0         0  
1         0         0         0         0         0  
2         0         0         0         0         0  
3         0         0         0         0         0  
4         0         0         0         0         0  

[5 rows x 785 columns]
# Understand the data better
print("Label counts in the training data:\n{}".format(target.value_counts(sort=True)))

# What about the actual pixel values?
figure(figsize=(10,10))
for digit_num in range(0,4):
    subplot(2,2,digit_num+1)
    plt.hist(train.iloc[digit_num])
    suptitle("Before transform")

# Normalize the pixel values from [0, 255] to [0, 1]
train /= 255

figure(figsize=(10,10))
for digit_num in range(0,4):
    subplot(2,2,digit_num+1)
    plt.hist(train.iloc[digit_num])
    suptitle("After transform")
Label counts in the training data:
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64

[Figure: pixel-value histograms for four sample digits, before normalization]

[Figure: pixel-value histograms for four sample digits, after normalization]

# Reshape each image from a 784-element vector (1-D) to a 28x28 grid (2-D)
figure(figsize=(5,5))
for digit_num in range(0,25):
    subplot(5,5,digit_num+1)
    grid_data = train.iloc[digit_num].values.reshape(28,28)
    plt.imshow(grid_data, interpolation = "none", cmap = "bone")
    xticks([])
    yticks([])

[Figure: 5x5 grid of the first 25 digit images]

### Create function to evaluate the score of each classification model
def eval_model_classifier(model, data, target, split_ratio):
    trainX, testX, trainY, testY = train_test_split(data, target, train_size=split_ratio, random_state=0)
    model.fit(trainX, trainY)    
    return model.score(testX,testY)
### 1st round: RandomForestClassifier

# Initialise values
num_estimators_array = np.array([1,5,10,50,100,200,500])
num_smpl = 10 # Number of repeated runs per setting
num_grid = len(num_estimators_array)
score_array_mu = np.zeros(num_grid) # Keep the mean score
score_array_sigma = np.zeros(num_grid) # Keep the standard deviation
j=0

print("{}: RandomForestClassification Starts!".format(now()))
for n_estimators in num_estimators_array:
    score_array = np.zeros(num_smpl) # Initialize
    for i in range(0,num_smpl):
        rf_class = RandomForestClassifier(n_estimators = n_estimators, n_jobs=1, criterion="gini")
        score_array[i] = eval_model_classifier(rf_class, train.iloc[0:1000], target.iloc[0:1000], 0.8)
        print("{}: Try {} with n_estimators = {} and score = {}".format(now(), i, n_estimators, score_array[i]))
    score_array_mu[j], score_array_sigma[j] = mean(score_array), std(score_array)
    j=j+1

print("{}: RandomForestClassification Done!".format(now()))
03/03/17 16:04:05: RandomForestClassification Starts!
03/03/17 16:04:05: Try 0 with n_estimators = 1 and score = 0.49
03/03/17 16:04:05: Try 1 with n_estimators = 1 and score = 0.58
03/03/17 16:04:05: Try 2 with n_estimators = 1 and score = 0.525
03/03/17 16:04:05: Try 3 with n_estimators = 1 and score = 0.51
03/03/17 16:04:05: Try 4 with n_estimators = 1 and score = 0.59
03/03/17 16:04:05: Try 5 with n_estimators = 1 and score = 0.565
03/03/17 16:04:05: Try 6 with n_estimators = 1 and score = 0.475
03/03/17 16:04:05: Try 7 with n_estimators = 1 and score = 0.57
03/03/17 16:04:05: Try 8 with n_estimators = 1 and score = 0.51
03/03/17 16:04:05: Try 9 with n_estimators = 1 and score = 0.48
03/03/17 16:04:05: Try 0 with n_estimators = 5 and score = 0.685
03/03/17 16:04:05: Try 1 with n_estimators = 5 and score = 0.73
03/03/17 16:04:05: Try 2 with n_estimators = 5 and score = 0.655
03/03/17 16:04:05: Try 3 with n_estimators = 5 and score = 0.655
03/03/17 16:04:05: Try 4 with n_estimators = 5 and score = 0.725
03/03/17 16:04:06: Try 5 with n_estimators = 5 and score = 0.74
03/03/17 16:04:06: Try 6 with n_estimators = 5 and score = 0.74
03/03/17 16:04:06: Try 7 with n_estimators = 5 and score = 0.68
03/03/17 16:04:06: Try 8 with n_estimators = 5 and score = 0.725
03/03/17 16:04:06: Try 9 with n_estimators = 5 and score = 0.64
03/03/17 16:04:06: Try 0 with n_estimators = 10 and score = 0.83
03/03/17 16:04:06: Try 1 with n_estimators = 10 and score = 0.83
03/03/17 16:04:06: Try 2 with n_estimators = 10 and score = 0.76
03/03/17 16:04:06: Try 3 with n_estimators = 10 and score = 0.785
03/03/17 16:04:06: Try 4 with n_estimators = 10 and score = 0.815
03/03/17 16:04:06: Try 5 with n_estimators = 10 and score = 0.76
03/03/17 16:04:06: Try 6 with n_estimators = 10 and score = 0.815
03/03/17 16:04:06: Try 7 with n_estimators = 10 and score = 0.775
03/03/17 16:04:06: Try 8 with n_estimators = 10 and score = 0.795
03/03/17 16:04:06: Try 9 with n_estimators = 10 and score = 0.8
03/03/17 16:04:07: Try 0 with n_estimators = 50 and score = 0.87
03/03/17 16:04:07: Try 1 with n_estimators = 50 and score = 0.875
03/03/17 16:04:07: Try 2 with n_estimators = 50 and score = 0.885
03/03/17 16:04:07: Try 3 with n_estimators = 50 and score = 0.865
03/03/17 16:04:08: Try 4 with n_estimators = 50 and score = 0.87
03/03/17 16:04:08: Try 5 with n_estimators = 50 and score = 0.86
03/03/17 16:04:08: Try 6 with n_estimators = 50 and score = 0.885
03/03/17 16:04:09: Try 7 with n_estimators = 50 and score = 0.88
03/03/17 16:04:09: Try 8 with n_estimators = 50 and score = 0.88
03/03/17 16:04:09: Try 9 with n_estimators = 50 and score = 0.875
03/03/17 16:04:10: Try 0 with n_estimators = 100 and score = 0.885
03/03/17 16:04:10: Try 1 with n_estimators = 100 and score = 0.9
03/03/17 16:04:11: Try 2 with n_estimators = 100 and score = 0.88
03/03/17 16:04:12: Try 3 with n_estimators = 100 and score = 0.89
03/03/17 16:04:12: Try 4 with n_estimators = 100 and score = 0.9
03/03/17 16:04:13: Try 5 with n_estimators = 100 and score = 0.89
03/03/17 16:04:13: Try 6 with n_estimators = 100 and score = 0.895
03/03/17 16:04:14: Try 7 with n_estimators = 100 and score = 0.89
03/03/17 16:04:14: Try 8 with n_estimators = 100 and score = 0.88
03/03/17 16:04:15: Try 9 with n_estimators = 100 and score = 0.88
03/03/17 16:04:16: Try 0 with n_estimators = 200 and score = 0.9
03/03/17 16:04:17: Try 1 with n_estimators = 200 and score = 0.91
03/03/17 16:04:18: Try 2 with n_estimators = 200 and score = 0.89
03/03/17 16:04:19: Try 3 with n_estimators = 200 and score = 0.88
03/03/17 16:04:20: Try 4 with n_estimators = 200 and score = 0.895
03/03/17 16:04:21: Try 5 with n_estimators = 200 and score = 0.885
03/03/17 16:04:22: Try 6 with n_estimators = 200 and score = 0.905
03/03/17 16:04:23: Try 7 with n_estimators = 200 and score = 0.9
03/03/17 16:04:24: Try 8 with n_estimators = 200 and score = 0.9
03/03/17 16:04:26: Try 9 with n_estimators = 200 and score = 0.88
03/03/17 16:04:28: Try 0 with n_estimators = 500 and score = 0.895
03/03/17 16:04:31: Try 1 with n_estimators = 500 and score = 0.88
03/03/17 16:04:34: Try 2 with n_estimators = 500 and score = 0.91
03/03/17 16:04:37: Try 3 with n_estimators = 500 and score = 0.91
03/03/17 16:04:40: Try 4 with n_estimators = 500 and score = 0.905
03/03/17 16:04:43: Try 5 with n_estimators = 500 and score = 0.925
03/03/17 16:04:47: Try 6 with n_estimators = 500 and score = 0.89
03/03/17 16:04:50: Try 7 with n_estimators = 500 and score = 0.91
03/03/17 16:04:54: Try 8 with n_estimators = 500 and score = 0.895
03/03/17 16:04:57: Try 9 with n_estimators = 500 and score = 0.9
03/03/17 16:04:57: RandomForestClassification Done!
figure(figsize=(7,3))
errorbar(num_estimators_array, score_array_mu, yerr=score_array_sigma, fmt='k.-')
xscale("log")
xlabel("number of estimators",size = 16)
ylabel("accuracy",size = 16)
xlim(0.9,600)
ylim(0.5,0.95)
title("Random Forest Classifier", size = 18)
grid(which="both")

[Figure: Random Forest accuracy vs. number of estimators, with error bars]
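As an aside, the manual repeat-and-average loop above can be approximated with scikit-learn's cross_val_score, which returns one score per fold. This is only a sketch; the 1,000-row slice, 5 folds, and n_estimators=100 are illustrative choices, not what was run above.

from sklearn.model_selection import cross_val_score

# Sketch only: 5-fold cross-validation gives a mean and spread in one call.
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100),
                            train.iloc[0:1000], target.iloc[0:1000], cv=5)
print("CV accuracy: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))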

### 2nd round: SVM with a linear kernel

C_array = np.array([0.1, 0.5, 1, 5, 10])
score_array = np.zeros(len(C_array))
i=0
for C_val in C_array:
    svc_class = svm.SVC(kernel='linear', random_state=1, C = C_val)
    score_array[i] = eval_model_classifier(svc_class, train.iloc[0:1000], target.iloc[0:1000], 0.8)
    i=i+1

# With a single run per C value there is no per-point spread to plot,
# so the scores are plotted directly.
figure(figsize=(7,3))
plot(C_array, score_array, 'k.-')
xlabel("C",size = 16)
ylabel("accuracy",size = 16)
title("SVM Classifier (Linear)", size = 18)
grid(which="both")

[Figure: linear SVM accuracy as a function of C]

### 3rd round: SVM with an RBF kernel

# Note:
# Gamma is the RBF kernel coefficient, K(x, x') = exp(-gamma * ||x - x'||^2).
# The higher it is, the more closely the model fits the training data, which can cause overfitting.

gamma_array = np.array([0.001, 0.01, 0.1, 1, 10])
score_array = np.zeros(len(gamma_array))
i=0
for gamma_val in gamma_array:
    svc_class = svm.SVC(kernel='rbf', random_state=1, gamma = gamma_val)
    score_array[i] = eval_model_classifier(svc_class, train.iloc[0:1000], target.iloc[0:1000], 0.8)
    i=i+1

# As above, a single run per gamma value gives no spread, so plot the scores directly.
figure(figsize=(10,5))
plot(gamma_array, score_array, 'k.-')
xscale('log')
xlabel("Gamma",size = 16)
ylabel("accuracy",size = 16)
ylim(0.1, 1)
title("SVM Classifier (RBF)", size = 18)
grid(which="both")

[Figure: RBF SVM accuracy as a function of gamma]
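For a less manual search, GridSearchCV can tune C and gamma jointly with cross-validation. A sketch under illustrative grid values, not part of the original run:

from sklearn.model_selection import GridSearchCV

# Sketch: 3-fold cross-validated grid search over C and gamma.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}
grid = GridSearchCV(svm.SVC(kernel='rbf', random_state=1), param_grid, cv=3)
grid.fit(train.iloc[0:1000], target.iloc[0:1000])
print("Best params: {}, best CV score: {:.3f}".format(grid.best_params_, grid.best_score_))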

# Some PCA

pca = PCA(n_components=16)
pca.fit(train)
transform = pca.transform(train)

figure(figsize=(15,10))
plt.scatter(transform[:,0], transform[:,1], s=20, c = target, cmap = "nipy_spectral", edgecolor = "None")
plt.colorbar()
clim(0,9)

xlabel("PC1")
ylabel("PC2")

[Figure: training images projected onto the first two principal components, colored by digit label]

n_components_array = np.array([2, 4, 8, 16, 32, 64, 128])
vr = np.zeros(len(n_components_array))

i=0
for n_components in n_components_array:
    pca = PCA(n_components=n_components)
    pca.fit(train)
    vr[i] = sum(pca.explained_variance_ratio_)
    i=i+1

figure(figsize=(8,4))
plot(n_components_array, vr, 'k.-')
xscale("log")
ylim(9e-2,1.1)
yticks(linspace(0.2,1.0,9))
xlim(1.5, 150)
grid(which="both")
xlabel("number of PCA components",size=16)
ylabel("cumulative explained variance ratio",size=16)

[Figure: cumulative explained variance ratio vs. number of PCA components]
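Instead of scanning component counts by hand, PCA also accepts a float between 0 and 1 as n_components and keeps just enough components to explain that fraction of the variance. A minimal sketch, with 0.90 as an illustrative threshold:

# Sketch: let PCA pick the component count for ~90% explained variance.
pca_auto = PCA(n_components=0.90)
pca_auto.fit(train)
print("Components needed for 90% variance:", pca_auto.n_components_)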

# K-nearest neighbors, on the 16-component PCA projection computed above
n_neighbors_array = range(2,20)
weight_array = ['uniform','distance']
score_array = np.zeros(len(n_neighbors_array))

for weight in weight_array:
    i=0
    for n in n_neighbors_array:
        nbrs = KNeighborsClassifier(n_neighbors=n, weights=weight)
        score_array[i] = eval_model_classifier(nbrs, transform, target, 0.8)
        print("{}: for n_neighbors = {} and weight ={} produces score = {}".format(now(), n, weight, score_array[i]))
        i+=1

print("Done")
03/03/17 16:24:59: for n_neighbors = 2 and weight =uniform produces score = 0.9508333333333333
03/03/17 16:25:03: for n_neighbors = 3 and weight =uniform produces score = 0.96
03/03/17 16:25:08: for n_neighbors = 4 and weight =uniform produces score = 0.9607142857142857
03/03/17 16:25:13: for n_neighbors = 5 and weight =uniform produces score = 0.9603571428571429
03/03/17 16:25:19: for n_neighbors = 6 and weight =uniform produces score = 0.9592857142857143
03/03/17 16:25:25: for n_neighbors = 7 and weight =uniform produces score = 0.9591666666666666
03/03/17 16:25:31: for n_neighbors = 8 and weight =uniform produces score = 0.9582142857142857
03/03/17 16:25:37: for n_neighbors = 9 and weight =uniform produces score = 0.958452380952381
03/03/17 16:25:44: for n_neighbors = 10 and weight =uniform produces score = 0.9576190476190476
03/03/17 16:25:50: for n_neighbors = 11 and weight =uniform produces score = 0.9570238095238095
03/03/17 16:25:57: for n_neighbors = 12 and weight =uniform produces score = 0.9578571428571429
03/03/17 16:26:04: for n_neighbors = 13 and weight =uniform produces score = 0.9553571428571429
03/03/17 16:26:11: for n_neighbors = 14 and weight =uniform produces score = 0.955952380952381
03/03/17 16:26:19: for n_neighbors = 15 and weight =uniform produces score = 0.9530952380952381
03/03/17 16:26:26: for n_neighbors = 16 and weight =uniform produces score = 0.9544047619047619
03/03/17 16:26:34: for n_neighbors = 17 and weight =uniform produces score = 0.9545238095238096
03/03/17 16:26:42: for n_neighbors = 18 and weight =uniform produces score = 0.9536904761904762
03/03/17 16:26:51: for n_neighbors = 19 and weight =uniform produces score = 0.9532142857142857
03/03/17 16:26:56: for n_neighbors = 2 and weight =distance produces score = 0.9557142857142857
03/03/17 16:27:03: for n_neighbors = 3 and weight =distance produces score = 0.9611904761904762
03/03/17 16:27:08: for n_neighbors = 4 and weight =distance produces score = 0.9615476190476191
03/03/17 16:27:14: for n_neighbors = 5 and weight =distance produces score = 0.9614285714285714
03/03/17 16:27:19: for n_neighbors = 6 and weight =distance produces score = 0.9610714285714286
03/03/17 16:27:24: for n_neighbors = 7 and weight =distance produces score = 0.9602380952380952
03/03/17 16:27:31: for n_neighbors = 8 and weight =distance produces score = 0.9608333333333333
03/03/17 16:27:38: for n_neighbors = 9 and weight =distance produces score = 0.9598809523809524
03/03/17 16:27:44: for n_neighbors = 10 and weight =distance produces score = 0.9604761904761905
03/03/17 16:27:50: for n_neighbors = 11 and weight =distance produces score = 0.9582142857142857
03/03/17 16:27:57: for n_neighbors = 12 and weight =distance produces score = 0.9596428571428571
03/03/17 16:28:06: for n_neighbors = 13 and weight =distance produces score = 0.9572619047619048
03/03/17 16:28:13: for n_neighbors = 14 and weight =distance produces score = 0.9579761904761904
03/03/17 16:28:20: for n_neighbors = 15 and weight =distance produces score = 0.9547619047619048
03/03/17 16:28:28: for n_neighbors = 16 and weight =distance produces score = 0.955595238095238
03/03/17 16:28:35: for n_neighbors = 17 and weight =distance produces score = 0.955595238095238
03/03/17 16:28:44: for n_neighbors = 18 and weight =distance produces score = 0.955
03/03/17 16:28:53: for n_neighbors = 19 and weight =distance produces score = 0.955
Done
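The printed scores above have to be compared by eye. As an illustrative sketch (not part of the original run, and it refits every combination, so it repeats the work above), the scores can be recorded per (weight, n_neighbors) pair and the best pair picked programmatically:

# Sketch: record each configuration's score and take the argmax.
knn_scores = {}
for weight in weight_array:
    for n in n_neighbors_array:
        nbrs = KNeighborsClassifier(n_neighbors=n, weights=weight)
        knn_scores[(weight, n)] = eval_model_classifier(nbrs, transform, target, 0.8)

best_weight, best_n = max(knn_scores, key=knn_scores.get)
print("Best: weights={}, n_neighbors={}, score={:.4f}".format(
    best_weight, best_n, knn_scores[(best_weight, best_n)]))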
# Train the final model (PCA + KNN) on the full training set

pca = PCA(n_components=32)
pca.fit(train)
transform = pca.transform(train)

KNNmodel = KNeighborsClassifier(n_neighbors=4, weights='distance').fit(transform, target)
print(KNNmodel)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
           weights='distance')
test = pd.read_csv("test.csv")
test /= 255
test_transform = pca.transform(test)
y_pred = KNNmodel.predict(test_transform)
test['Label'] = pd.Series(y_pred)
test['ImageId'] = test.index + 1
sub = test[['ImageId', 'Label']]
print(test['Label'].value_counts(sort=True))
1    3233
7    2906
9    2805
2    2805
0    2779
6    2762
4    2741
3    2729
8    2700
5    2540
Name: Label, dtype: int64
sub.to_csv('submission_knn.csv', index=False)
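As a closing note, the PCA and KNN steps could be wrapped in a single scikit-learn Pipeline so the same transform is always applied to training and test data. A sketch mirroring the final model above; it re-reads test.csv because the test frame was modified with Label/ImageId columns:

from sklearn.pipeline import Pipeline

# Sketch: bundle dimensionality reduction and classification in one estimator.
knn_pipeline = Pipeline([
    ('pca', PCA(n_components=32)),
    ('knn', KNeighborsClassifier(n_neighbors=4, weights='distance')),
])
knn_pipeline.fit(train, target)
test_pixels = pd.read_csv("test.csv") / 255  # raw pixel columns only
print(knn_pipeline.predict(test_pixels)[:10])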
