
CNN-AlexNet

πŸ•΅πŸ» Model 1: AlexNet : Image Classification

Paper: ImageNet Classification with Deep Convolutional Neural Networks | Talk: NIPS 2012 | Slides: link

2012 ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) Winner.

It is a benchmark competition where teams across the world compete to classify, localize, detect ... images of 1000 categories, taken from the ImageNet dataset. The ImageNet dataset holds 15M images in 22K categories, but for this contest 1.2M images in 1K categories were chosen. The goal was classification, i.e., make 5 guesses about the label for an image. Team "SuperVision" (AlexNet) achieved a top-5 test error rate of 15.4% (the next best entry achieved an error of 26.2%), more than 10.8 percentage points ahead of the runner-up. This was a huge success. Check the ILSVRC 2012 results.

This paper is important as it stands as a stepping stone for CNNs in the Computer Vision community. It was record-breaking, new and exciting.

Overview

AlexNet is a Convolutional Neural Network architecture, introduced in 2012 by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton. It has 7 hidden weight layers and contains ● 650,000 neurons ● 60,000,000 parameters ● 630,000,000 connections. In simple terms, it is a model to correctly classify images. Later, in 2014, Alex once again showed a unique way to parallelize CNNs in his paper, "One weird trick for parallelizing convolutional neural networks".

Architecture:

AlexNet contains only 8 weight layers: the first 5 are convolutional layers, followed by 3 fully connected layers. It has max-pooling layers and dropout layers in between. A simple skeleton looks like:

But wait,

What are Convolutional, Fully Connected, Max-pooling (P), Dropout & Normalization (N) layers? Read it here, where I have explained everything in detail, or you can also read CS231n's blog on CNNs.

The Network had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer always immediately followed by a POOL layer). The Architecture can be summarized as :

( Image ) -> CONV1 -> P1 -> N1 -> CONV2 -> P2 -> N2 -> CONV3 -> CONV4 -> CONV5 -> P3 -> FC6 -> FC7 -> FC8 -> ( Label )

But why does the architecture diagram in the paper look so scary?

It is because the figure shows the training setup as well: training was done on 2 GPUs. One GPU runs the layer parts at the top of the figure while the other runs the layer parts at the bottom. The GPUs communicate only at certain layers. The communication overhead is kept low, and this helps to achieve good overall performance. You can check this slide for future reference. Also, these comparisons are handy.

model.summary():

Input Image size : 227 x 227 x 3
(the paper says 224 x 224, but there is some padding/cropping going on; the arithmetic only works out with 227)

● CONV1
Output (from Conv1): 55 x 55 x 96                      // 55 = (227 - 11)/4 + 1 = (image size - filter size)/stride + 1
First Layer Conv1 has 96 11x11x3 filters at stride 4, pad 0

Output (from Pool1): 27 x 27 x 96
Max Pool 1 has 3 x 3 filter applied at stride 2

Output ( from Normalization Layer ): 27 x 27 x 96
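To make the size arithmetic easy to check for every layer in this summary, here is a minimal Python helper (the function name out_size is just for illustration) implementing the formula output = (input - filter + 2*pad)/stride + 1:

```python
# Spatial output size of a convolution / pooling layer:
# out = (in - filter + 2*pad) / stride + 1
def out_size(in_size, filter_size, stride, pad=0):
    return (in_size - filter_size + 2 * pad) // stride + 1

print(out_size(227, 11, 4))        # CONV1 : 55
print(out_size(55, 3, 2))          # Pool1 : 27
print(out_size(27, 5, 1, pad=2))   # CONV2 : 27
```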

●CONV2

Output (from Conv2): 27 x 27 x 256  
Second Layer Conv2 has 256 5x5x48 filters at stride 1, pad 2

Output (from Pool2): 13 x 13 x 256
Max Pool 2 has 3 x 3 filter applied at stride 2

Output ( from Normalization Layer ): 13 x 13 x 256

●CONV3

Output (from Conv3): 13 x 13 x 384
Third Layer Conv3 has 384 3x3x256 filters at stride 1, pad 1

●CONV4

Output (from Conv4): 13 x 13 x 384
Fourth Layer Conv4 has 384 3x3x192 filters at stride 1, pad 1

●CONV5

Output (from Conv5): 13 x 13 x 256
Fifth Layer Conv5 has 256 3x3x192 filters at stride 1, pad 1

Output (from Pool3): 6 x 6 x 256
Max Pool 3 has 3 x 3 filter applied at stride 2


●FC6
Fully Connected Layer 6 : 4096 neurons

●FC7
Fully Connected Layer 7 : 4096 neurons

●FC8
Fully Connected Layer 7 : 1000 neurons ( class scores )
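Putting the summary above into code, here is a minimal tf.keras ('tf' backend) sketch of the single-GPU variant of this stack. The layer shapes follow the numbers listed above; the LRN hyperparameters and the use of a Lambda layer are assumptions (Keras has no built-in LRN layer), and initialization/training details are omitted:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def lrn(x):
    # Local Response Normalization; depth_radius/bias/alpha/beta follow the
    # paper's k=2, n=5, alpha=1e-4, beta=0.75 (assumed mapping).
    return tf.nn.local_response_normalization(
        x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)

model = models.Sequential([
    # CONV1 -> P1 -> N1 : 96 filters, 11x11, stride 4, pad 0
    layers.Conv2D(96, 11, strides=4, activation='relu',
                  input_shape=(227, 227, 3)),                 # 55 x 55 x 96
    layers.MaxPooling2D(pool_size=3, strides=2),              # 27 x 27 x 96
    layers.Lambda(lrn),

    # CONV2 -> P2 -> N2 : 256 filters, 5x5, stride 1, pad 2
    layers.Conv2D(256, 5, padding='same', activation='relu'), # 27 x 27 x 256
    layers.MaxPooling2D(pool_size=3, strides=2),              # 13 x 13 x 256
    layers.Lambda(lrn),

    # CONV3, CONV4, CONV5 : 3x3, stride 1, pad 1
    layers.Conv2D(384, 3, padding='same', activation='relu'), # 13 x 13 x 384
    layers.Conv2D(384, 3, padding='same', activation='relu'), # 13 x 13 x 384
    layers.Conv2D(256, 3, padding='same', activation='relu'), # 13 x 13 x 256
    layers.MaxPooling2D(pool_size=3, strides=2),              # 6 x 6 x 256

    # FC6, FC7, FC8 with dropout (0.5) after the first two FC layers
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax'),                 # class scores
])

model.summary()
```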

Important Points:

● Uses ReLU (Rectified Linear Unit) for the non-linear part, instead of a Tanh or Sigmoid function, which
  was the earlier standard for traditional neural networks.
● The ReLU non-linearity is applied to the output of every convolutional layer and fully connected layer.
  Over-fitting is reduced by using a Dropout layer after the first two FC layers.
● Rectified Linear Units (first use), overlapping pooling, dropout (0.5) trick to avoid overfitting.
● Layer 1 (Convolutional) : 55*55*96 = 290,400 neurons, each with 11*11*3 = 363 weights and 1 bias, i.e.,
  290,400 * 364 = 105,705,600 parameters on the first layer of AlexNet alone if weights were not shared
  (with parameter sharing this drops to 96 * 364 = 34,944).
● Training on multiple GPUs ( 2 NVIDIA GTX 580 3 GB GPUs ) for 5-6 days.
  Top-1 and top-5 error rates decrease by 1.7% and 1.2% respectively, compared to a net trained with
  one GPU and half the neurons!
● Local Response Normalization
  Response normalization reduces top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
● Overlapping Pooling ( stride s, pooling window z x z, with s < z ). Compared to the non-overlapping scheme s = 2, z = 2,
  top-1 and top-5 error rates decrease by 0.4% and 0.3%, respectively.
  Overlapping pooling also makes it slightly harder to overfit.
● Reducing Overfitting
  Heavy Data Augmentation!
    - 60 million parameters, 650,000 neurons (it overfits a lot.)
    - Crop 224x224 patches (and their horizontal reflections.)
    - At test time, average the predictions on the 10 patches (see the sketch after this list).
● Reducing Overfitting
    - Dropout
● Stochastic Gradient Descent (SGD) Learning
● batch size = 128
● 96 Convolutional Kernels in CONV1 ( 11 x 11 x 3 size kernels ):
    - top 48 kernels on GPU 1 : largely color-agnostic
    - bottom 48 kernels on GPU 2 : largely color-specific.
● CONV2, CONV4 & CONV5 connect only to kernel maps on the same GPU; CONV3 and the FC layers connect
  to all feature maps in the preceding layers.
● In the paper, they say "Depth is really important. Removing a single convolutional layer degrades
  the performance."
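As a concrete illustration of the test-time 10-patch averaging mentioned in the list above, here is a hedged sketch (the helper name ten_crop_predict is hypothetical; model is assumed to take inputs of the chosen crop size, and the paper itself crops 224 x 224 patches from 256 x 256 images):

```python
import numpy as np

def ten_crop_predict(model, image, crop=224):
    # Average the model's predictions over the four corner crops, the centre
    # crop, and their horizontal reflections (10 patches in total).
    h, w, _ = image.shape
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    patches = []
    for top, left in offsets:
        patch = image[top:top + crop, left:left + crop]
        patches.append(patch)
        patches.append(patch[:, ::-1])        # horizontal reflection
    preds = model.predict(np.stack(patches))  # shape: (10, num_classes)
    return preds.mean(axis=0)                 # averaged class scores
```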

Practical:

Net     | Backend    | Weights
------- | ---------- | -------
AlexNet | TensorFlow | Weights
AlexNet | Caffe      | Weights

Let's build AlexNet in Keras ('tf' backend) and test it on the COCO dataset. We will implement all three methods and train on the dataset. The three methods are:

  1. Train from Scratch ( End2End )
  2. Transfer Learning
  3. Feature Extraction

I have explained here what the three methods mean and how each is done.
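As a rough sketch of methods 2 and 3 (transfer learning and feature extraction) in Keras: since Keras does not ship a pretrained AlexNet, the bundled VGG16 ImageNet weights are used below purely as a stand-in base, and num_classes is a hypothetical placeholder for the target dataset:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 80  # hypothetical number of target categories

# Feature extraction: freeze the pretrained convolutional base and train
# only the new classifier head on top of its features.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax'),
])

# Transfer learning / fine-tuning: after the new head has converged, unfreeze
# (part of) the base and keep training with a small learning rate, e.g.:
#   base.trainable = True

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```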

● Approach: Will update soon. Training ...

RESULTS:

References

CS231n : Lecture 9 | CNN Architectures - AlexNet

If you find something amusing, don't forget to share with us. Create an issue and let us know.
