
Dimensionality-Reduction-Technique-PCA-LDA-ICA-SVD

DIMENSIONALITY REDUCTION

  • Many machine learning problems have thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution
  • Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it speeds up training, it may also make your system perform slightly worse
  • For example, in face recognition, the size of a training image patch is usually larger than 60 x 60, which corresponds to a vector with more than 3600 dimensions
  • In some cases, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won't; it will just speed up training)

Practical reasons

  • Redundancy reduction and intrinsic structure discovery
  • Removal of irrelevant and noisy features
  • Feature extraction
  • Visualization purpose
  • Computational efficiency, from a machine learning perspective

PCA (Principal Component Analysis)

  • PCA is by far the most popular dimensionality reduction algorithm in use
  • The main idea is to reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the dataset as possible. This is done by transforming the variables into a new set of variables known as Principal Components
  • PCA is basically linear algebra.
  • A simple example will help to understand what it is and how it works.
  • If we take a 100x2 matrix
  • Then we have two choices
    • To standardize this data
    • To leave it unstandardized

So first we are going to work through the standardized case

Now, to implement PCA

  • Step1:
import numpy as np

# generating random data of shape 100x2: heights (in m) and weights (in kg)
height = np.round(np.random.normal(1.75, 0.20, 100), 2)
weight = np.round(np.random.normal(60.32, 15, 100), 2)
Data = np.column_stack((height, weight))
print("printing the Data:" + str(Data))
  • Step2:

Now find the mean of this Data, column-wise

Mean = np.mean(Data, axis=0)
print("Mean of this Data:" + str(Mean))
  • Step3:

Now find the standard deviation of the Data

Std = np.std(Data, axis=0)
print("Standard Deviation of this Data:" + str(Std))
  • Step4:

Now standardize the data and find the covariance matrix

stdData = (Data - Mean) / Std
print("Our standardized matrix is:" + str(stdData))
print(stdData.shape)

Find the covariance matrix

covData = np.cov(stdData.T)
print("Our covariance matrix is:" + str(covData))
  • Step5:

Find the eigenvalues and eigenvectors

from numpy.linalg import eig

values, vectors = eig(covData)
print(values)
print(vectors)
  • Step6:
# pair each eigenvalue with its eigenvector
pairs = [(np.abs(values[i]), vectors[:, i]) for i in range(len(values))]
print(pairs)
print("------------------------------------------------------------------------------------------------------------------")
# sort the pairs by eigenvalue, in descending order
pairs.sort(key=lambda x: x[0], reverse=True)
print(pairs)

Above, we have paired the eigenvalues with their eigenvectors and sorted the pairs by eigenvalue.

When we took the MNIST dataset (60000x784) and applied the lines of code above, we got NaN errors during standardization. This happens because some pixel columns in MNIST are constant (for example, border pixels that are always zero), so their standard deviation is zero and the division produces NaN.
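The walkthrough stops at sorting the eigenpairs, so here is a minimal sketch of the remaining projection step, plus a guard for the zero-variance columns that break standardization on MNIST. The names k, W, and projected are illustrative, not from the original code:

# guard against zero-variance columns (e.g. MNIST border pixels) before
# standardizing, so (Data - Mean) / Std cannot produce NaN:
Std[Std == 0] = 1.0

# build a projection matrix from the top-k eigenvectors (here k = 1)
k = 1
W = np.column_stack([pairs[i][1] for i in range(k)])

# project the standardized data onto the principal component(s)
projected = stdData.dot(W)
print("Projected data shape:" + str(projected.shape))  # (100, 1)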

LDA (Linear Discriminant Analysis)

  • PCA mainly focuses on capturing the most variation among all the variables.
  • In LDA we are interested in maximizing the separability between all the known categories.
  • LDA projects the data in the way that maximizes the separation between the categories.

Two criteria

  • Maximize the distance between the means of the categories

  • Minimize the scatter within each category

  • Step1: Compute the between-class variance (the between-class scatter matrix)

  • Step2: Compute the within-class variance (the within-class scatter matrix)

  • Step3: Construct the lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance

  • Step4: Projection, as in the sketch below
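These four steps can be written out directly in NumPy. Below is a minimal from-scratch sketch; the function name lda_fit and its signature are my own illustration, not code from this repository:

import numpy as np

def lda_fit(X, y, n_components):
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    # Step1: between-class scatter matrix S_B
    # Step2: within-class scatter matrix S_W
    S_B = np.zeros((n_features, n_features))
    S_W = np.zeros((n_features, n_features))
    for c in classes:
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += X_c.shape[0] * diff.dot(diff.T)
        S_W += (X_c - mean_c).T.dot(X_c - mean_c)

    # Step3: eigen-decompose inv(S_W) @ S_B and keep the top eigenvectors
    values, vectors = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
    order = np.argsort(values.real)[::-1]
    W = vectors[:, order[:n_components]].real

    # Step4: project the data onto the new axes
    return X.dot(W)

For example, on the iris dataset lda_fit(iris.data, iris.target, 2) returns a 150x2 projection comparable to the scikit-learn result shown below.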

PCA vs LDA

  • PCA - UNSUPERVISED (ignores class labels)
  • LDA - SUPERVISED (uses class labels to maximize separability)

Now that we have seen both methods, let's compare them on various datasets (wine, digits, and iris) and visualize the plots of the results.

Here we are going to use scikit-learn's datasets module and its decomposition functions for PCA and LDA.

- Importing the datasets (run one of the three blocks at a time)
from sklearn import datasets

# for the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# for the wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target
target_names = wine.target_names
# for the digits dataset
digits = datasets.load_digits()
X = digits.data
y = digits.target
target_names = digits.target_names
  • Calculating PCA and LDA with the help of scikit-learn library functions
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#for PCA
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
print(X_r)
#for LDA
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
print(X_r2)

Here, n_components for LDA cannot be larger than min(n_features, n_classes - 1). For the iris dataset, for example, this gives min(4, 3 - 1) = 2 components.
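A quick way to see this constraint in action, assuming X and y are the iris features and labels loaded above and a recent scikit-learn version:

# asking for 3 components with 4 features and 3 classes exceeds the limit
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as e:
    print(e)  # n_components cannot be larger than min(n_features, n_classes - 1)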

Now, to differentiate between the two, I have made plots of both results; from them you can see the difference.

  • lda_vs_pca plots of these three datasets:

Digits dataset PCA plot

Digits dataset LDA plot

IRIS dataset PCA plot

IRIS dataset LDA plot

WINE dataset PCA plot

WINE dataset LDA plot
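For reference, here is a minimal sketch of how such a scatter plot can be produced, assuming X_r, X_r2, y, and target_names from the snippets above:

import matplotlib.pyplot as plt

# scatter each class in its own colour, for the PCA and LDA projections
for X_proj, title in [(X_r, "PCA projection"), (X_r2, "LDA projection")]:
    plt.figure()
    for i, name in enumerate(target_names):
        plt.scatter(X_proj[y == i, 0], X_proj[y == i, 1], label=name, alpha=0.8)
    plt.legend()
    plt.title(title)
plt.show()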

