- Many machine learning problems have thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution
- Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse
- For example, in face recognition, a training image patch is usually at least 60 × 60 pixels, which corresponds to a vector with more than 3,600 dimensions
- In some cases, reducing the dimensionality of the training data may filter out noise and unnecessary details and thus result in higher performance (but in general it won't; it will just speed up training)
- Redundancy reduction
- Intrinsic structure discovery
- Removal of irrelevant and noisy features
- Feature extraction
- Visualization
- Computational efficiency and better machine-learning performance
- PCA is by far the most popular dimensionality reduction algorithm in use
- The main idea is to reduce the dimensionality of a data set consisting of many variables correlated with each other, either heavily or lightly, while retaining as much of the variation present in the data set as possible. This is done by transforming the variables into a new set of variables known as principal components
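As a quick illustration of "retaining the variation": when two variables are heavily correlated, the first principal component alone captures nearly all of the variance. A minimal sketch with synthetic data (the correlation strength and seed are arbitrary choices for illustration):

```python
import numpy as np
from numpy.linalg import eig

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.1, size=500)  # y is heavily correlated with x
data = np.column_stack((x, y))

# Eigenvalues of the covariance matrix give the variance along each component
values, _ = eig(np.cov(data.T))
ratio = values.max() / values.sum()
print(round(ratio, 3))  # close to 1: one component retains nearly all the variance
```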
- PCA is built on linear algebra.
- A simple example will help to understand what it is and how it works.
- If we take a 100×2 matrix
- Then we have two choices
- Standardize the data
- Work without standardization
So first we are going to work through the version with standardization
Now, implementing PCA step by step
- Step1:
# generating random data of size 100x2
import numpy as np

height = np.round(np.random.normal(1.75, 0.20, 100), 2)
weight = np.round(np.random.normal(60.32, 15, 100), 2)
Data = np.column_stack((height, weight))
print("printing the Data:")
print(Data)
- Step2:
Now find the mean of the data column-wise
Mean = np.mean(Data, axis=0)
print("Mean of this Data:" + str(Mean))
- Step3:
Now find the standard deviation of the data
Std = np.std(Data, axis=0)
print("Standard Deviation of this Data:" + str(Std))
- Step4:
Now standardize the data and find the covariance matrix
stdData = (Data - Mean) / Std
print("Our standardized matrix is: " + str(stdData))
print(stdData.shape)
Find the covariance matrix
covData = np.cov(stdData.T)
print("Our Co-variance matrix is:" + str(covData))
- Step5:
Find the eigenvalues and eigenvectors
from numpy.linalg import eig

values, vectors = eig(covData)
print(values)
print(vectors)
- Step6:
pairs=[(np.abs(values[i]), vectors[:,i]) for i in range(len(values))]
print(pairs)
print("------------------------------------------------------------------------------------------------------------------")
pairs.sort(key=lambda x: x[0], reverse = True)
print(pairs)
Above we have paired the eigenvalues with their eigenvectors and sorted the pairs by eigenvalue, largest first
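The sorted eigenpairs are then used to project the data onto the top components. A minimal, self-contained sketch of this final projection step (using a small synthetic data set and keeping only the top component; the sizes and seed are arbitrary):

```python
import numpy as np
from numpy.linalg import eig

# Synthetic 100x2 data set, standardized as in the steps above
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))
std_data = (data - data.mean(axis=0)) / data.std(axis=0)

cov = np.cov(std_data.T)
values, vectors = eig(cov)

# Pair eigenvalues with eigenvectors and sort, largest eigenvalue first
pairs = sorted(
    [(np.abs(values[i]), vectors[:, i]) for i in range(len(values))],
    key=lambda p: p[0], reverse=True,
)

# Keep the top-k eigenvectors as the projection matrix W
k = 1
W = np.column_stack([pairs[i][1] for i in range(k)])  # shape (2, 1)

# Project the standardized data into the lower-dimensional space
projected = std_data @ W                              # shape (100, 1)
print(projected.shape)  # (100, 1)
```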
Now, when we take the MNIST data set (60000×784) and apply the lines of code above, we get NaN errors during standardization. This happens because some pixel columns in MNIST are constant (for example, border pixels that are always zero), so their standard deviation is zero and the division produces NaNs.
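One simple way to avoid those NaNs is to guard against zero standard deviation before dividing. A sketch on a toy matrix that mimics MNIST's problem (the data here is made up for illustration; the first column stands in for an always-black border pixel):

```python
import numpy as np

# Toy data: the first column is constant, so its standard deviation is 0
X = np.array([[0.0, 1.0, 2.0],
              [0.0, 3.0, 4.0],
              [0.0, 5.0, 6.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0          # guard: avoid dividing by zero for constant columns
X_std = (X - mean) / std

# Constant columns become all zeros instead of NaNs
print(np.isnan(X_std).any())  # False
```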
- PCA mainly focuses on the directions of greatest variation among all the variables.
- In LDA we are interested in maximizing the separability between all the known categories.
- LDA projects the data in a way that maximizes the separation between the categories.
- Maximize the distance between the means of the categories
- Minimize the scatter within each category
- Step 1: Between-class variance (between-class scatter matrix)
- Step 2: Within-class variance (within-class scatter matrix)
- Step 3: Construct a lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance
- Step 4: Projection
- PCA - UNSUPERVISED
- LDA - SUPERVISED
Now that we have seen both methods, let's compare them on several data sets (wine, digits, and iris) and visualize the resulting plots.
Here we are going to use sklearn's datasets module, along with PCA from sklearn.decomposition and LinearDiscriminantAnalysis from sklearn.discriminant_analysis.
- Importing the data sets
from sklearn import datasets

# for the iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# for the wine data set
wine = datasets.load_wine()
X = wine.data
y = wine.target
target_names = wine.target_names

# for the digits data set
digits = datasets.load_digits()
X = digits.data
y = digits.target
target_names = digits.target_names
- Calculating PCA and LDA with the help of sklearn library function
# for PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
print(X_r)

# for LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
print(X_r2)
Here, for LDA, n_components cannot be larger than min(n_features, n_classes - 1). Taking the iris data set as an example: min(n_features, n_classes - 1) = min(4, 3 - 1) = 2 components.
Now, to differentiate between the two, I have made plots of both results, from which you can see the difference.
- lda_vs_pca plot of these three data sets
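A minimal plotting sketch for the iris data set (the wine and digits data sets follow the same pattern; the file name and figure layout are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs as a script
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y, target_names = iris.data, iris.target, iris.target_names

X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)

# Side-by-side scatter plots: PCA on the left, LDA on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for c, name in enumerate(target_names):
    ax1.scatter(X_pca[y == c, 0], X_pca[y == c, 1], label=name, alpha=0.7)
    ax2.scatter(X_lda[y == c, 0], X_lda[y == c, 1], label=name, alpha=0.7)
ax1.set_title("PCA of iris")
ax2.set_title("LDA of iris")
ax1.legend()
ax2.legend()
fig.savefig("lda_vs_pca_iris.png")
```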