PCA Implementation and Visualization in in Python and Numpy

Introduction

In this lab, we shall implement PCA, mainly using Numpy library in Python. Here, our desired outcome of the PCA is to project a feature space on a smaller subspace that preserves the variance of data. We shall look into performing linear data reduction and visualization with PCA. A possible application of this technique would be pattern classification, while reducing the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space, by extracting a subspace that describes the variance of complete dataset.

Objectives

You will be able to :

Implement Principal Component Analysis using Python and Numpy
Generate Multivariate datasets using covariance and mean matrices
Create 3 dimensional scatter plots using numpy and matplotlib
Visualize the results of PCA (principal components) in a low dimensional space

The PCA approach

In this lab, we shall look at following six key stages involved in a PCA experiment.

Start with a dataset consisting of d-dimensional samples. (This does not include the target variable)
Compute the mean for every dimension of the whole dataset as a d-dimensional mean vector
Compute the covariance matrix of the whole data set
Compute eigenvectors and corresponding eigenvalues
Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix
Use the eigenvector matrix to transform the samples onto the new subspace.

Let's move on and perform our basic step i.e. acquiring data.

Stage 1: Creating a Multivariate Dataset

For this experiment, we shall create a 3-dimensional sample i.e. three features. We will generate 40 samples randomly drawn from a multivariate Gaussian distribution. Also, note that the samples come from two different classes, where one half (i.e., 20) samples of our data set belong to class 1 and the other half to class 2.

Sampling from Normal Distribution

Below , $\mu1$ and $\mu2$ represent the sample means and $\Sigma1$, $\Sigma2$, covariance matrices for two classes.

$\mu1=\begin{bmatrix}0\0\0\end{bmatrix}$, $\mu2=\begin{bmatrix}1\1\1\end{bmatrix}$

$\Sigma1 = \begin{bmatrix}1\quad 0\quad 0\0\quad 1\quad0\0\quad0\quad1\end{bmatrix}$, $\Sigma2 = \begin{bmatrix}1\quad 0\quad 0\0\quad 1\quad0\0\quad0\quad1\end{bmatrix}$

Using above mean and covariance matrices:

Create mean amd covariance matrices from above data
Generate total 40 (20x2 for class 1 and 2) 3-dimensional samples randomly drawn from a multivariate Gaussian distribution.
Samples should belong to from different classes with a 50/50 distribution.

Note: Visit this link to get an idea on sampling from a multivariate gaussian distribution using given covariance matrix and feature means. Use numpy.random.multivariate_normal(mean, cov_matrix, num_samples) as described in the link.

# your code here

((3, 20), (3, 20))

Visualize the Dataset

We can now plot both of these samples on a 3d scatter matrix showing data along x, y and z dimension. Here is a reference plot with accompanying code to help you get going. USe Axes3D for defining data in a 3 dimensional space.

Create a 3d scatter plot showing both classes

# your code here

Combine Classes into a single dataset

As PCA is an unsupervised learning algorithm i.e. it doesn't require a target (or label) to learn the variance of features, So we can ignore the class labels and combine both datasets.

Concatenate feature sets for both classes into a single dataset. (Remember to perform the concatenation along the rows, as we dont want to add new features)

# your code here

samples = None

(3, 40)

Great, so now we have our final dataset, ready for analysis. This was our Stage 1 of PCA. Let's move on the stage 2.

Stage 2: Computing the 3-dimensional Mean Vector

This is a simple one. For the dataset we have created above, we need to calculate the mean of each feature and save it as a vector.

Calculate the mean of all three dimensions in samples and store as a vector of three values in the correct order.
Show the mean vector in the output

# your code here

Mean Vector:
[[0.65193393]
 [0.59422553]
 [0.37430913]]

Stage 3: Computing Covariance Matrix

The next step is to calculate the covariance matrix for our samples data using the in-built numpy.cov() function.

Calculate covariance matrix for the dataset by passing all three dimensions of data to np.cov()

# your code here

cov_matrix = None

Covariance Matrix:
[[1.63206317 0.51610568 0.35210596]
 [0.51610568 1.24413622 0.42949518]
 [0.35210596 0.42949518 1.39279619]]

Stage 4: Perform Eigen-Decomposition

We shall now perform the eigen-decomposition of the covariance matrix computed in the last stage. We shall make use of numpy.linalg.eig(cov) as seen earlier.

Perform Eigen-Decomposition to calculate eigenvectors and eigenvalues

# Your code here 

# eigen_value, eigen_vector =


# Eigen values: array([2.30703911, 1.14520348, 0.81675298]),
# Eigen Vectors: [[-0.67430892, -0.64746073,  0.35510854],
#                 [-0.53366192,  0.09488555, -0.84035807],
#                 [-0.51040418,  0.75616885,  0.40950731]]))

Checking the eigenvector-eigenvalue calculation

Let us quickly check that the eigenvector-eigenvalue calculation is correct and satisfy the equation

$$\Sigma v = \lambda v$$

where sigma is the covariance matrix , v is the eigenvector and lambda is the eigen value. As the value may not be exactly the same on both sides of the equation, we will use np.testing.assert_array_almost_equal() function to set a desired level acceptance for equality. Visit here to learn more about this function.

Check for the calculated eigenvectors and eigenvalues, if the R.H.S and L.H.S of above equation are almost equal.

# your code here


# No output - no exception raised . We are good to move forward

Visualize Eigenvectors - Optional

This step is not necessary for PCA but it is a good idea to visualize the eigen vectors and visually see their direction. Here is a code for visualizing the eigen vectors along with our data.

Uncomment to run below cell, once you have above stages fulfilled

# Uncomment and run below to visualize your principal components 

# %matplotlib inline

# from matplotlib import pyplot as plt
# from mpl_toolkits.mplot3d import Axes3D
# from mpl_toolkits.mplot3d import proj3d
# from matplotlib.patches import FancyArrowPatch


# class Arrow3D(FancyArrowPatch):
#     def __init__(self, xs, ys, zs, *args, **kwargs):
#         FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
#         self._verts3d = xs, ys, zs

#     def draw(self, renderer):
#         xs3d, ys3d, zs3d = self._verts3d
#         xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
#         self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
#         FancyArrowPatch.draw(self, renderer)

# fig = plt.figure(figsize=(10,10))
# ax = fig.add_subplot(111, projection='3d')

# ax.plot(samples[0,:], samples[1,:], samples[2,:], 'o', markersize=8, color='blue', alpha=0.2)
# ax.plot([mu_x], [mu_y], [mu_z], 'o', markersize=10, color='red', alpha=0.5)
# for v in eigen_vector.T:
#     a = Arrow3D([mu_x, v[0]], [mu_y, v[1]], [mu_z, v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
#     ax.add_artist(a)
# ax.set_xlabel('x_values')
# ax.set_ylabel('y_values')
# ax.set_zlabel('z_values')

# plt.title('Eigenvectors')

# plt.show()

Stage 5: Sort and Select `k` Largest EigenVectors

The eigenvectors can only define the directions of the new axis, since they have all the same unit length 1. To decide which eigenvectors we want to keep in our lower dimensional space, we consider the eigenvalues. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data (and vice versa), and those are the ones we want to drop.

Rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors.

# Your code here

[(2.3070391145943794, array([-0.67430892, -0.53366192, -0.51040418])),
 (1.1452034776669042, array([-0.64746073,  0.09488555,  0.75616885])),
 (0.8167529808821581, array([ 0.35510854, -0.84035807,  0.40950731]))]

`dxk` dimensional matrix

For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix (d is the total number of dimensions i.e. 3 and k is the number of new dimensions used here i.e. 2).

Use np.reshape() and np.hstack to reshape and combine eigenvectors having top 2 eigenvalues

# Your code here

Matrix:
 [[-0.67430892 -0.64746073]
 [-0.53366192  0.09488555]
 [-0.51040418  0.75616885]]

Stage 6: Transform Samples to Low-Dimensional Subspace

In the last step, we shall use the matrix WW that we just computed to transform our samples onto the new subspace by dot multiplying it with original samples set.

Take a dot product of matrix calculated above with complete feature set samples.

# Your code here

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	0	1
0	0.109119	0.116520
1	1.112799	-0.986342
2	1.475209	0.375554
3	0.245209	0.556289
4	-0.395697	-0.834696

Visualize low dimensional data

SO here we see that dataset has been transformed into a low dimensional space. Let's try to visualize this dataset with respect to the class distribution we initially started off with. Remember that out of 40 samples, first 20 samples belong to class A and last 20 belong to class b.

Draw a histogram for class 1 [0:20] and class 2 [20:40] from the transformed dataset
Use markers for both classes and label accordingly

# Your code here

Here it is , we have successfully reduced the dimensions of the given dataset from 3 to 2, along with visualizing the data at every step. Next we shall look at performing PCA in sci-kit learn with some real life datasets for a number of applications.

Summary

In this lab we looked at implementing PCA in Python and Numpy. We looked at different stages that are involved in running a PCA analysis in detail. We also used 3D and 2D visualizations along the way to get a better intuition of the whole process. Next we shall see how to repeat this exercise in sk-learn.

robillersomeone / dsc-3-34-07-pca-implementation-visualization-python-numpy-nyc-career-ds-102218 Goto Github PK

dsc-3-34-07-pca-implementation-visualization-python-numpy-nyc-career-ds-102218's Introduction