In this lab, we shall implement PCA, mainly using Numpy library in Python. Here, our desired outcome of the PCA is to project a feature space on a smaller subspace that preserves the variance of data. We shall look into performing linear data reduction and visualization with PCA. A possible application of this technique would be pattern classification, while reducing the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space, by extracting a subspace that describes the variance of complete dataset.
You will be able to :
- Implement Principal Component Analysis using Python and Numpy
- Generate Multivariate datasets using covariance and mean matrices
- Create 3 dimensional scatter plots using numpy and matplotlib
- Visualize the results of PCA (principal components) in a low dimensional space
In this lab, we shall look at following six key stages involved in a PCA experiment.
- Start with a dataset consisting of d-dimensional samples. (This does not include the target variable)
- Compute the mean for every dimension of the whole dataset as a d-dimensional mean vector
- Compute the covariance matrix of the whole data set
- Compute eigenvectors and corresponding eigenvalues
- Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix
- Use the eigenvector matrix to transform the samples onto the new subspace.
Let's move on and perform our basic step i.e. acquiring data.
For this experiment, we shall create a 3-dimensional sample i.e. three features. We will generate 40 samples randomly drawn from a multivariate Gaussian distribution. Also, note that the samples come from two different classes, where one half (i.e., 20) samples of our data set belong to class 1 and the other half to class 2.
Below ,
$\mu1=\begin{bmatrix}0\0\0\end{bmatrix}$, $\mu2=\begin{bmatrix}1\1\1\end{bmatrix}$
$\Sigma1 = \begin{bmatrix}1\quad 0\quad 0\0\quad 1\quad0\0\quad0\quad1\end{bmatrix}$, $\Sigma2 = \begin{bmatrix}1\quad 0\quad 0\0\quad 1\quad0\0\quad0\quad1\end{bmatrix}$
Using above mean and covariance matrices:
- Create mean amd covariance matrices from above data
- Generate total 40 (20x2 for class 1 and 2) 3-dimensional samples randomly drawn from a multivariate Gaussian distribution.
- Samples should belong to from different classes with a 50/50 distribution.
Note: Visit this link to get an idea on sampling from a multivariate gaussian distribution using given covariance matrix and feature means. Use numpy.random.multivariate_normal(mean, cov_matrix, num_samples)
as described in the link.
# your code here
((3, 20), (3, 20))
We can now plot both of these samples on a 3d scatter matrix showing data along x, y and z dimension. Here is a reference plot with accompanying code to help you get going. USe Axes3D
for defining data in a 3 dimensional space.
- Create a 3d scatter plot showing both classes
# your code here
As PCA is an unsupervised learning algorithm i.e. it doesn't require a target (or label) to learn the variance of features, So we can ignore the class labels and combine both datasets.
- Concatenate feature sets for both classes into a single dataset. (Remember to perform the concatenation along the rows, as we dont want to add new features)
# your code here
samples = None
(3, 40)
Great, so now we have our final dataset, ready for analysis. This was our Stage 1 of PCA. Let's move on the stage 2.
This is a simple one. For the dataset we have created above, we need to calculate the mean of each feature and save it as a vector.
- Calculate the mean of all three dimensions in
samples
and store as a vector of three values in the correct order. - Show the mean vector in the output
# your code here
Mean Vector:
[[0.65193393]
[0.59422553]
[0.37430913]]
The next step is to calculate the covariance matrix for our samples
data using the in-built numpy.cov()
function.
- Calculate covariance matrix for the dataset by passing all three dimensions of data to
np.cov()
# your code here
cov_matrix = None
Covariance Matrix:
[[1.63206317 0.51610568 0.35210596]
[0.51610568 1.24413622 0.42949518]
[0.35210596 0.42949518 1.39279619]]
We shall now perform the eigen-decomposition of the covariance matrix computed in the last stage. We shall make use of numpy.linalg.eig(cov)
as seen earlier.
- Perform Eigen-Decomposition to calculate eigenvectors and eigenvalues
# Your code here
# eigen_value, eigen_vector =
# Eigen values: array([2.30703911, 1.14520348, 0.81675298]),
# Eigen Vectors: [[-0.67430892, -0.64746073, 0.35510854],
# [-0.53366192, 0.09488555, -0.84035807],
# [-0.51040418, 0.75616885, 0.40950731]]))
Let us quickly check that the eigenvector-eigenvalue calculation is correct and satisfy the equation
where sigma is the covariance matrix , v is the eigenvector and lambda is the eigen value. As the value may not be exactly the same on both sides of the equation, we will use np.testing.assert_array_almost_equal()
function to set a desired level acceptance for equality. Visit here to learn more about this function.
- Check for the calculated eigenvectors and eigenvalues, if the R.H.S and L.H.S of above equation are almost equal.
# your code here
# No output - no exception raised . We are good to move forward
This step is not necessary for PCA but it is a good idea to visualize the eigen vectors and visually see their direction. Here is a code for visualizing the eigen vectors along with our data.
- Uncomment to run below cell, once you have above stages fulfilled
# Uncomment and run below to visualize your principal components
# %matplotlib inline
# from matplotlib import pyplot as plt
# from mpl_toolkits.mplot3d import Axes3D
# from mpl_toolkits.mplot3d import proj3d
# from matplotlib.patches import FancyArrowPatch
# class Arrow3D(FancyArrowPatch):
# def __init__(self, xs, ys, zs, *args, **kwargs):
# FancyArrowPatch.__init__(self, (0,0), (0,0), *args, **kwargs)
# self._verts3d = xs, ys, zs
# def draw(self, renderer):
# xs3d, ys3d, zs3d = self._verts3d
# xs, ys, zs = proj3d.proj_transform(xs3d, ys3d, zs3d, renderer.M)
# self.set_positions((xs[0],ys[0]),(xs[1],ys[1]))
# FancyArrowPatch.draw(self, renderer)
# fig = plt.figure(figsize=(10,10))
# ax = fig.add_subplot(111, projection='3d')
# ax.plot(samples[0,:], samples[1,:], samples[2,:], 'o', markersize=8, color='blue', alpha=0.2)
# ax.plot([mu_x], [mu_y], [mu_z], 'o', markersize=10, color='red', alpha=0.5)
# for v in eigen_vector.T:
# a = Arrow3D([mu_x, v[0]], [mu_y, v[1]], [mu_z, v[2]], mutation_scale=20, lw=3, arrowstyle="-|>", color="r")
# ax.add_artist(a)
# ax.set_xlabel('x_values')
# ax.set_ylabel('y_values')
# ax.set_zlabel('z_values')
# plt.title('Eigenvectors')
# plt.show()
The eigenvectors can only define the directions of the new axis, since they have all the same unit length 1. To decide which eigenvectors we want to keep in our lower dimensional space, we consider the eigenvalues. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data (and vice versa), and those are the ones we want to drop.
- Rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors.
# Your code here
[(2.3070391145943794, array([-0.67430892, -0.53366192, -0.51040418])),
(1.1452034776669042, array([-0.64746073, 0.09488555, 0.75616885])),
(0.8167529808821581, array([ 0.35510854, -0.84035807, 0.40950731]))]
For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we are combining the two eigenvectors with the highest eigenvalues to construct our d×k-dimensional eigenvector matrix (d is the total number of dimensions i.e. 3 and k is the number of new dimensions used here i.e. 2).
- Use
np.reshape()
andnp.hstack
to reshape and combine eigenvectors having top 2 eigenvalues
# Your code here
Matrix:
[[-0.67430892 -0.64746073]
[-0.53366192 0.09488555]
[-0.51040418 0.75616885]]
In the last step, we shall use the matrix WW that we just computed to transform our samples onto the new subspace by dot multiplying it with original samples set.
- Take a dot product of matrix calculated above with complete feature set
samples
.
# Your code here
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
0 | 1 | |
---|---|---|
0 | 0.109119 | 0.116520 |
1 | 1.112799 | -0.986342 |
2 | 1.475209 | 0.375554 |
3 | 0.245209 | 0.556289 |
4 | -0.395697 | -0.834696 |
SO here we see that dataset has been transformed into a low dimensional space. Let's try to visualize this dataset with respect to the class distribution we initially started off with. Remember that out of 40 samples, first 20 samples belong to class A and last 20 belong to class b.
- Draw a histogram for class 1 [0:20] and class 2 [20:40] from the transformed dataset
- Use markers for both classes and label accordingly
# Your code here
Here it is , we have successfully reduced the dimensions of the given dataset from 3 to 2, along with visualizing the data at every step. Next we shall look at performing PCA in sci-kit learn with some real life datasets for a number of applications.
In this lab we looked at implementing PCA in Python and Numpy. We looked at different stages that are involved in running a PCA analysis in detail. We also used 3D and 2D visualizations along the way to get a better intuition of the whole process. Next we shall see how to repeat this exercise in sk-learn
.