Sunday, January 28, 2024

PCA Introduction

PCA is a method that finds and sorts the main directions along which our data vary.

The data are charted in an X/Y graph, meaning we chose two (conventional) directions along which to describe our data. The two axes (or rather their directions) are, loosely speaking, the basis we use to compute the numerical values of the coordinates.

The direction of the axes is somewhat arbitrary: we can always choose other directions. For instance, we could find the direction along which our data vary most quickly and define that as the x-axis. It turns out that the direction along which the data vary the most is the first principal component.

Below is how we can compute this in Python:


import matplotlib.pyplot as plt
import numpy as np
from pandas import read_csv

# Read the data
data = read_csv('data.csv')
cs = data["X"].values
temp = data["Y"].values

# Subtract the mean: PCA needs the data centred at the origin
x = cs - cs.mean()
y = temp - temp.mean()

# Group the data into a single 2xN matrix
datamatrix = np.array([x, y])

# Calculate the covariance matrix
covmat = np.cov(datamatrix)

# Find eigenvalues and eigenvectors of the covariance matrix
w, v = np.linalg.eig(covmat)

# Get the index of the largest eigenvalue
maxeig = np.argmax(w)

# Eigenvectors are the columns of v, so the slope of the line through
# the origin along the largest eigenvector is v[1, maxeig] / v[0, maxeig]
m = v[1, maxeig] / v[0, maxeig]
line = m * x

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')

plt.quiver(0, 0, x[0], line[0], units='xy', scale=1, color='r', width=0.2)
plt.axis('equal')
plt.ylim((-18, 18))

plt.show()




Step 1, read the data and assign it to NumPy arrays.

Step 2, for PCA to work we need to subtract the mean from both coordinates, so that the data are centred at the origin of the x-y plane.

Step 3, group the data into a single array.

Step 4, calculate the covariance matrix of this array. Since we are dealing with a 2D dataset (bivariate data), the covariance matrix is 2×2.

Step 5, calculate the eigenvalues and eigenvectors of the covariance matrix.

Step 6, get the index of the largest eigenvalue. The first principal component we are looking for is the eigenvector corresponding to the largest eigenvalue.

Step 7, this one is only needed for plotting. We get the slope of the line parallel to the principal component.

Step 8, plot the first principal component on top of the data.
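Since the listing above depends on an external data.csv, here is a self-contained sketch of the same steps on synthetic data (the seed, sample size, and spreads are made up for illustration), so you can run it as-is:

```python
import numpy as np

# Hypothetical correlated 2D data: y follows 0.5*x plus some noise
rng = np.random.default_rng(0)
x = rng.normal(0, 10, 500)
y = 0.5 * x + rng.normal(0, 2, 500)

# Centre the data and build the 2xN data matrix
x = x - x.mean()
y = y - y.mean()
datamatrix = np.array([x, y])

# Covariance matrix and its eigendecomposition
covmat = np.cov(datamatrix)
w, v = np.linalg.eig(covmat)
maxeig = np.argmax(w)

# Eigenvectors are the columns of v; the slope of the first
# principal component should come out close to 0.5 by construction
m = v[1, maxeig] / v[0, maxeig]
print(m)
```

Because most of the variance here was generated along the line y = 0.5x, the recovered slope is a quick sanity check that the eigenvector of the largest eigenvalue really does point along the direction of maximum variance.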



scikit-learn provides a quick way of doing the same thing:


from sklearn.decomposition import PCA

# Pair the centred coordinates into (n_samples, n_features) form
datazip = list(zip(x, y))
pca = PCA(n_components=2)
pca.fit(datazip)

# Print the eigenvectors (one principal component per row)
print(pca.components_)
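Beyond `components_`, the fitted model also exposes `explained_variance_ratio_`, which reports the fraction of the total variance captured by each component. A self-contained sketch on hypothetical synthetic data (the seed and spreads are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated 2D data, mostly varying along y = 0.5*x
rng = np.random.default_rng(1)
x = rng.normal(0, 10, 300)
y = 0.5 * x + rng.normal(0, 2, 300)
data = np.column_stack((x, y))  # shape (n_samples, n_features)

pca = PCA(n_components=2)
pca.fit(data)

# One principal direction per row, sorted by explained variance
print(pca.components_)
# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)
```

For strongly correlated data like this, the first ratio dominates, which is exactly why keeping only the first principal component loses little information.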

