PCA (Principal Component Analysis) is a method that finds and sorts the main directions along which our data vary.
The data are charted on an X/Y graph, which means we have chosen two (conventional) directions along which to describe them. The two axes (or rather their directions) are, loosely speaking, the basis we use to compute the numerical values of the coordinates.
The direction of the axes is somewhat arbitrary, meaning we can always choose other directions. For instance, we could find the direction along which our data change most quickly and define that as the x-axis. It turns out that the direction along which the data vary the most is the first principal component.
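To make this concrete, here is a small sketch (using made-up toy data, not the dataset from this post) that measures the variance of a centred 2-D cloud when projected onto several candidate directions. The variance peaks at the direction of the first principal component:

```python
import numpy as np

# Toy centred data: points scattered mostly along the 45-degree direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
pts = np.column_stack([t, t + 0.3 * rng.normal(size=200)])
pts -= pts.mean(axis=0)

# Variance of the data projected onto a few candidate unit directions
for angle in [0, 30, 45, 60, 90]:
    rad = np.deg2rad(angle)
    direction = np.array([np.cos(rad), np.sin(rad)])
    var = np.var(pts @ direction)
    print(f"{angle:3d} deg: variance = {var:.2f}")
```

For this cloud the variance is largest near 45 degrees, which is exactly the direction PCA would return as the first principal component.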
Below is how we can compute this in Python:
import matplotlib.pyplot as plt
import numpy as np
from pandas import read_csv
# Read the data
data = read_csv('data.csv')
cs = data["X"].values
temp = data["Y"].values
# Take away the mean
x = cs - cs.mean()
y = temp - temp.mean()
# Group the data into a single matrix
datamatrix = np.array([x,y])
# Calculate the covariance matrix
covmat = np.cov(datamatrix)
# Find eigenvalues and eigenvectors of the covariance matrix
w,v = np.linalg.eig(covmat)
# Get the index of the largest eigenvalue
maxeig = np.argmax(w)
# Get the slope of the line through the origin along the largest eigenvector
# (note that the eigenvectors are the columns of v)
m = v[1, maxeig]/v[0, maxeig]
line = m*x
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.quiver(0,0, x[0], line[0], units = 'xy', scale = 1, color='r', width = 0.2)
plt.axis('equal')
plt.ylim((-18,18))
plt.show()
Step 1: read the data and assign it to NumPy arrays.
Step 2: for PCA to work we need to subtract the mean from both coordinates, i.e. we want the data to be centred at the origin of the x-y plane.
Step 3: group the data into a single array.
Step 4: calculate the covariance matrix of this array. Since we are dealing with a 2D dataset (bivariate data), the covariance matrix will be 2×2.
Step 5: calculate the eigenvalues and eigenvectors of the covariance matrix.
Step 6: get the index of the largest eigenvalue. The first principal component we are looking for is the eigenvector corresponding to the largest eigenvalue.
Step 7: this one is just needed for plotting. We get the slope of the line that is parallel to the principal component.
Step 8: now we just need to plot the first principal component on top of the data.
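The steps above can be condensed into a short self-contained sketch (the slope-2 synthetic dataset here is an assumption for illustration). One detail worth stressing: np.linalg.eig returns the eigenvectors as the columns of v, so the first principal component is v[:, maxeig], not a row of v:

```python
import numpy as np

# Synthetic centred data with a known dominant direction of slope ~2
rng = np.random.default_rng(1)
t = rng.normal(size=500)
x = t - t.mean()
y = 2 * t + 0.1 * rng.normal(size=500)
y = y - y.mean()

# Covariance matrix of the 2D dataset (2x2)
covmat = np.cov(np.array([x, y]))
# Eigen-decomposition: eigenvectors are the COLUMNS of v
w, v = np.linalg.eig(covmat)
maxeig = np.argmax(w)
# Slope of the first principal component
m = v[1, maxeig] / v[0, maxeig]
print(m)  # close to 2, the slope we built into the data
```

If we had indexed a row of v instead of a column, the recovered slope would be wrong whenever the two principal directions differ, which is why the indexing matters.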
scikit-learn provides a quick way of doing all this.
from sklearn.decomposition import PCA
datazip = list(zip(x,y))
pca = PCA(n_components=2)
pca.fit(datazip)
# Print the eigenvectors
print(pca.components_)
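As a quick sanity check (again on assumed toy data rather than the dataset from this post), the rows of pca.components_ are the principal directions sorted by explained variance, and explained_variance_ratio_ tells us how dominant the first one is:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy centred data with a dominant direction of slope ~2
rng = np.random.default_rng(2)
t = rng.normal(size=300)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=300)])
X = X - X.mean(axis=0)

pca = PCA(n_components=2)
pca.fit(X)
# Rows of components_ are the principal directions, sorted by explained variance
print(pca.components_)
# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

For data this elongated, the first component captures nearly all the variance, and the slope of the first row of components_ matches the slope we would get from the manual eigen-decomposition above (possibly up to a sign, since an eigenvector and its negative span the same direction).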