Unsupervised Learning: PCA
We use the Einstein summation notation in this article.
Principal Component Analysis (PCA) is a commonly used trick for dimensionality reduction so that the new features represents most of the variances of the data.
Representations of Dataset
Give a dataset of $n$ features and $m$ rows,
\begin{equation}
\mathbf X = \begin{pmatrix}
\mathbf X_1
\mathbf X_2
\cdots
\mathbf X_n
\end{pmatrix},
\end{equation}
where each $\mathbf X_i$ is a $m$ dimensional column vector.
We are looking for a $n\times p$ matrix $\mathbf P$ so that
where $i\in [1, m]$, $k\in [1,p]$, and $j\in [1, n]$. This projection will select out $p$ features.
Vector Notation Formalism
The coefficients $P_{jk}$ are called loadings.
What is PCA
The transformation $\mathbf P$ select out features that represents the most variances. How are the variances represented? The covariance matrix sheds light on the problem.
In an ideal condition, the covariance matrix of the dataset matrix is diagonalized. This indicates that the dataset matrix features are not correlated, i.e., the basis axes are in the direction of the spread.
In order to perform PCA, we have to acquire a new transformed dataset matrix $\mathbf Z$ so that the covariance matrix $\mathbf {C_{Z}}$ of $\mathbf Z$ is (mostly) diagonalized. It requires
The covariance matrix $\mathbf {C_{Z}}$ is diagonalized if choose $\mathbf P$ so that it diagonalizes the matrix $(\mathbf X^T\mathbf X)$, i.e.,
Notice that $\mathbf P \mathbf P^T = \mathbf I$.
The component $\mathbf Z_i$ that has the largest variance is the first principal component. In most cases, we arrange the transformation $\mathbf P$ so that the first component $\mathbf Z_1$ is the first principal component. The variance of the first principle component is
\begin{equation} \frac{1}{n}\sum_{n=1}^N z_{n1}, \end{equation}
which is maximized.
Relation between PCA and Linear Regression
Why is $\mathbf P$ related to the covariance matrix, again?
The idea behind PCA is to find the mixings of the features so that the new axes give us the largest variance. We could invent an algorithm to explore the whole parameter space (shifts, scales, and rotations of the axes) and find the best parameters. However, this would be rather inefficient. The proof Eq. ($\ref{eqnpcacovmatdiag}$) provides theoretical support.
Why the covariance matrix?
The covariance matrix for ${\mathbf X_i}$ is
Recall that
We shift our data to be centered by the mean. If we always prepare our data so that the mean is 0, i.e., $\bar X_i = 0$, the covariance could be written as
If we were to calculate the covariance matrix for the new features ${Z_i}$,
From this, we could easily derive that a linear transformation for the covariance matrice indicates a linear transformation of the features.
If we were to require the covariance matrix for $\mathbf{C_{Z}}$ to be diagonalized, we would have found that $\mathbf{C_{Z}}$ is composed of the eigenvalues. We could deliberately remove those eigenvectors with small eigenvalues and keep those that help us the most with the variance of the data.
To summarize, we could find the principal components by finding the eigenvectors of the covariance matrix:
 Standardize the data: shift the data so that the mean is 0.
 Calculate the covariance matrix of the features, $\mathbf {C_{X}}$.
 Calculate the eigenvectors of the covariance matrix.
 Construct the transformation matrix, $\mathbf P$.
 Transform the data using $\mathbf U$, $\mathbf Z = \mathbf X \mathbf P$.
 Decide the number of principal components $p$ and choose the first $p$ columns of $\mathbf Z$.
SVD and PCA
It is straight forward to prove that the principal components are he orthonormal basis that spans the row space of $\mathbf X$ in our setup ^{1}.

Shlens, J. (2003). A Tutorial on Principal Component Analysis. ↩