The correlation coefficient is also known as Pearson’s product-moment correlation coefficient.

Review of Standard Deviation

For a series of data A, we have the standard deviation

$$
\sigma_A = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (a_i - \bar A)^2},
$$

where $n$ is the number of elements in series A and $\bar A$ is their mean.

Standard deviation is easy to understand. It is essentially the root-mean-square distance between the data points and their average value. In this article, we will take another point of view.

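Now imagine we have two series $(a_i - \bar A)$ and $(a_j - \bar A)$. The squared geometric mean of a pair of elements with $i=j$ is

$$
\left( \sqrt{(a_i - \bar A)(a_j - \bar A)} \right)^2 = (a_i - \bar A)^2.
$$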

From this point of view, the standard deviation is in fact a measure of the mean of the geometric means of the deviations of the elements.
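As a minimal sketch of this review, the standard deviation can be computed as the square root of the mean squared deviation. The sketch assumes the population form (dividing by $n$, the number of elements); the function name `population_std` and the data values are made up for illustration.

```python
import math
import statistics


def population_std(series):
    """Square root of the mean squared deviation from the mean (divides by n)."""
    n = len(series)
    mean = sum(series) / n
    return math.sqrt(sum((a - mean) ** 2 for a in series) / n)


# Made-up illustrative values, not data from the article.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sigma = population_std(data)

# Cross-check against the standard library's population standard deviation.
assert math.isclose(sigma, statistics.pstdev(data))
```

Each term `(a - mean) ** 2` is the squared geometric mean of a deviation paired with itself, which is the point of view taken above.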

Standard Deviation of the Sample

Generalize Standard Deviation to Covariances

Knowledge card: Covariance matrix.

Similarly, for two series A and B of the same length, we can define a quantity that measures the geometric mean of the corresponding deviations of the two series,

$$
\sigma_{A,B}^2 = \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar A)(b_i - \bar B),
$$

which is named the covariance of A and B, i.e., $\text{Cov} ({A,B})$.

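It is easy to show that

$$
\text{Cov}(A,B) = E[AB] - \bar A \bar B,
$$

where $E[AB]$ is the mean of the products $a_i b_i$.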
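A short numerical sketch of the covariance, assuming the population form (dividing by $n$). It also checks the standard shortcut $\text{Cov}(A,B) = E[AB] - \bar A \bar B$; the helper names and the data are hypothetical.

```python
import math


def mean(series):
    return sum(series) / len(series)


def covariance(a, b):
    """Population covariance: mean of the products of paired deviations."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)


# Made-up illustrative series.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]

cov_ab = covariance(a, b)

# Shortcut: mean of the products minus the product of the means.
shortcut = mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)
assert math.isclose(cov_ab, shortcut)

# With B = A the covariance reduces to the variance of A.
assert math.isclose(covariance(a, a), mean([x * x for x in a]) - mean(a) ** 2)
```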

At first glance, the square in the definition seems to serve only a notational purpose at this point.

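Meanwhile, using this idea of the mean of geometric means, we could easily generalize it to the covariance of three series,

$$
\sigma_{A,B,C}^3 = \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar A)(b_i - \bar B)(c_i - \bar C),
$$

or even arbitrary N series,

$$
\sigma_{A_1,A_2,\cdots,A_N}^N = \frac{1}{n}\sum_{i=1}^{n} \prod_{k=1}^{N} (a_{k,i} - \bar A_k),
$$

which should be called the covariance of all the N series, $\mathrm{Cov} ({A_1, A_2,\cdots, A_N })$.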
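The generalization to N series could be sketched as follows. This assumes the N-series quantity is the mean, over the index, of the product of every series' deviation from its own mean; the function name `joint_deviation_mean` is hypothetical and this is not a standard library routine.

```python
import math


def joint_deviation_mean(*series):
    """Mean over the index of the product of every series' deviation."""
    n = len(series[0])
    if any(len(s) != n for s in series):
        raise ValueError("all series must have the same length")
    means = [sum(s) / n for s in series]
    total = 0.0
    for i in range(n):
        # Product of the i-th deviations across all series.
        total += math.prod(s[i] - m for s, m in zip(series, means))
    return total / n


# Made-up illustrative series.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]

two_series = joint_deviation_mean(a, b)       # ordinary covariance of a and b
three_series = joint_deviation_mean(a, a, a)  # third central moment of a
```

For two series this reduces to the ordinary covariance, which is one way to see why the pairwise case is the one used in practice.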

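Of course, we do not use these, since we can easily build a covariance matrix that collects the covariances between every pair of variables, for example,

$$
\mathbf C = \begin{pmatrix}
\text{Cov}(A,A) & \text{Cov}(A,B) \\
\text{Cov}(B,A) & \text{Cov}(B,B)
\end{pmatrix}.
$$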
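A minimal sketch of building such a matrix from pairwise covariances, again assuming the population form (dividing by $n$). The function names and the three series are made up for illustration.

```python
def covariance(a, b):
    """Population covariance of two equal-length series."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / n


def covariance_matrix(*series):
    """Entry (j, k) is the covariance between series j and series k."""
    return [[covariance(s, t) for t in series] for s in series]


# Made-up illustrative series.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]
c = [4.0, 3.0, 2.0, 1.0]

matrix = covariance_matrix(a, b, c)

# The matrix is symmetric and its diagonal holds the variances.
for j in range(3):
    for k in range(3):
        assert matrix[j][k] == matrix[k][j]
```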

Covariance measures the correlation between the two series. To see this, first take two identical series, A = B, which leads to $\sigma_{A,B} = \sigma_{A}$. Next, suppose we have two series in completely opposite phase,

| index | A | B |
|---|---|---|
| 1 | 1 | -1 |
| 2 | -1 | 1 |
| 3 | 1 | -1 |
| 4 | -1 | 1 |
| 5 | 1 | -1 |
| 6 | -1 | 1 |
| 7 | 1 | -1 |

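we have $\text{Cov}(A,B) = -48/49 \approx -0.98$, close to $-1$ but not exactly, because the odd length leaves the means at $\pm 1/7$ rather than zero. The negative sign tells us that the series are anti-correlated.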
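The value for the two opposite-phase series can be checked with exact rational arithmetic. This sketch assumes the population covariance (dividing by $n = 7$). Because the series has odd length, the means are $\pm 1/7$ rather than zero, so the result is close to, but not exactly, $-1$.

```python
from fractions import Fraction

# The two opposite-phase series from the table.
A = [1, -1, 1, -1, 1, -1, 1]
B = [-a for a in A]


def covariance(a, b):
    """Population covariance using exact fractions."""
    n = len(a)
    a_bar = Fraction(sum(a), n)
    b_bar = Fraction(sum(b), n)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / n


cov_ab = covariance(A, B)
print(cov_ab)         # -48/49
print(float(cov_ab))  # roughly -0.98
```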

The covariance matrix is also known as the dispersion matrix.

Correlation Coefficient

However, the value of the covariance depends on the standard deviation of each series, which makes it hard to judge how strong the correlation is from the covariance alone.

The obvious normalization factor is the product of the standard deviations of the two series, $\sigma_A$ and $\sigma_B$, i.e.,

$$
\rho_{A,B} = \frac{\text{Cov}(A,B)}{\sigma_A \sigma_B}.
$$

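The geometric mean view of the correlation coefficient $\rho_{A,B}$ is

$$
\rho_{A,B} = \frac{\text{Cov}(A,B)}{\sqrt{\sigma_A^2 \sigma_B^2}},
$$

where the denominator is the geometric mean of the two variances, so the normalization is some kind of geometric mean of the geometric mean of each series.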
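A minimal sketch of the correlation coefficient as the covariance divided by the geometric mean of the two variances, assuming population statistics throughout. The function name `pearson` is hypothetical. The example reuses the opposite-phase series from the covariance section and shows that rescaling one series changes the covariance but not the coefficient.

```python
import math


def pearson(a, b):
    """Covariance divided by the geometric mean of the two variances."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / n
    var_a = sum((x - a_bar) ** 2 for x in a) / n
    var_b = sum((y - b_bar) ** 2 for y in b) / n
    # Undefined when either series is constant (zero variance).
    return cov / math.sqrt(var_a * var_b)


A = [1, -1, 1, -1, 1, -1, 1]
B = [-a for a in A]

# Perfectly anti-correlated series give a coefficient of -1.
assert math.isclose(pearson(A, B), -1.0)

# Multiplying B by 10 scales the covariance by 10, but the coefficient stays at -1.
assert math.isclose(pearson(A, [10 * b for b in B]), -1.0)
```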