# Chi-square Correlation Test for Nominal Data

In this article, we will discuss the chi-square correlation test for detecting correlations between two series.

## Steps

- Find out all the possible values of the two nominal series A and B;
- Count the co-occurrences of the combinations (A, B);
- Calculate the expected co-occurrences of the combinations (A, B);
- Calculate chi-square;
- Determine whether the hypothesis can be rejected.

## Define the Series

Suppose we are analyzing two series A and B. Series A can take values $a_1$ and $a_2$, while series B can take values $b_1$ , $b_2$ and $b_3$.

As an example, we will use the following A and B series for our calculations in this article.

index | A | B |
---|---|---|

1 | a1 | b2 |

2 | a1 | b2 |

3 | a1 | b1 |

4 | a2 | b1 |

5 | a2 | b3 |

6 | a2 | b2 |

7 | a1 | b2 |

8 | a2 | b2 |

## Count Co-ocurrences

To analyze correlations between the two series, we need to look at whether the values of series A and those of series B would occur together. For example, we would like to know the possibility of values for B if we have $a_1$ occurred.

One of the extreme examples is that A and B are exactly the same. In this case, we would know that the value for B is always the same as A for each row. Then we would know that all the possible combinations of (A, B) are

- (a1, a1)
- (a2, a2)

We could construct the occurrence table.

a1 | a2 | |
---|---|---|

a1 | number of occurrences | 0 |

a2 | 0 | number of occurrences |

Now we construct a contigency table to denote the ocurrences of the values, (A, B).

a1 | a2 | |
---|---|---|

b1 | 1 | 1 |

b2 | 3 | 2 |

b3 | 0 | 1 |

where the cells are filled with the number of occurrences of the corresponding combinations. For example, the combination (a1, b1) occurred once, thus 1 in the first row first column.

This table records the **observed frequencies**, which we denote as table O and each cell is denoted as $o_{ij}$.

This table tells us about the possible correlations already. Imagine we have two columns that are exactly the same, we would have a table that have large number of occurences on the diagonal elements.

However, those numbers in the table depends on the number of rows that we have in our original table. To find the actual correlation, we need to normalize it. We could simply divided everything by the total number of rows in the original table. But Pearson had a better idea.

Pearson’s chi-square correlation is a smart idea.

First of all, we define an expectation table E. Each element of E is calculated as

This $e_{ij}$ serves as the average occurance of each combinations of $a_i$ and $b_j$. If we have $a_i$ in each row but only one $b_j$ occurrences, the average is 1. This mean that given $b_j$ we would definitely only see one $a_i$.

When we have multiple $a_i$ and $b_j$, this average still works. Suppose we have $a_1$ occurred 4 times in total and we have a total of 8 rows. Assuming that this $a_1$ will appear randomly in the rows, what is the average probability to see this $a_1$ if we choose a random row? It is $4/8=0.5$. Then we will expect $1\times 0.5=0.5$ occurrences $a_1$ for one occurance of $b_2$. If we have 3 occurances of $b_2$, we would expect to see $3\times 0.5==1.5$ occurrences of $a_1$.

This is why it is treated as expected frequencies of each combinations.

As a side note, suppose we have A and B exactly the same, and they all have the same values, a1.

index | A | B |
---|---|---|

1 | a1 | b1 |

2 | a1 | b1 |

3 | a1 | b1 |

4 | a1 | b1 |

5 | a1 | b1 |

6 | a1 | b1 |

7 | a1 | b1 |

8 | a1 | b1 |

Then we expect $e_{11} = n$, where $n=8$ is the number of rows.

Now if we compare the original table with this one,

we get the deviation from the expected table. With a few little twitches, we would define

If A and B are the same and each possible values occurred m times, then we would have

Then we get

Then we say this chi-square analysis doesn’t reject our hypothesis that these two columns are correlated.

## How to Use the Number Chi-square

The final question is how to use the result. We usually have a threshold $\chi_0^2$. Whenever our calculated value is larger than this one, we decide that our analysis reject the hypothesis that the two columns are correlated. This value $\chi_0^2$ can be found in the textbooks.