The code used in this article can be found in this repo.

Suppose we have a dataset in $\{0,1\}^{10}$: it has 10 records, and each record belongs to one of 2 possible classes, $\{0,1\}$.

The first example we investigate is a pure 0 dataset.

| object |
|--------|
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |

For such an all-0 dataset, we would like to define its impurity as 0. The same holds for an all-1 dataset. For a dataset with 50% 1s and 50% 0s, we would like the impurity to reach its maximum, because of the symmetry between 0 and 1.

## Definition

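Given a dataset $\{0,1,…,d\}^n$, the Gini impurity is calculated as

$$
G = \sum_{i \in \{0,1,…,d\}} p(i)\,\bigl(1 - p(i)\bigr) = 1 - \sum_{i \in \{0,1,…,d\}} p(i)^2,
$$

where $p(i)$ is the probability of a randomly picked record being class $i$.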
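The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the code from the repo: it assumes the standard form $G = \sum_i p(i)(1 - p(i))$, estimates $p(i)$ as the fraction of records in class $i$, and the name `gini_impurity` and the two sample lists are made up for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity G = sum_i p(i) * (1 - p(i)) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # p(i) is estimated as the fraction of records that belong to class i
    return sum((c / n) * (1 - c / n) for c in counts.values())


# Hypothetical sample data: a pure dataset and a half-and-half dataset
pure = [0] * 10
mixed = [0] * 5 + [1] * 5

print(gini_impurity(pure))   # 0.0
print(gini_impurity(mixed))  # 0.5
```

The function only counts labels, so it works for any number of classes, not just $\{0,1\}$.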

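In the all-0 example above, we have two classes, $\{0,1\}$. The probabilities are

$$
p(0) = 1, \qquad p(1) = 0.
$$

The Gini impurity is

$$
G = p(0)\,\bigl(1 - p(0)\bigr) + p(1)\,\bigl(1 - p(1)\bigr) = 1 \cdot 0 + 0 \cdot 1 = 0.
$$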

## Examples

Suppose we have another dataset in which 50% of the values are 0 and 50% are 1.

| object |
|--------|
| 1 |
| 0 |
| 0 |
| 1 |
| 1 |
| 1 |
| 0 |
| 0 |
| 0 |
| 1 |

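With $p(0) = p(1) = 0.5$, the Gini impurity is

$$
G = p(0)\,\bigl(1 - p(0)\bigr) + p(1)\,\bigl(1 - p(1)\bigr) = 0.5 \cdot 0.5 + 0.5 \cdot 0.5 = 0.5.
$$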

For data with two possible values $\{0,1\}$, the maximum Gini impurity is 0.5, reached at $p(0) = p(1) = 0.5$, where each of the two terms $p(i)(1 - p(i))$ contributes 0.25. The following chart shows all the possible values of the Gini impurity for a two-value dataset.
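The two-class curve can also be checked numerically. The sketch below assumes the form $G = \sum_i p(i)(1 - p(i))$, which for two classes reduces to $2p(1-p)$ with $p = p(1)$. The function name `binary_gini` and the grid step of 0.01 are choices made for this illustration only.

```python
def binary_gini(p):
    """Gini impurity of a two-class dataset where class 1 has probability p."""
    # p(1) = p and p(0) = 1 - p, so both terms of the sum equal p * (1 - p)
    return 2 * p * (1 - p)


# Evaluate on the grid p = 0.00, 0.01, ..., 1.00
grid = [k / 100 for k in range(101)]
values = [binary_gini(p) for p in grid]

# Index of the largest impurity on the grid
best = max(range(len(grid)), key=lambda k: values[k])
print(grid[best], values[best])  # 0.5 0.5
```

The impurity is 0 at both ends of the grid ($p = 0$ and $p = 1$) and symmetric around $p = 0.5$.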

For data with three possible values, the Gini impurity can be visualized in the same kind of chart as a function of $p_1$ and $p_2$, given the constraint $p_3 = 1 - p_1 - p_2$.
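A rough way to explore the three-class case is to evaluate the impurity on a grid of $(p_1, p_2)$ pairs and let the constraint fix $p_3$. As before, this is only a sketch assuming $G = \sum_i p_i (1 - p_i)$. The name `gini_three` and the grid resolution of $1/30$ are hypothetical choices.

```python
def gini_three(p1, p2):
    """Gini impurity for three classes, with p3 fixed by p3 = 1 - p1 - p2."""
    p3 = 1 - p1 - p2
    return sum(p * (1 - p) for p in (p1, p2, p3))


# Hypothetical grid resolution: probabilities are multiples of 1/30
n = 30
# Keep only the pairs with p1 + p2 <= 1 so that p3 is non-negative
points = [(i / n, j / n) for i in range(n + 1) for j in range(n + 1 - i)]

# Grid point with the largest impurity
p1_best, p2_best = max(points, key=lambda pt: gini_three(*pt))
print(p1_best, p2_best, gini_three(p1_best, p2_best))
```

On this grid the largest value is found at $p_1 = p_2 = 1/3$, where the impurity is approximately $2/3$, while the three corners of the allowed region (one class with probability 1) give an impurity of 0.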