The code used in this article can be found in this repo.

Suppose we have a dataset $\{0,1\}^{10}$, that is, 10 records, each of which belongs to one of 2 possible classes, $0$ or $1$.

The first example we investigate is a pure dataset in which every record is 0.

object
0
0
0
0
0
0
0
0
0
0
0
0

For such an all-0 dataset, we would like to define its impurity as 0, and likewise for an all-1 dataset. For a dataset with 50% 1s and 50% 0s, we would like the impurity to reach its maximum, because of the symmetry between 0 and 1.

Definition

Given a dataset $\{0,1,…,d\}^n$, the Gini impurity is calculated as

$$
G = \sum_{i=0}^{d} p(i) \, \bigl(1 - p(i)\bigr) = 1 - \sum_{i=0}^{d} p(i)^2,
$$

where $p(i)$ is the probability of a randomly picked record being class $i$.
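As a minimal sketch, the Gini impurity can be computed from a list of class labels, assuming the standard form $G = \sum_i p(i)(1 - p(i))$ and estimating $p(i)$ as the fraction of records in class $i$. The function name `gini_impurity` is a hypothetical choice for illustration and is not taken from the article's repo.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a sequence of class labels: sum_i p(i) * (1 - p(i)).

    p(i) is estimated as the fraction of records that carry label i.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())


# A pure dataset: every record is 0, so p(0) = 1 and p(1) = 0.
print(gini_impurity([0] * 10))  # 0.0

# A balanced two-class dataset: half of the records are 0, half are 1.
print(gini_impurity([0, 1] * 5))
```

Only the class frequencies matter here, so the order of the records has no effect on the result.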

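In the all-0 example above, we have two classes, $\{0,1\}$. The probabilities are

$$
p(0) = 1, \qquad p(1) = 0.
$$

The Gini impurity is

$$
G = p(0) \, \bigl(1 - p(0)\bigr) + p(1) \, \bigl(1 - p(1)\bigr) = 1 \cdot 0 + 0 \cdot 1 = 0.
$$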

Examples

Suppose we have another dataset in which 50% of the values are 1 and 50% are 0.

object
0
0
1
0
0
1
1
1
0
0
0
1

With $p(0) = p(1) = 0.5$, the Gini impurity is

$$
G = 0.5 \cdot (1 - 0.5) + 0.5 \cdot (1 - 0.5) = 0.5.
$$

For data with two possible values $\{0,1\}$, the maximum Gini impurity is 0.5, reached when $p(0) = p(1) = 0.5$. The following chart shows all the possible values of the Gini impurity for a two-value dataset.

Gini impurity for data with two possible values. The color indicates the Gini impurity.
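The two-value case can also be explored numerically. The sketch below sweeps $p(1)$ over a grid and records where the impurity peaks, assuming the standard form $G = \sum_i p(i)(1 - p(i))$. The helper name `gini_two_class` and the grid step of 0.01 are illustrative choices, not part of the article's code.

```python
def gini_two_class(p1):
    """Gini impurity for two classes with p(1) = p1 and p(0) = 1 - p1."""
    p0 = 1.0 - p1
    return p0 * (1.0 - p0) + p1 * (1.0 - p1)


# Evaluate the impurity on a grid of p(1) values in [0, 1] with step 0.01.
grid = [k / 100 for k in range(101)]
values = [gini_two_class(p) for p in grid]

# Pick the grid point with the largest impurity.
best_value, best_p1 = max(zip(values, grid))
print(best_p1, best_value)
```

Under this form the sweep is symmetric around $p(1) = 0.5$, where it peaks, and it drops to 0 at both ends of the grid, which matches the pure-dataset case.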

For data with three possible values, the Gini impurity can be visualized with the same kind of chart by using the constraint $p_3 = 1 - p_1 - p_2$, so that only $p_1$ and $p_2$ need to be plotted.

Gini impurity for data with three possible values. The color indicates the Gini impurity.
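The three-value case can be sketched the same way: fix $p_1$ and $p_2$ on a grid, derive $p_3 = 1 - p_1 - p_2$, and keep only the points where $p_3 \ge 0$. This again assumes the standard form $G = \sum_i p_i (1 - p_i)$. The function name `gini_three_class` and the grid resolution are hypothetical choices for illustration.

```python
def gini_three_class(p1, p2):
    """Gini impurity for three classes, with p3 fixed by p3 = 1 - p1 - p2."""
    p3 = 1.0 - p1 - p2
    return sum(p * (1.0 - p) for p in (p1, p2, p3))


steps = 30  # grid resolution: p1 and p2 take values 0, 1/30, 2/30, ..., 1

# Only visit grid points with p1 + p2 <= 1, so that p3 stays non-negative.
best_value, best_p1, best_p2 = max(
    (gini_three_class(i / steps, j / steps), i / steps, j / steps)
    for i in range(steps + 1)
    for j in range(steps + 1 - i)
)
print(best_p1, best_p2, best_value)
```

On this grid the largest impurity is found where all three classes are equally likely, which is the three-class analogue of the 50/50 split in the two-value case.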