Anscombe’s Quartet

Anscombe’s quartet is a classic example that shows why visualizing data is both important and convenient.

Anscombe’s quartet consists of four datasets, each with 11 (x, y) pairs. The values of each dataset are shown below.

x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

x2 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]

x3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]

These datasets look quite different, but their summary statistics are nearly identical.

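  1. The means of x and y are 9.0 and approximately 7.5, respectively, in every dataset.
  2. The population standard deviations of x and y are approximately 3.162 and 1.937, respectively (variances of about 10.0 and 3.75).
  3. They even fit the same regression line $y = 0.5 x + 3$ (to about two decimal places) with nearly the same least-squares loss.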
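
The three claims above can be checked with a few lines of standard-library Python. This is a minimal sketch: the `quartet` dictionary and the helper `fit_line` are names introduced here for illustration, and the line is fitted with the textbook ordinary least-squares formulas rather than any particular library routine.

```python
from statistics import mean, pstdev

# Anscombe's quartet as listed above: name -> (x values, y values).
quartet = {
    "I": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (slope, intercept, sum of squared residuals) of the OLS line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, sse


for name, (x, y) in quartet.items():
    slope, intercept, sse = fit_line(x, y)
    print(
        f"{name:>3}: mean=({mean(x):.2f}, {mean(y):.2f}) "
        f"pstdev=({pstdev(x):.3f}, {pstdev(y):.3f}) "
        f"fit: y = {slope:.3f} x + {intercept:.3f}, loss={sse:.2f}"
    )
```

The four printed rows agree to roughly two decimal places, which is the sense in which the datasets are statistically similar.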

However, we immediately spot the differences between them when we visualize them in a coordinate system.
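
One way to draw such a picture is a 2x2 grid of scatter plots with the shared line $y = 0.5 x + 3$ overlaid. The sketch below assumes matplotlib is installed; the function name `plot_quartet` and the `quartet` dictionary are illustrative choices, not part of the original example.

```python
# Anscombe's quartet as listed above: name -> (x values, y values).
quartet = {
    "I": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def plot_quartet(data=quartet):
    """Scatter-plot each dataset in its own panel and overlay y = 0.5 x + 3."""
    # Imported here so the data above stays usable without matplotlib.
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
    for ax, (name, (x, y)) in zip(axes.flat, data.items()):
        ax.scatter(x, y)
        # The common regression line, drawn between x = 2 and x = 20.
        ax.plot([2, 20], [0.5 * 2 + 3, 0.5 * 20 + 3], color="tab:red")
        ax.set_title(f"Dataset {name}")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    return fig
```

Calling `plot_quartet().savefig("quartet.png")` writes the figure to a file (the file name is arbitrary). Sharing the axes across panels makes the four shapes directly comparable: a noisy line, a curve, a line with one outlier, and a vertical stack with one far-away point.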

What are the differences between the datasets?

There are many ways to tell them apart. A very simple one is to compute percentiles. For example, the medians of the x data are (9.0, 9.0, 9.0, 8.0), and the medians of the y data are (7.58, 8.14, 7.11, 7.04).

Here we simply draw box plots. They show that some of the data is quite skewed.
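
The numbers behind a box plot can also be computed directly. The following sketch uses the standard library only; `box_summary` is a hypothetical helper, the quartiles use linear interpolation (`method="inclusive"`), and outliers are flagged with the conventional 1.5 × IQR rule, which may differ slightly from what a given plotting tool does.

```python
from statistics import mean, median, quantiles

# Anscombe's quartet as listed above: name -> (x values, y values).
quartet = {
    "I": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}


def box_summary(values):
    """Five-number summary, mean-median gap, and 1.5 * IQR outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
        # A gap far from zero hints at skew or at a few extreme points.
        "mean_minus_median": mean(values) - median(values),
        "outliers": sorted(v for v in values if v < low or v > high),
    }


for name, (x, y) in quartet.items():
    for label, values in (("x", x), ("y", y)):
        print(name, label, box_summary(values))
```

Unlike the mean and the standard deviation, these summaries differ from dataset to dataset, for example by flagging the single large y value in dataset III and the single large x value in dataset IV.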

There are many insights to take from this example, but the most important one is that we should plot the data whenever we do exploratory data analysis (EDA). Even when visualization is not an option at the moment, we can still calculate more statistical measures. For example, we calculate the mean and the median together, as two measures of central tendency, to get a feeling for how skewed the data is.