# Bias-Variance

## Bias and Variance

Suppose we have a perfect model $f(X)$ that captures the true relationship in the dataset $(X, Y)$ up to some irreducible error $\epsilon$,

$$Y = f(X) + \epsilon. \label{dataset-using-true-model}$$

On the other hand, we could build another model using a specific method such as k-nearest neighbors, which we denote as $k(X)$.

Why are we talking about both the perfect model and a model built with a specific method?

The perfect model $f(X)$ is our ultimate goal, while the model $k(X)$ built with a specific method is our attempt to approach that goal.

What is bias? It measures the deficit between $k(X)$ and the perfect model $f(X)$,

$$\operatorname{Bias}[k(X)] = \operatorname{E}[k(X)] - f(X).$$

Zero bias means we are matching the perfect model on average, $\operatorname{E}[k(X)] = f(X)$.

What is variance? Variance is about the model itself:

$$\operatorname{Var}[k(X)] = \operatorname{E}\!\left[\left(k(X) - \operatorname{E}[k(X)]\right)^2\right].$$

The larger the variance, the more wiggly the model is.
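As a numerical sketch of these two quantities, we can repeatedly draw datasets from $Y = f(X) + \epsilon$, fit a k-nearest-neighbors estimate at a single point, and measure how the predictions behave across draws. The target function, sample size, noise level, and the point $x_0$ below are all made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # the "perfect" model; assumed known here only so we can measure bias
    return np.sin(x)

def knn_predict(x0, X, Y, k):
    # average the targets of the k nearest training points to x0
    idx = np.argsort(np.abs(X - x0))[:k]
    return Y[idx].mean()

x0, k, sigma = 1.0, 5, 0.3
preds = []
for _ in range(2000):
    X = rng.uniform(0, 2, 50)
    Y = f(X) + rng.normal(0, sigma, 50)   # Y = f(X) + eps
    preds.append(knn_predict(x0, X, Y, k))
preds = np.array(preds)

bias = preds.mean() - f(x0)   # E[k(X)] - f(X)
variance = preds.var()        # E[(k - E[k])^2]
print(bias, variance)
```

The bias comes out close to zero (averaging nearby neighbors tracks $f$ well here), while the variance stays positive because each fresh dataset moves the prediction around.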

## Mean Square Error

Bias measures the deficit between the specific model and the perfect model. How do we measure the deficit between the specific model and the actual data points? We use the Mean Squared Error (MSE).

The Mean Squared Error (MSE) is defined as

$$\operatorname{MSE} = \operatorname{E}\!\left[\left(Y - k(X)\right)^2\right].$$

A straightforward decomposition using equation ($\ref{dataset-using-true-model}$) shows that we have three components in our MSE. To make the equations look nice, we drop the $(X)$, hence $k$ in the equation means $k(X)$:

$$\operatorname{MSE} = \operatorname{E}\!\left[(Y - k)^2\right] = \underbrace{\left(\operatorname{E}[k] - f\right)^2}_{\text{Bias}^2} + \underbrace{\operatorname{E}\!\left[\left(k - \operatorname{E}[k]\right)^2\right]}_{\text{Variance}} + \underbrace{\operatorname{E}\!\left[\epsilon^2\right]}_{\text{Irreducible Error}}.$$

We are left with this Irreducible Error term because the mean of the irreducible error is required to be zero, $\operatorname{E}(\epsilon)=0$, which makes the cross terms involving $\epsilon$ vanish. If this mean were not zero, the model $f(X)$ would not be perfect.
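The decomposition can be checked numerically: simulate many datasets, record the k-nearest-neighbor prediction at a point and its squared error against a fresh observation there, and compare the averaged MSE to $\text{Bias}^2 + \text{Variance} + \sigma^2$. All the concrete settings (the sine target, $\sigma$, $n$, $k$, $x_0$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)          # assumed "perfect" model, for illustration
x0, sigma, n, k = 1.0, 0.3, 50, 5

preds, sq_errs = [], []
for _ in range(20000):
    X = rng.uniform(0, 2, n)
    Y = f(X) + rng.normal(0, sigma, n)     # Y = f(X) + eps
    idx = np.argsort(np.abs(X - x0))[:k]
    pred = Y[idx].mean()                   # k(x0)
    y_new = f(x0) + rng.normal(0, sigma)   # a fresh observation at x0
    preds.append(pred)
    sq_errs.append((y_new - pred) ** 2)

preds = np.array(preds)
mse = np.mean(sq_errs)
bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
# the two sides of MSE = Bias^2 + Variance + sigma^2 should nearly agree
print(mse, bias2 + var + sigma**2)
```

Up to Monte Carlo noise, the measured MSE matches the sum of the three components, with $\operatorname{E}[\epsilon^2] = \sigma^2$ playing the role of the irreducible error.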

## Bias-Variance Tradeoff

The more parameters we introduce into the model, the more likely we are to reduce the bias. However, at some point, the more complexity we add, the more wiggles the model will have, and thus the larger the variance becomes.
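One way to see this tradeoff is to fit polynomials of increasing degree to noisy samples of a fixed curve and track the bias and variance of the prediction at one point across many simulated datasets. The ground-truth curve, noise level, degrees, and evaluation point below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)       # hypothetical ground truth
x0, sigma, n = 0.5, 0.3, 30
xs = np.linspace(0, 2, n)

def stats_for_degree(d, trials=2000):
    # squared bias and variance of the degree-d fit's prediction at x0
    preds = []
    for _ in range(trials):
        ys = f(xs) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(xs, ys, d)        # fit a degree-d polynomial
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    return bias2, preds.var()

results = {d: stats_for_degree(d) for d in (1, 3, 9)}
for d, (b2, v) in results.items():
    print(d, b2, v)
```

A straight line (degree 1) carries large bias but small variance; the degree-9 fit drives the bias toward zero while its variance grows, which is the tradeoff in miniature.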

**Free Parameters**

Fermi once said,

> I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

There is a nice story about Dyson and Fermi behind this.