SVM calculates a hyperplane to separate the data points into groups according to their labels.

Hyperplane

A Few Key Concepts in SVM

Though the concept of SVM is simple, one might find the algorithm quite complicated at first glance.

Which hyperplane to choose

Suppose we have two classes in our dataset, class A and class B. Our hyperplane will separate the two classes.

The hyperplane has to make sure that most data points of class A lie on one side and most data points of class B lie on the other.

  1. Lines that go through at least one point of class A and one point of class B. Examples are shown as dashed and dot-dashed lines.
  2. Lines that go through the edge points of class A. What counts as an edge point can be defined later; for now, think of the lines shown as dotted lines.

Can we use these hyperplanes?

Those are absurdly limiting choices. However, they suggest that we might need a plane that keeps a fair distance from both classes. Our intuition tells us that a hyperplane might not work well for data points that lie close to it. In other words, we are more confident in a hyperplane that is far away from all data points.

Maybe we could require the hyperplane to be equally far away from the two classes. Here we define the distance between a hyperplane and a group of data points to be the smallest distance between the hyperplane and any data point in the group. This distance is called the margin.

Maybe this one? We calculate the distances between the data points of class A and the hyperplane, and we find the smallest distance to be $d_{A,min}$. Meanwhile, we calculate the distances between the hyperplane and the data points of class B, and denote the smallest one as $d_{B,min}$. We should find $d_{A, min} = d_{B, min}$. Requiring this balance, with the common distance as large as possible, is the max margin strategy.
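The margin comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hyperplane is taken to be the set of points $\mathbf x$ with $\mathbf n \cdot \mathbf x + \beta_0 = 0$, and the function names, the toy data, and the candidate hyperplane are all made up for this example.

```python
import math

def distance_to_hyperplane(x, normal, beta_0):
    """Perpendicular distance from point x to the hyperplane normal . x + beta_0 = 0."""
    dot = sum(n_i * x_i for n_i, x_i in zip(normal, x))
    norm = math.sqrt(sum(n_i * n_i for n_i in normal))
    return abs(dot + beta_0) / norm

def class_margin(points, normal, beta_0):
    """Smallest distance between the hyperplane and any point of one class."""
    return min(distance_to_hyperplane(p, normal, beta_0) for p in points)

# Hypothetical 2D toy data for the two classes.
class_a = [(1.0, 1.0), (0.0, 2.0), (-1.0, 1.5)]
class_b = [(3.0, 3.0), (4.0, 2.5), (5.0, 4.0)]

# Hypothetical candidate hyperplane (a line in 2D): x + y - 4 = 0.
normal = (1.0, 1.0)
beta_0 = -4.0

d_a_min = class_margin(class_a, normal, beta_0)
d_b_min = class_margin(class_b, normal, beta_0)

# For this candidate the two smallest distances coincide.
print(d_a_min, d_b_min)
```

For this particular candidate the two values agree, so it satisfies the balance condition. Equal distances alone do not single out one hyperplane; the sketch only checks the condition for a given candidate and does not search for the best one.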

We would like to take the extreme limits, again, to understand which hyperplane works the best for the classification problem.

Why does the max margin strategy work

How is the hyperplane being used

The hyperplane can be represented with a normal vector $\hat{\mathbf n}$ and a shift $\beta_0$.
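One common way to use this representation is to classify a point by the side of the hyperplane it falls on. The sketch below assumes the hyperplane is the set of points $\mathbf x$ with $\hat{\mathbf n} \cdot \mathbf x + \beta_0 = 0$. The function names, the choice of mapping the positive side to class B, and the numbers are illustrative assumptions, not something fixed by the text.

```python
import math

def decision_value(x, unit_normal, beta_0):
    """Score unit_normal . x + beta_0; zero means x lies on the hyperplane."""
    return sum(n_i * x_i for n_i, x_i in zip(unit_normal, x)) + beta_0

def classify(x, unit_normal, beta_0):
    """Label a point by its side of the hyperplane (assumed: positive side is 'B')."""
    return "B" if decision_value(x, unit_normal, beta_0) > 0 else "A"

# Hypothetical hyperplane x + y - 4 = 0, rescaled so the normal has unit length.
scale = math.sqrt(2.0)
unit_normal = (1.0 / scale, 1.0 / scale)
beta_0 = -4.0 / scale

label_low = classify((1.0, 1.0), unit_normal, beta_0)
label_high = classify((3.0, 3.0), unit_normal, beta_0)
print(label_low, label_high)
```

Because the normal vector here has unit length, the score is also the signed distance from the point to the hyperplane, so its magnitude says how far the point is from the decision boundary.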

Why?

Outliers?

Is SVM susceptible to outliers?

Nonlinearity?