Support Vector Machines - Part 1: Hyperplane Classifiers

These notes are part 1 of 6 lectures I gave at the University of Kiev in April 2023, for the Machine Learning Algorithms course.

  1. Support Vector Machines - Part 1: Hyperplane Classifiers
  2. Support Vector Machines - Part 2: SVCs and SVMs

Additional lectures in this series were taught by Petra Posedel Šimović from the University of Zagreb and Ramón Prat Casanovas from the Autonomous University of Barcelona.

These lectures are heavily based on 2006 - Hastie et al - Elements of Statistical Learning, which I can highly recommend.

Separating Hyperplane Classifiers

Separating hyperplane classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

Geometry of Hyperplanes

Let

$$f(x) = x^T\beta + \beta_0,$$

with $\beta, x \in \mathbb{R}^p$ and $\beta_0$ a scalar offset or intercept. Consider the hyperplane / affine set $L$ defined by

$$L := \{x \in \mathbb{R}^p : f(x) = 0\}.$$

We focus now on three properties of hyperplanes that are essential to our construction of hyperplane classifiers.

Property 1: The normal to the surface $L$ is given by $\beta^* := \frac{1}{\|\beta\|}\beta$.

Proof: For any two points $x_1, x_2 \in L$,

$$\beta^T(x_1 - x_2) = -\beta_0 - (-\beta_0) = 0,$$

where $(x_1 - x_2)$ is parallel to the surface $L$. Hence $\beta$, and therefore $\beta^*$, is orthogonal to every vector lying in $L$.

Property 2: The signed distance from any point $x \in \mathbb{R}^p$ to the hyperplane $L$ is given by $\frac{1}{\|\beta\|}f(x)$, i.e., $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x) = 0$.

Proof: For any $x \in \mathbb{R}^p$ and $x_0 \in L$, $(x - x_0)$ represents a vector pointing from the hyperplane to $x$. The signed distance from $x$ to the hyperplane is then the projection of this vector onto the unit normal of $L$. Thus, using $\beta^T x_0 = -\beta_0$,

$$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\left(\beta^T x + \beta_0\right) = \frac{1}{\|\beta\|}f(x).$$

Property 3: All points on one side of the hyperplane satisfy $f(x) > 0$, whereas all points on the other side of the hyperplane satisfy $f(x) < 0$.

Proof: Follows directly from Property 2.
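As a quick numerical illustration of these three properties, here is a minimal sketch with an arbitrarily chosen hyperplane in $\mathbb{R}^2$ (the specific numbers are mine, purely for illustration):

```python
import numpy as np

# An arbitrary hyperplane in R^2: f(x) = x^T beta + beta_0 = 0
beta = np.array([3.0, 4.0])
beta_0 = -5.0

def f(x):
    return x @ beta + beta_0

def signed_distance(x):
    # Property 2: signed distance from x to the hyperplane
    return f(x) / np.linalg.norm(beta)

x_on_plane = np.array([-1.0, 2.0])     # f = -3 + 8 - 5 = 0, so this point lies on L
x_above = np.array([4.0, 2.0])         # f = 12 + 8 - 5 = 15 > 0

print(np.isclose(f(x_on_plane), 0.0))  # True: the point lies on the hyperplane
print(signed_distance(x_above))        # 3.0: the point is 3 units away, on the positive side
print(np.sign(f(x_above)))             # 1.0: the sign identifies the side (Property 3)
```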

The Separating Hyperplane Classifier

Definition: (Linear Separability) Consider two sets $X_1, X_2 \subset \mathbb{R}^p$. These sets are said to be linearly separable if there exist $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$ such that $f(x) = x^T\beta + \beta_0$ satisfies $f(x) > 0$ for every $x \in X_1$ and $f(x) < 0$ for every $x \in X_2$.

Now consider a set of $N$ pairs of training data,

$$D = \left\{(x_i, y_i) \;\middle|\; x_i \in \mathbb{R}^p,\; y_i \in \{-1, 1\}\right\}_{i=1}^{N},$$

and a classification rule induced by $f(x)$,

$$G(x) = \operatorname{sign}\left[x^T\beta + \beta_0\right].$$

If the classes are linearly separable, there exists a hyperplane that separates them from each other, assigning a specific sign to all the data points on one side and the opposite sign to all the data points on the other side.

Thus, there exists an $f(x) = x^T\beta + \beta_0$ with the property $y_i f(x_i) > 0$ for all $i$ (this holds exactly when all the data is correctly classified by the classifier induced by $f$).
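For example, given a candidate pair $(\beta, \beta_0)$, checking whether it separates a labelled data set reduces to checking this sign condition. A minimal sketch (X, y, beta, and beta_0 are placeholders to be supplied by the caller):

```python
import numpy as np

def separates(X, y, beta, beta_0):
    """Return True if the hyperplane x^T beta + beta_0 = 0 linearly separates the
    labelled data, i.e. y_i * f(x_i) > 0 for every training point (y_i in {-1, +1})."""
    return bool(np.all(y * (X @ beta + beta_0) > 0))
```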

We can attempt to exploit this property to help find $f$ by setting up the minimization problem

$$\min_{\beta, \beta_0} \; -\sum_{i \in \mathcal{M}} y_i\left(x_i^T\beta + \beta_0\right),$$

where $\mathcal{M}$ is the set of points misclassified by the current $\beta$ and $\beta_0$. In fact, this was one of the earliest approaches to the problem of finding separating hyperplanes, known as Rosenblatt's perceptron learning algorithm.
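This criterion is typically minimized by stochastic gradient descent, stepping on one misclassified point at a time. Below is a minimal sketch of that update rule; the constant learning rate rho and the fixed iteration cap are assumptions of convenience, not part of the original algorithm:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """Stochastic gradient descent on the perceptron criterion
    -sum_{i in M} y_i (x_i^T beta + beta_0), with y_i in {-1, +1}.
    Converges to *a* separating hyperplane only if the classes are linearly separable."""
    n, p = X.shape
    beta, beta_0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        n_misclassified = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ beta + beta_0) <= 0:   # x_i is misclassified (or lies on L)
                beta += rho * y_i * x_i            # gradient of the criterion w.r.t. beta is -y_i x_i
                beta_0 += rho * y_i                # gradient w.r.t. beta_0 is -y_i
                n_misclassified += 1
        if n_misclassified == 0:                   # every point is correctly classified
            break
    return beta, beta_0
```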

This approach has several issues, but the primary one is that it has many solutions, and which one is found depends on the starting point. This issue can be overcome by adding additional constraints to the separating hyperplane.

The Optimal Separating Hyperplane Classifier

The optimal separating hyperplane separates the two classes AND maximizes the distance to the closest point from either class. This provides a unique solution to the separating hyperplane problem and results in a better classifier.

So we consider maximizing the margin, $M \in \mathbb{R}_+$, such that each point is at least a distance $M$ away from the separating hyperplane,

$$\max_{\beta, \beta_0} \; M \quad \text{s.t.} \quad y_i\left[\frac{1}{\|\beta\|}\left(x_i^T\beta + \beta_0\right)\right] \geq M, \quad \text{for } i = 1, \dots, N.$$

We have already shown that the term inside the square brackets is the signed distance from $x_i$ to the hyperplane. Taking the product with $y_i$ ensures that each point must lie on the correct side of the hyperplane in order to satisfy the constraint.

It should be clear that for any $\beta$ and $\beta_0$ that satisfy this inequality, any positively scaled multiple of them will satisfy the inequality as well. To remove this extra degree of freedom we can express the margin in terms of the norm of $\beta$ by setting $\|\beta\| = 1/M$.
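Concretely, multiplying the constraint through by $\|\beta\| > 0$ and then imposing $\|\beta\| = 1/M$ gives

$$y_i\left[\frac{1}{\|\beta\|}\left(x_i^T\beta + \beta_0\right)\right] \geq M \quad\Longleftrightarrow\quad y_i\left(x_i^T\beta + \beta_0\right) \geq M\|\beta\| = 1,$$

so that maximizing the margin $M$ is equivalent to maximizing $1/\|\beta\|$.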

Our problem becomes

$$\max_{\beta, \beta_0} \; \frac{1}{\|\beta\|} \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) \geq 1, \quad \text{for } i = 1, \dots, N,$$

or equivalently,

$$\min_{\beta, \beta_0} \; \|\beta\| \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) \geq 1, \quad \text{for } i = 1, \dots, N.$$

Thus, we are minimizing $\|\beta\|$ while ensuring all the points are at least $\frac{1}{\|\beta\|}$ away from the hyperplane.

The objective function is convex and the constraints are linear inequalities. If they were linear equalities, we would use the method of Lagrange multipliers. Instead, we rely on its generalization, often called the Karush-Kuhn-Tucker approach.

Computing the Optimal Separating Hyperplane Classifier

First, we re-write our problem such that the derivatives are slightly neater and the constraints are in the standard form,

$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) - 1 \geq 0, \quad \text{for } i = 1, \dots, N.$$
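This is a quadratic program with linear inequality constraints, so it can be handed directly to off-the-shelf convex-optimization software. The following minimal sketch uses cvxpy, which is simply my choice of solver interface here; X is an $N \times p$ array and y a vector of labels in $\{-1, 1\}$ supplied by the caller:

```python
import cvxpy as cp

def optimal_separating_hyperplane(X, y):
    """Solve  min 1/2 ||beta||^2  s.t.  y_i (x_i^T beta + beta_0) >= 1  for all i.
    Assumes the classes are linearly separable; otherwise the problem is infeasible."""
    n, p = X.shape
    beta = cp.Variable(p)
    beta_0 = cp.Variable()
    constraints = [cp.multiply(y, X @ beta + beta_0) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
    problem.solve()
    return beta.value, beta_0.value
```

The maximal margin is then recovered as $M = 1/\|\beta\|$, and the Lagrange multipliers of the margin constraints are also available after solving (in cvxpy, via the constraint's dual_value attribute), which ties in with the discussion below.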

This problem can be solved using standard software, as sketched above (although in practice it is usually transformed into its dual formulation first). However, to gain insight into the solution we will now examine the Lagrangian,

$$L(l, \alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N} \alpha_i\left[y_i\left(x_i^T\beta + \beta_0\right) - 1\right],$$

with $l = (\beta, \beta_0)$ a vector containing the variables w.r.t. which we want to minimize the function and $\alpha = (\alpha_1, \dots, \alpha_N)$ the vector of Lagrange multipliers.

For the vector $l = (\beta, \beta_0)$ to be a saddle point of the Lagrangian above as well as a solution to the original minimization problem, it needs to satisfy a set of conditions known as the Karush-Kuhn-Tucker conditions (see Theorem 12.1 in 2006 - Nocedal Wright - Numerical Optimization), which uniquely characterize the solution. Three of these conditions are relevant to our understanding of optimal hyperplane classifiers.

The first is obtained by setting the partial derivative of $L$ w.r.t. $\beta$ equal to zero (a necessary condition for a saddle point), which provides

$$\beta = \sum_{i=1}^{N} \alpha_i y_i x_i.$$

The further two conditions that are relevant to us are

$$\alpha_i \geq 0 \qquad \text{and} \qquad \alpha_i\left[y_i\left(x_i^T\beta + \beta_0\right) - 1\right] = 0$$

for $i = 1, \dots, N$.

For each $i$, the final condition can be satisfied in one of two ways:

  1. Either $\alpha_i = 0$, in which case the constraint is called inactive, as it plays no role in determining the solution $\beta$.
  2. $\alpha_i > 0$ and $y_i\left(x_i^T\beta + \beta_0\right) = 1$, in which case the constraint is called active.

In the second case, the data point $x_i$ lies exactly on the boundary of the slab defined by the optimal hyperplane and our margin. Such a point is known as a support point or support vector, and it is clear from our first condition that it is only these points that dictate our solution! Note, however, that finding these points requires using the entire data set.
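To see this support-vector structure concretely, one option is to fit a linear SVM with a very large cost parameter C, which closely approximates the hard-margin classifier described here (scikit-learn only implements the soft-margin formulation, covered in Part 2). A minimal sketch on toy data of my own choosing:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2 (arbitrary, purely for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-3.0, -3.0], size=(20, 2)),
               rng.normal(loc=[3.0, 3.0], size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A hard margin is approximated with a very large C.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

# Only the support vectors carry non-zero multipliers alpha_i; for a linear kernel
# scikit-learn stores the products y_i * alpha_i for those points in dual_coef_.
print(clf.support_vectors_.shape)    # (n_support, 2): typically far fewer rows than X

# Reconstruct beta = sum_i alpha_i y_i x_i from the support vectors alone.
beta = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta, clf.coef_))  # True: beta is dictated by the support vectors only
```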