Support Vector Machines - Part 1: Hyperplane Classifiers

These notes are part 1 of 6 lectures I gave at the University of Kiev in April 2023, for the Machine Learning Algorithms course.

  1. Support Vector Machines - Part 1: Hyperplane Classifiers
  2. Support Vector Machines - Part 2: SVCs and SVMs

Additional lectures in this series were taught by Petra Posedel Šimović from the University of Zagreb and Ramón Prat Casanovas from the Autonomous University of Barcelona.

These lectures are heavily based on 2006 - Hastie et al - Elements of Statistical Learning, which I can highly recommend.

Separating Hyperplane Classifiers

Separating hyperplane classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.

Geometry of Hyperplanes

Let

$$f(x) = x^T\beta + \beta_0,$$

with $\beta, x \in \mathbb{R}^p$ and $\beta_0$ a scalar offset or intercept. Consider the hyperplane / affine set $L$ defined by

$$L := \{x \in \mathbb{R}^p : f(x) = 0\}.$$

We focus now on three properties of hyperplanes that are essential to our construction of hyperplane classifiers.

Property 1: The normal to the surface $L$ is given by $\beta^* := \frac{1}{\|\beta\|}\beta$.

Proof: For any two points $x_1, x_2 \in L$,

$$\beta^T(x_1 - x_2) = -\beta_0 - (-\beta_0) = 0,$$

where $(x_1 - x_2)$ is parallel to the surface $L$. Hence $\beta$, and therefore $\beta^*$, is orthogonal to every vector lying in $L$.

Property 2: The signed distance from any point $x \in \mathbb{R}^p$ to the hyperplane $L$ is given by $\frac{1}{\|\beta\|}f(x)$, i.e., $f(x)$ is proportional to the signed distance from $x$ to the hyperplane defined by $f(x) = 0$.

Proof: For any $x \in \mathbb{R}^p$ and $x_0 \in L$, $(x - x_0)$ represents a vector pointing from the hyperplane to $x$. The signed distance from $x$ to the hyperplane is then the projection of this vector onto the unit normal of $L$. Thus, using $\beta^T x_0 = -\beta_0$,

$$\beta^{*T}(x - x_0) = \frac{1}{\|\beta\|}\left(\beta^T x + \beta_0\right) = \frac{1}{\|\beta\|}f(x).$$

Property 3: All points on one side of the hyperplane satisfy $f(x) > 0$, whereas all points on the other side of the hyperplane satisfy $f(x) < 0$.

Proof: Follows directly from Property 2.
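As a quick numerical illustration of these three properties, here is a minimal sketch with an arbitrarily chosen hyperplane in $\mathbb{R}^2$ (the specific numbers are mine, purely for illustration):

```python
import numpy as np

# An arbitrary hyperplane in R^2: f(x) = x^T beta + beta_0 = 0
beta = np.array([3.0, 4.0])
beta_0 = -5.0

def f(x):
    return x @ beta + beta_0

def signed_distance(x):
    # Property 2: signed distance from x to the hyperplane
    return f(x) / np.linalg.norm(beta)

x_on_plane = np.array([-1.0, 2.0])     # f = -3 + 8 - 5 = 0, so this point lies on L
x_above = np.array([4.0, 2.0])         # f = 12 + 8 - 5 = 15 > 0

print(np.isclose(f(x_on_plane), 0.0))  # True: the point lies on the hyperplane
print(signed_distance(x_above))        # 3.0: the point is 3 units away, on the positive side
print(np.sign(f(x_above)))             # 1.0: the sign identifies the side (Property 3)
```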

The Separating Hyperplane Classifier

Definition: (Linear Separability) Consider two sets $X_1, X_2 \subset \mathbb{R}^p$. These sets are said to be linearly separable if there exist $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$ such that $f(x) = x^T\beta + \beta_0$ satisfies $f(x) > 0$ for every $x \in X_1$ and $f(x) < 0$ for every $x \in X_2$.

Now consider a set of $N$ pairs of training data,

$$D = \left\{(x_i, y_i) \;\middle|\; x_i \in \mathbb{R}^p,\; y_i \in \{-1, 1\}\right\}_{i=1}^{N},$$

and a classification rule induced by $f(x)$,

$$G(x) = \operatorname{sign}\left[x^T\beta + \beta_0\right].$$

If the classes are linearly separable, there exists a hyperplane that separates them from each other, assigning a specific sign to all the data points on one side and the opposite sign to all the data points on the other side.

Thus, there exists an $f(x) = x^T\beta + \beta_0$ with the property $y_i f(x_i) > 0$ for all $i$ (this holds exactly when all the data is correctly classified by the classifier induced by $f$).
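For example, given a candidate pair $(\beta, \beta_0)$, checking whether it separates a labelled data set reduces to checking this sign condition. A minimal sketch (X, y, beta, and beta_0 are placeholders to be supplied by the caller):

```python
import numpy as np

def separates(X, y, beta, beta_0):
    """Return True if the hyperplane x^T beta + beta_0 = 0 linearly separates the
    labelled data, i.e. y_i * f(x_i) > 0 for every training point (y_i in {-1, +1})."""
    return bool(np.all(y * (X @ beta + beta_0) > 0))
```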

We can attempt to exploit this property to help find $f$ by setting up the minimization problem

$$\min_{\beta, \beta_0} \; -\sum_{i \in \mathcal{M}} y_i\left(x_i^T\beta + \beta_0\right),$$

where $\mathcal{M}$ is the set of points misclassified by the current $\beta$ and $\beta_0$. In fact, this was one of the earliest approaches to the problem of finding separating hyperplanes, known as Rosenblatt's perceptron learning algorithm.
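This criterion is typically minimized by stochastic gradient descent, stepping on one misclassified point at a time. Below is a minimal sketch of that update rule; the constant learning rate rho and the fixed iteration cap are assumptions of convenience, not part of the original algorithm:

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_iter=1000):
    """Stochastic gradient descent on the perceptron criterion
    -sum_{i in M} y_i (x_i^T beta + beta_0), with y_i in {-1, +1}.
    Converges to *a* separating hyperplane only if the classes are linearly separable."""
    n, p = X.shape
    beta, beta_0 = np.zeros(p), 0.0
    for _ in range(max_iter):
        n_misclassified = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ beta + beta_0) <= 0:   # x_i is misclassified (or lies on L)
                beta += rho * y_i * x_i            # gradient of the criterion w.r.t. beta is -y_i x_i
                beta_0 += rho * y_i                # gradient w.r.t. beta_0 is -y_i
                n_misclassified += 1
        if n_misclassified == 0:                   # every point is correctly classified
            break
    return beta, beta_0
```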

This approach has several issues, but the primary one is that it has many solutions, and which one is found depends on the starting point. This issue can be overcome by adding additional constraints to the separating hyperplane.

The Optimal Separating Hyperplane Classifier

The optimal separating hyperplane separates the two classes AND maximizes the distance to the closest point from either class. This provides a unique solution to the separating hyperplane problem and results in a better classifier.

So we consider maximizing the margin, $M \in \mathbb{R}_+$, such that each point is at least a distance $M$ away from the separating hyperplane,

$$\max_{\beta, \beta_0} \; M \quad \text{s.t.} \quad y_i\left[\frac{1}{\|\beta\|}\left(x_i^T\beta + \beta_0\right)\right] \geq M, \quad \text{for } i = 1, \dots, N.$$

We have already shown that the term inside the square brackets is the signed distance from $x_i$ to the hyperplane. Taking the product with $y_i$ ensures that each point must lie on the correct side of the hyperplane in order to satisfy the constraint.

It should be clear that for any $\beta$ and $\beta_0$ that satisfy this inequality, any positively scaled multiple of them will satisfy the inequality as well. To remove this extra degree of freedom we can express the margin in terms of the norm of $\beta$ by setting $\|\beta\| = 1/M$.
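Concretely, multiplying the constraint through by $\|\beta\| > 0$ and then imposing $\|\beta\| = 1/M$ gives

$$y_i\left[\frac{1}{\|\beta\|}\left(x_i^T\beta + \beta_0\right)\right] \geq M \quad\Longleftrightarrow\quad y_i\left(x_i^T\beta + \beta_0\right) \geq M\|\beta\| = 1,$$

so that maximizing the margin $M$ is equivalent to maximizing $1/\|\beta\|$.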

Our problem becomes

$$\max_{\beta, \beta_0} \; \frac{1}{\|\beta\|} \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) \geq 1, \quad \text{for } i = 1, \dots, N,$$

or equivalently,

$$\min_{\beta, \beta_0} \; \|\beta\| \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) \geq 1, \quad \text{for } i = 1, \dots, N.$$

Thus, we are minimizing $\|\beta\|$ while ensuring all the points are at least $\frac{1}{\|\beta\|}$ away from the hyperplane.

The objective function is convex and the constraints are linear inequalities. If they were linear equalities, we would use the method of Lagrange multipliers. Instead, we rely on its generalization, often called the Karush-Kuhn-Tucker approach.

Computing the Optimal Separating Hyperplane Classifier

First, we re-write our problem such that the derivatives are slightly neater and the constraints are in the standard form,

$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{s.t.} \quad y_i\left(x_i^T\beta + \beta_0\right) - 1 \geq 0, \quad \text{for } i = 1, \dots, N.$$
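This is a quadratic program with linear inequality constraints, so it can be handed directly to off-the-shelf convex-optimization software. The following minimal sketch uses cvxpy, which is simply my choice of solver interface here; X is an $N \times p$ array and y a vector of labels in $\{-1, 1\}$ supplied by the caller:

```python
import cvxpy as cp

def optimal_separating_hyperplane(X, y):
    """Solve  min 1/2 ||beta||^2  s.t.  y_i (x_i^T beta + beta_0) >= 1  for all i.
    Assumes the classes are linearly separable; otherwise the problem is infeasible."""
    n, p = X.shape
    beta = cp.Variable(p)
    beta_0 = cp.Variable()
    constraints = [cp.multiply(y, X @ beta + beta_0) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
    problem.solve()
    return beta.value, beta_0.value
```

The maximal margin is then recovered as $M = 1/\|\beta\|$, and the Lagrange multipliers of the margin constraints are also available after solving (in cvxpy, via the constraint's dual_value attribute), which ties in with the discussion below.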

This problem can be solved using standard software, as sketched above (although in practice it is usually transformed into its dual formulation first). However, to gain insight into the solution we will now examine the Lagrangian,

$$L(l, \alpha) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N} \alpha_i\left[y_i\left(x_i^T\beta + \beta_0\right) - 1\right],$$

with $l = (\beta, \beta_0)$ a vector containing the variables w.r.t. which we want to minimize the function and $\alpha = (\alpha_1, \dots, \alpha_N)$ the vector of Lagrange multipliers.

For the vector $l = (\beta, \beta_0)$ to be a saddle point of the Lagrangian above as well as a solution to the original minimization problem, it needs to satisfy a set of conditions known as the Karush-Kuhn-Tucker conditions (see Theorem 12.1 in 2006 - Nocedal Wright - Numerical Optimization), which uniquely characterize the solution. Three of these conditions are relevant to our understanding of optimal hyperplane classifiers.

The first is obtained by setting the partial derivative of $L$ w.r.t. $\beta$ equal to zero (a necessary condition for a saddle point), which provides

$$\beta = \sum_{i=1}^{N} \alpha_i y_i x_i.$$

The further two conditions that are relevant to us are

$$\alpha_i \geq 0 \qquad \text{and} \qquad \alpha_i\left[y_i\left(x_i^T\beta + \beta_0\right) - 1\right] = 0$$

for $i = 1, \dots, N$.

For each $i$, the final condition can be satisfied in one of two ways:

  1. Either $\alpha_i = 0$, in which case the constraint is called inactive, as it plays no role in determining the solution $\beta$.
  2. $\alpha_i > 0$ and $y_i\left(x_i^T\beta + \beta_0\right) = 1$, in which case the constraint is called active.

In the second case, the data point $x_i$ lies exactly on the boundary of the slab defined by the optimal hyperplane and our margin. Such a point is known as a support point or support vector, and it is clear from our first condition that it is only these points that dictate our solution! Note, however, that finding these points requires using the entire data set.
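To see this support-vector structure concretely, one option is to fit a linear SVM with a very large cost parameter C, which closely approximates the hard-margin classifier described here (scikit-learn only implements the soft-margin formulation, covered in Part 2). A minimal sketch on toy data of my own choosing:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in R^2 (arbitrary, purely for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-3.0, -3.0], size=(20, 2)),
               rng.normal(loc=[3.0, 3.0], size=(20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A hard margin is approximated with a very large C.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

# Only the support vectors carry non-zero multipliers alpha_i; for a linear kernel
# scikit-learn stores the products y_i * alpha_i for those points in dual_coef_.
print(clf.support_vectors_.shape)    # (n_support, 2): typically far fewer rows than X

# Reconstruct beta = sum_i alpha_i y_i x_i from the support vectors alone.
beta = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta, clf.coef_))  # True: beta is dictated by the support vectors only
```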