Support Vector Machines - Part 1: Hyperplane Classifiers
These notes are part 1 of 6 lectures I gave at the University of Kiev in April 2023, for the Machine Learning Algorithms course.
- Support Vector Machines - Part 1: Hyperplane Classifiers
- Support Vector Machines - Part 2: SVCs and SVMs
Additional lectures in this series were taught by Petra Posedel Šimović from the University of Zagreb and Ramón Prat Casanovas from the Autonomous University of Barcelona.
These lectures are heavily based on Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning, which I can highly recommend.
Separating Hyperplane Classifiers
Separating hyperplane classifiers construct linear decision boundaries that explicitly try to separate the data into different classes as well as possible.
Geometry of Hyperplanes
Let $L$ be the hyperplane (an affine set in $\mathbb{R}^p$) defined by
$$L = \left\{ x \in \mathbb{R}^p : f(x) = \beta_0 + \beta^T x = 0 \right\},$$
with $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$.
We focus now on three properties of hyperplanes that are essential to our construction of hyperplane classifiers.
Property 1: The normal to the surface $L$ is $\beta^* = \beta / \|\beta\|$.
Proof:
For any two points $x_1, x_2 \in L$ we have $\beta^T x_1 + \beta_0 = \beta^T x_2 + \beta_0 = 0$, so that
$$\beta^T (x_1 - x_2) = 0,$$
where $x_1 - x_2$ is an arbitrary vector lying in the surface. Hence $\beta$ is orthogonal to the surface, and $\beta^* = \beta / \|\beta\|$ is its unit normal. $\square$
Property 2: The signed distance from any point $x \in \mathbb{R}^p$ to $L$ is $\frac{1}{\|\beta\|} f(x) = \frac{1}{\|\beta\|} \left( \beta^T x + \beta_0 \right)$.
Proof:
For any $x_0 \in L$ we have $\beta^T x_0 = -\beta_0$. The signed distance from $x$ to $L$ is the projection of $x - x_0$ onto the unit normal $\beta^*$,
$$\beta^{*T} (x - x_0) = \frac{1}{\|\beta\|} \left( \beta^T x - \beta^T x_0 \right) = \frac{1}{\|\beta\|} \left( \beta^T x + \beta_0 \right) = \frac{1}{\|\beta\|} f(x). \qquad \square$$
Property 3: All points on one side of the hyperplane satisfy $f(x) > 0$, and all points on the other side satisfy $f(x) < 0$.
Proof: Follows directly from Property 2.
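To make Property 2 concrete, here is a minimal numerical sketch (the hyperplane and the test point are arbitrary assumptions, not values from the lecture):

```python
import numpy as np

# Assumed, arbitrary hyperplane f(x) = beta0 + beta^T x = 0
beta = np.array([2.0, -1.0])
beta0 = -1.0

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane {x : beta0 + beta^T x = 0} (Property 2)."""
    return (beta0 + beta @ x) / np.linalg.norm(beta)

x = np.array([3.0, 1.0])
print(signed_distance(x, beta, beta0))   # > 0: x lies on the side that beta points towards
```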
The Separating Hyperplane Classifier
Definition: (Linear Separability)
Consider two sets of points $A, B \subset \mathbb{R}^p$. We say $A$ and $B$ are linearly separable if there exist $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$ such that $\beta^T a + \beta_0 > 0$ for every $a \in A$ and $\beta^T b + \beta_0 < 0$ for every $b \in B$.
Now consider a set of $N$ training points $x_1, \ldots, x_N \in \mathbb{R}^p$
and a classification induced by the labels $y_1, \ldots, y_N \in \{-1, 1\}$.
If the classes are linearly separable, there exists a hyperplane that separates them from each other, assigning one sign of $f$ to all data points on one side and the opposite sign to all data points on the other side.
Thus, there exists an $f(x) = \beta_0 + \beta^T x$ such that $y_i f(x_i) > 0$ for all $i = 1, \ldots, N$.
We can attempt to exploit this property to help find $\beta$ and $\beta_0$, for example by minimizing
$$D(\beta, \beta_0) = - \sum_{i \in \mathcal{M}} y_i \left( \beta^T x_i + \beta_0 \right),$$
where $\mathcal{M}$ is the set of points misclassified by the current $f$ (the criterion driving the perceptron learning algorithm).
This approach has several issues, but the primary one is that it has many solutions, and which one is found depends on the starting point. This issue can be overcome by adding additional constraints to the separating hyperplane.
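A minimal sketch of this kind of procedure, using perceptron-style updates on synthetic, linearly separable data (the data, learning rate, and starting points are all illustrative assumptions); running it from two different starting points typically returns two different separating hyperplanes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, linearly separable toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

def find_separating_hyperplane(X, y, beta, beta0, lr=1.0, max_epochs=100):
    """Stochastic gradient steps on the misclassified points only."""
    beta, beta0 = beta.astype(float).copy(), float(beta0)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (beta @ xi + beta0) <= 0:   # point is misclassified
                beta += lr * yi * xi            # step that reduces D(beta, beta0)
                beta0 += lr * yi
                mistakes += 1
        if mistakes == 0:                       # every point correctly classified
            break
    return beta, beta0

# Different starting points generally give different separating hyperplanes.
print(find_separating_hyperplane(X, y, np.array([1.0, 0.0]), 0.0))
print(find_separating_hyperplane(X, y, np.array([0.0, -1.0]), 5.0))
```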
The Optimal Separating Hyperplane Classifier
The optimal separating hyperplane separates the two classes AND maximizes the distance to the closest point from either class. This provides a unique solution to the separating hyperplane problem and results in a better classifier.
So we consider maximizing the margin,
$$\max_{\beta, \beta_0} \; M(\beta, \beta_0), \qquad M(\beta, \beta_0) = \min_{i = 1, \ldots, N} \; y_i \left[ \frac{1}{\|\beta\|} \left( \beta^T x_i + \beta_0 \right) \right].$$
We have already shown that the term inside the square brackets is the signed distance from $x_i$ to the hyperplane (Property 2); multiplying by $y_i$ makes it positive exactly when $x_i$ lies on its own side of the boundary, so $M$ is the distance from the boundary to the closest point of either class.
It should be clear that for any solution, rescaling $(\beta, \beta_0)$ to $(c\beta, c\beta_0)$ with $c > 0$ changes neither the hyperplane nor the signed distances. We can therefore fix the scale by requiring the closest points to satisfy $y_i (\beta^T x_i + \beta_0) = 1$, which makes the margin equal to $1 / \|\beta\|$.
Our problem becomes
$$\max_{\beta, \beta_0} \; \frac{1}{\|\beta\|} \quad \text{subject to} \quad y_i \left( \beta^T x_i + \beta_0 \right) \geq 1, \quad i = 1, \ldots, N,$$
or equivalently,
$$\min_{\beta, \beta_0} \; \frac{1}{2} \|\beta\|^2 \quad \text{subject to} \quad y_i \left( \beta^T x_i + \beta_0 \right) \geq 1, \quad i = 1, \ldots, N.$$
Thus, we are minimizing $\|\beta\|$, which is the same as maximizing the margin $M = 1/\|\beta\|$.
The objective function is convex and the constraint equations are linear inequalities. If they were linear equalities, we would use the method of Lagrange multipliers. Instead, we rely on the generalization, often called the Karush-Kuhn-Tucker approach.
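To make the scaling argument concrete, the following sketch checks numerically that rescaling $(\beta, \beta_0)$ leaves the signed distances unchanged, and that once the closest point is scaled to satisfy $y_i f(x_i) = 1$ the margin equals $1/\|\beta\|$ (the data and the candidate hyperplane are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.4, (10, 2)), rng.normal(2, 0.4, (10, 2))])
y = np.array([-1] * 10 + [1] * 10)

beta, beta0 = np.array([1.0, 1.0]), 0.0     # an assumed candidate separating hyperplane

def signed_dist(X, beta, beta0):
    return (X @ beta + beta0) / np.linalg.norm(beta)

# 1. Rescaling (beta, beta0) by any c > 0 leaves the signed distances unchanged.
print(np.allclose(signed_dist(X, beta, beta0), signed_dist(X, 7.3 * beta, 7.3 * beta0)))

# 2. Choose the scale so that the closest point satisfies y_i f(x_i) = 1;
#    the margin (distance to the closest point) is then exactly 1 / ||beta||.
c = 1.0 / np.min(y * (X @ beta + beta0))
beta_s, beta0_s = c * beta, c * beta0
print(np.min(y * signed_dist(X, beta_s, beta0_s)), 1.0 / np.linalg.norm(beta_s))
```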
Computing the Optimal Separating Hyperplane Classifier
First, we re-write our problem such that the derivatives are slightly neater and the constraints are in the standard form,
$$\min_{\beta, \beta_0} \; \frac{1}{2} \|\beta\|^2 \quad \text{subject to} \quad 1 - y_i \left( \beta^T x_i + \beta_0 \right) \leq 0, \quad i = 1, \ldots, N.$$
This problem can be solved using standard software (although it is usually transformed into its dual formulation). However, to gain insight into the solution we will now examine the Lagrangian,
$$L_P(\beta, \beta_0, \alpha) = \frac{1}{2} \|\beta\|^2 + \sum_{i=1}^{N} \alpha_i \left[ 1 - y_i \left( \beta^T x_i + \beta_0 \right) \right],$$
with Lagrange multipliers $\alpha_i \geq 0$.
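As an aside, here is a minimal sketch of handing the primal problem above to general-purpose software, using SciPy's constrained optimizer on synthetic separable data (the data and every name below are illustrative assumptions, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
# Synthetic, linearly separable toy data.
X = np.vstack([rng.normal(-2, 0.4, (15, 2)), rng.normal(2, 0.4, (15, 2))])
y = np.array([-1] * 15 + [1] * 15)

def objective(w):
    beta = w[:-1]                      # pack the variables as w = (beta, beta0)
    return 0.5 * beta @ beta           # (1/2) ||beta||^2

# One constraint per data point: y_i (beta^T x_i + beta0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda w, xi=xi, yi=yi: yi * (xi @ w[:-1] + w[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints)
beta_hat, beta0_hat = res.x[:-1], res.x[-1]
print("beta:", beta_hat, "beta0:", beta0_hat)
print("margin M = 1/||beta|| =", 1.0 / np.linalg.norm(beta_hat))
```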
For the vector $\hat{\beta}$ and scalar $\hat{\beta}_0$ that solve this problem, together with the multipliers $\hat{\alpha}_i$, the Karush-Kuhn-Tucker conditions must hold.
The first is obtained by setting the partial derivative of $L_P$ with respect to $\beta$ to zero, and the second by doing the same with respect to $\beta_0$:
$$\hat{\beta} = \sum_{i=1}^{N} \hat{\alpha}_i y_i x_i, \qquad 0 = \sum_{i=1}^{N} \hat{\alpha}_i y_i.$$
The further two conditions that are relevant to us are
$$\hat{\alpha}_i \geq 0 \qquad \text{and} \qquad \hat{\alpha}_i \left[ 1 - y_i \left( \hat{\beta}^T x_i + \hat{\beta}_0 \right) \right] = 0$$
for all $i = 1, \ldots, N$.
For each $i$, the last condition (complementary slackness) means that one of two things must be true:
- Either $\hat{\alpha}_i = 0$ and $y_i \left( \hat{\beta}^T x_i + \hat{\beta}_0 \right) > 1$, in which case the constraint is called inactive, as it plays no role in determining the solution; or
- $\hat{\alpha}_i > 0$ and $y_i \left( \hat{\beta}^T x_i + \hat{\beta}_0 \right) = 1$, in which case the constraint is active.
In the second case, the data point $x_i$ lies exactly on the boundary of the margin. These points are the support vectors: since $\hat{\beta} = \sum_i \hat{\alpha}_i y_i x_i$ and $\hat{\alpha}_i = 0$ for all other points, the solution is determined by the support vectors alone.
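A minimal numerical check of these conditions, using scikit-learn's SVC with a very large C as a stand-in for the hard-margin problem (the data and the choice of C are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.4, (15, 2)), rng.normal(2, 0.4, (15, 2))])
y = np.array([-1] * 15 + [1] * 15)

# A very large C makes the soft-margin SVC approximate the hard-margin problem above.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

beta, beta0 = clf.coef_.ravel(), clf.intercept_[0]
margins = y * (X @ beta + beta0)                 # y_i f(x_i) for every training point

# Active constraints (alpha_i > 0): the support vectors, with y_i f(x_i) = 1.
print("support vector indices:", clf.support_)
print("y_i f(x_i) on support vectors:", margins[clf.support_])

# Inactive constraints (alpha_i = 0): every other point satisfies y_i f(x_i) > 1.
others = np.setdiff1d(np.arange(len(y)), clf.support_)
print("smallest y_i f(x_i) among non-support points:", margins[others].min())
```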