Support Vector Machines - Part 2: SVCs and SVMs

These notes are the second of six lectures I gave at the University of Kiev in April 2023 for the Machine Learning Algorithms course.

  1. Support Vector Machines - Part 1: Hyperplane Classifiers
  2. Support Vector Machines - Part 2: SVCs and SVMs

Additional lectures in this series were taught by Petra Posedel Šimović from the University of Zagreb and Ramón Prat Casanovas from the Autonomous University of Barcelona.

These lectures are heavily based on Hastie et al. (2006), The Elements of Statistical Learning, which I can highly recommend.

Support Vector Classifiers

We consider again the optimal hyperplane classifier,

\[G(x)=\operatorname{sign}\left[x^\top \beta+\beta_0\right],\]

but now for the case when the data is not linearly separable. This means that some points will lie on the wrong side of their respective margin, perhaps even on the wrong side of the hyperplane (and thus be misclassified).

To account for this, we modify the optimization problem so that each point is granted a slack of $\xi_i$, the relative distance by which it may lie inside the margin,

\[\min_{\beta,\,\beta_0} {\|\beta\|} \quad\mathrm{s.t.}\quad y_i(x_i^\top\beta + \beta_0)\geq 1 - \xi_i, \text{ for } i=1,\dots,N.\]

It is clear that if $\xi_i>1$, then the corresponding point $x_i$ is misclassified. We can therefore bound the number of misclassified points by insisting that $\sum_{i=1}^N\xi_i\leq K$, where $K$ is a tunable constant.

The resulting optimization problem can be stated as

\[\min_{\beta,\,\beta_0} \|\beta\| \text { subject to }\left\{\begin{array}{l} y_i\left(x_i^\top \beta+\beta_0\right) \geq 1-\xi_i\ \forall i, \\ \xi_i \geq 0,\ \sum_{i=1}^N \xi_i \leq K\text {,} \end{array}\right.\]

and the solution is known as a support vector classifier. The support vector classifier is the generalization of the optimal separating hyperplane classifier to the case of non-linearly separable data.

Computing the Support Vector Classifier

The above optimization problem has a convex objective function and linear inequality constraints, so, as before, we rely on the Karush-Kuhn-Tucker approach.

We can re-write the optimization problem in a more standard form

\[\begin{aligned} & \min _{\beta, \beta_0} \frac{1}{2}\|\beta\|^2+C \sum_{i=1}^N \xi_i \\ & \text { subject to } \quad \xi_i \geq 0,\\ &\text{ and } \quad y_i\left(x_i^\top \beta+\beta_0\right) - (1-\xi_i) \geq 0\ \forall i, \end{aligned}\]

where the “cost” parameter $C$ has replaced the constant $K$ and serves as the tuning parameter for the algorithm. This problem can be solved numerically using standard software. However, to highlight how support vector classifiers differ from separating hyperplane classifiers, we once again examine the corresponding Lagrangian.
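As a concrete illustration, here is a minimal sketch of fitting a support vector classifier with off-the-shelf software; scikit-learn and the toy data are my own choices and not part of the lecture material. The `C` argument plays the role of the cost parameter above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data that is not perfectly linearly separable.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1  # relabel the classes as -1/+1 to match the notation above

# A linear-kernel SVC solves the soft-margin problem; C is the cost parameter.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

beta = clf.coef_.ravel()    # estimated beta
beta_0 = clf.intercept_[0]  # estimated beta_0
print("beta:", beta, "beta_0:", beta_0)
print("training accuracy:", clf.score(X, y))
```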

Let $\xi=(\xi_1,\dots,\xi_N)$ be a vector of the slack parameters and construct $l=(\beta, \beta_0, \xi^\top)^\top$. Then the Lagrangian becomes

\[\begin{aligned} \mathcal{L}(l, \alpha, \mu) =& \frac{1}{2}{\|\beta\|^2} + C\sum_{i=1}^N\xi_i\\ &-\sum_{i=1}^N\alpha_i\left[y_i(x_i^\top\beta + \beta_0)-(1-\xi_i)\right] \\ &- \sum_{i=1}^N\mu_i\xi_i, \end{aligned}\]

where $\alpha$ and $\mu$ are the vectors of the Lagrange multipliers. The Karush-Kuhn-Tucker conditions characterize the unique solution to this minimization problem, and we focus on three of the necessary conditions,

\[\begin{aligned} &\beta=\sum_{i=1}^N \alpha_i y_i x_i,\\ &\alpha_i \geq 0, \text{ and } \\ &\alpha_i\left[y_i\left(x_i^\top\beta+\beta_0\right)-\left(1-\xi_i\right)\right]=0. \end{aligned}\]

As before, the final constraint can be satisfied in one of two ways: either $\alpha_i=0$ and the corresponding data point plays no role in determining the solution $\beta$, or

\[y_i\left(x_i^\top\beta+\beta_0\right)=\left(1-\xi_i\right)\]

and the point lies either on or within its corresponding margin. Thus, only the points that lie on or within the margin (and are potentially misclassified) play a role in determining $\beta$. These points are known as the support vectors.
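To make the support-vector characterization concrete, the sketch below (again assuming scikit-learn as the tool) extracts the support vectors of a fitted classifier; points with $\alpha_i=0$ simply do not appear in these attributes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1  # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Indices of the support vectors, i.e. the points with alpha_i > 0;
# all other training points play no role in determining beta.
print("support vector indices:", clf.support_)
print("support vectors per class:", clf.n_support_)

# dual_coef_ holds alpha_i * y_i for the support vectors only.
# alpha_i strictly between 0 and C: the point sits exactly on its margin;
# alpha_i equal to C: the point is inside the margin or misclassified.
alpha = np.abs(clf.dual_coef_.ravel())
at_bound = np.isclose(alpha, clf.C)
print("on the margin:", clf.support_[~at_bound])
print("inside the margin (or misclassified):", clf.support_[at_bound])
```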

Support Vector Machines

Previously, we worked with training data of the form \(\mathcal{D} = \{(x_i, y_i) | x_i\in\mathbb{R}^p,\ y_i\in\{-1, 1 \}\}_{i=1}^N,\) and considered finding linear boundaries in the input feature space. First, for the case of linearly separable data, we considered optimal separating hyperplanes, and then for the non-linearly separable case, we considered support vector classifiers.

We introduce support vector machines as a generalization of support vector classifiers that instead find a linear boundary in an enlarged feature space created by using basis expansions.

Consider a set of $M>p$ basis functions, $h_m(x)$, for $m = 1,\dots, M$, and the enlarged input feature space created by transforming each input using these basis functions, i.e., the new input features become \(h(x_i) = (h_1(x_i), \dots, h_M(x_i))\) for $i = 1,\dots,N$.

A hyperplane in this basis, $\hat{f}(x) = h(x)^\top\hat{\beta} + \hat{\beta}_0,$ induces the classifier $\hat{G}(x) = \operatorname{sign}(\hat{f}(x))$. We can now proceed as we did for the classical support vector classifier, but the final result will be a non-linear classifier in the original space.

This linear classifier in a higher-dimensional basis is known as a support vector machine (SVM) and, on the surface, it seems to have made our classification problem more computationally expensive by increasing the dimension of our input features. In the next section, we show why this is not the case.
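As a sketch of the idea, the snippet below (scikit-learn and a degree-2 polynomial expansion are my illustrative choices, not prescribed by these notes) fits a linear classifier on explicitly expanded features, giving a boundary that is nonlinear in the original two inputs.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Two-class data that no straight line in the original space separates well.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Explicit basis expansion h(x): all monomials of degree <= 2,
# followed by an ordinary linear support vector classifier.
svm = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    SVC(kernel="linear", C=1.0),
)
svm.fit(X, y)

# The boundary is linear in the expanded features but curved in (X1, X2).
print("training accuracy:", svm.score(X, y))
```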

Computing the Support Vector Machine

The Inner Product Form

Consider the previously derived Lagrangian of the support vector classifier, but now in the transformed feature space,

\[\begin{aligned} \hat{\mathcal{L}}(l, \alpha, \mu) =& \frac{1}{2}{\|\hat\beta\|^2} + C\sum_{i=1}^N\xi_i\\ &-\sum_{i=1}^N\alpha_i\left[y_i\left(h(x_i)^\top\hat\beta + \hat\beta_0\right)-(1-\xi_i)\right] \\ &- \sum_{i=1}^N\mu_i\xi_i, \end{aligned}\]

where now $l=(\hat\beta,\hat\beta_0,\xi^\top)^\top$. Setting the partial derivatives with respect to each of the components of $l$ equal to zero gives

\[\begin{aligned} \hat\beta & =\sum_{i=1}^N \alpha_i y_i h(x_i), \\ 0 & =\sum_{i=1}^N \alpha_i y_i, \\ \alpha_i & =C-\mu_i, \forall i, \end{aligned}\]

which, when substituted back into $\hat{\mathcal{L}}$, yields what’s known as the Lagrangian dual objective function

\[\hat{\mathcal{L}}_D=\sum_{i=1}^N \alpha_i-\frac{1}{2} \sum_{i=1}^N \sum_{i^{\prime}=1}^N \alpha_i \alpha_{i^{\prime}} y_i y_{i^{\prime}} h(x_i)^\top h(x_{i^{\prime}}).\]

Also, using the first partial derivative result above, we can rewrite the solution to our classification problem as

\[\hat{f}(x) = h(x)^\top\sum_{i=1}^N \alpha_i y_i h(x_i) + \hat{\beta}_0.\]

Both of these expressions (the Lagrangian dual objective function and the hyperplane solution) can be written in terms of the inner product,

\[\hat{\mathcal{L}}_D=\sum_{i=1}^N \alpha_i-\frac{1}{2} \sum_{i=1}^N \sum_{i^{\prime}=1}^N \alpha_i \alpha_{i^{\prime}} y_i y_{i^{\prime}}\left\langle h\left(x_i\right), h\left(x_{i^{\prime}}\right)\right\rangle,\]

and

\[\hat{f}(x)=\sum_{i=1}^N \alpha_i y_i\left\langle h(x), h\left(x_i\right)\right\rangle+\hat{\beta}_0.\]
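The sketch below, which assumes scikit-learn with a linear kernel (i.e. $h(x)=x$), reconstructs $\hat{f}(x)$ from the fitted dual coefficients $\alpha_i y_i$ and checks it against the library's own decision function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# f_hat(x) = sum_i alpha_i y_i <x, x_i> + beta_0, where the sum runs
# over the support vectors only (all other alpha_i are zero).
inner_products = clf.support_vectors_ @ X.T        # <x_i, x> for each support vector
f_hat = clf.dual_coef_ @ inner_products + clf.intercept_

assert np.allclose(f_hat.ravel(), clf.decision_function(X))
```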

The Kernel Form

Since the optimization problem and the solution involve $h(x)$ only through inner products, we do not in fact need to specify the transformation $h(x)$ at all, but only the kernel function

\[K\left(x, x^{\prime}\right)=\left\langle h(x), h\left(x^{\prime}\right)\right\rangle\]

that computes the inner product in the transformed space. Note that $K$ should be a symmetric, positive (semi-)definite function.

Two popular choices for $K$ in the SVM literature are:

  1. the $d$th-degree polynomial kernel, \(K(x, x^{\prime}) = \left(1+\left\langle x, x^{\prime}\right\rangle\right)^d\), and
  2. the radial basis (Gaussian) kernel, \(K(x, x^{\prime}) = \exp\left(-\gamma\left\|x-x^{\prime}\right\|^2\right)\).
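As a quick empirical sanity check (numpy and scikit-learn's pairwise kernel helpers are my assumed tools here), the Gram matrix of either kernel evaluated on a random sample is symmetric with non-negative eigenvalues, consistent with the positive semi-definiteness requirement.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

for name, gram in [
    ("polynomial (d=2)", polynomial_kernel(X, X, degree=2, gamma=1.0, coef0=1.0)),
    ("radial basis", rbf_kernel(X, X, gamma=1.0)),
]:
    # Symmetric, and all eigenvalues >= 0 up to numerical error.
    assert np.allclose(gram, gram.T)
    eigvals = np.linalg.eigvalsh(gram)
    print(name, "smallest eigenvalue:", eigvals.min())
    assert eigvals.min() > -1e-8
```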

Consider, for example, a feature space with two inputs, $X_1$ and $X_2$, and a polynomial kernel of degree 2. Then

\[\begin{aligned} K\left(X, X^{\prime}\right) & =\left(1+\left\langle X, X^{\prime}\right\rangle\right)^2 \\ & =\left(1+X_1 X_1^{\prime}+X_2 X_2^{\prime}\right)^2 \\ & =1+2 X_1 X_1^{\prime}+2 X_2 X_2^{\prime}+\left(X_1 X_1^{\prime}\right)^2+\left(X_2 X_2^{\prime}\right)^2+2 X_1 X_1^{\prime} X_2 X_2^{\prime} . \end{aligned}\]

Then $M=6$, and if we choose $h_1(X)=1, h_2(X)=\sqrt{2} X_1, h_3(X)=\sqrt{2} X_2, h_4(X)=X_1^2, h_5(X)=X_2^2$, and $h_6(X)=\sqrt{2} X_1 X_2$, then $K\left(X, X^{\prime}\right)=$ $\left\langle h(X), h\left(X^{\prime}\right)\right\rangle$.
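A few lines of numpy (my choice of tool, not part of the original notes) confirm this identity numerically for randomly drawn points.

```python
import numpy as np

def h(x):
    """Explicit degree-2 feature map for two inputs, as chosen above."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, xp):
    """Degree-2 polynomial kernel (1 + <x, x'>)^2."""
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    x, xp = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(K(x, xp), h(x) @ h(xp))
```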