We talked about VC dimension theory and explained how to improve the regularization property by reducing the hypothesis space $\mathcal{H}$. We also talked about linear models and how to take advantage of the linear model framework to learn non-linear models. Building on the foundation of all the theory we have covered so far, a very impressive machine learning technique called Support Vector Machines (SVM) was invented. For problems with limited data set size, SVM has routinely been shown to outperform other learning algorithms, so if you can only choose one machine learning model, you will generally not regret going with SVM. Only in recent years has SVM been surpassed by deep learning models, and only when large data sets are available. As this note only considers small data sets, SVM, for us, is the king of the castle.
Maximum Margin Linear Classifier
SVM has a built-in regularization property! It relies on the intuitive idea of a maximum classification margin. Let us only consider the case where the $N$ input data points in $\mathcal{D}$ are linearly separable, i.e., there exists a hyperplane that can classify all $N$ points without error. Although there are many solutions that fit the bill (Figure 5.1a), SVM says we should go with the one that has the maximum margin, or we say the fattest decision plane, with two margin planes called boundaries (Figure 5.1b), as this one is the most robust against measurement uncertainty. Why is that? Each data point $\mathbf{x}$ in the real physical world carries some degree of uncertainty. When we measure the height of a person, uncertainty comes from the measuring tool, from the height itself varying at different times of the day, from whether the measuring protocol is shoes-on or shoes-off, etc. So it is best to assume each $\mathbf{x}$ is only an estimate of its true value; in other words, the true value lies within a halo around $\mathbf{x}$. If we let the halo expand, eventually the true value will cross the linear separating plane, which means the point will be misclassified. The plane with the maximum margin can handle the largest halo size and can therefore accommodate the largest degree of uncertainty in $\mathbf{x}$. Figure 5.1c & d compare the error tolerance of a small-margin classifier and the maximum-margin classifier; the latter is more robust against error in $\mathbf{x}$.
Figure 5.1. The blue decision plane is the fattest one we can find for this example (a & b). It can tolerate the largest amount of uncertainty in input data points $\mathbf{x}$ (c & d).
The maximum margin concept is also supported by VC theory. Given $N$ data points, suppose we try to place a separating plane with a certain margin size. We can see in Figure 5.2 that the larger the margin, the smaller the possible space the separator can occupy, i.e., the smaller the hypothesis set, and therefore the smaller the effective VC dimension. If margins are sorted in increasing size, $m_1 < m_2 < \ldots < m_n$, their corresponding hypothesis sets are nested and their VC dimensions satisfy $d_1^{vc} \ge d_2^{vc} \ge \ldots \ge d_n^{vc}$ [1]. Among all the linear separating planes, the one with the maximum margin has the best regularization property.
Figure 5.2. A smaller-margin decision plane outlined in green can occupy a larger hypothesis space, compared to the maximum-margin decision plane outlined in blue.
SVM Engine
Here we only consider the case where the $N$ data points are linearly separable. A linear separating plane is the model:
$$\begin{equation} y = {\rm sign}(\mathbf{w} \cdot \mathbf{x} + b). \end{equation}$$
For $\mathbf{x}$ exactly on the separating plane, $\mathbf{w} \cdot \mathbf{x} + b = 0$; for other points, $y$ is either $+1$ or $-1$. Notice that here $\mathbf{x}$ does not contain the constant element 1 in its first dimension; we write $w_0$ explicitly as $b$, which keeps the math simpler. If our boundaries give the maximum margin, they must touch at least two data points, one on each side; otherwise, the margin could not be the maximum. Since we can always rescale $b$ and $\mathbf{w}$ by a positive factor without changing the classification results given by Equation (1), we choose the scaling so that $\mathbf{w} \cdot \mathbf{x} + b$ equals $+1$ and $-1$ for the data points touching the two boundaries.
In Figure 5.3, $\mathbf{x}^+$ is on the +1 boundary plane, $\mathbf{x}^-$ is on the -1 boundary plane, we have:
$$\begin{equation} \left\{
\begin{aligned}
\mathbf{w} \cdot \mathbf{x}^+ + b & = 1, \\
\mathbf{w} \cdot \mathbf{x}^- + b & = -1.
\end{aligned}
\right. \end{equation}$$
Figure 5.3. Illustration of how to calculate the size of the margin.
The thickness of the margin equals the projection of $\mathbf{x}^+ - \mathbf{x}^-$ (Figure 5.3, green arrow) along the normal direction of the planes ($\mathbf{w}$ in Figure 5.3):
$$\begin{equation} \begin{aligned} {\rm margin} &= (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\Vert{\mathbf{w}}\Vert} \\ &= \frac{2}{\Vert{\mathbf{w}}\Vert} \end{aligned} \end{equation}. $$
The second step uses Equation (2). Therefore, maximizing the margin is equivalent to minimizing $\mathbf{w}^\intercal\mathbf{w}$. Recall that in linear models, $\mathbf{w}^\intercal\mathbf{w}$ has been used as a regularization term describing the complexity of the linear hypothesis (with $w_0$ already excluded); therefore, we say SVM has built-in regularization, as it prefers simpler hypotheses. Be aware that built-in anti-over-fitting does not mean SVM is free from over-fitting; it will still over-fit if we run a 100-dimensional SVM on 10 data points! SVM is simply over-fitting-aware, and it is much more robust against over-fitting than other ML techniques.
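As a quick sanity check, here is a minimal sketch (assuming scikit-learn and a tiny hypothetical toy set; the chapter itself does not prescribe any library) that fits a linear SVM and reports the margin width $2/\Vert\mathbf{w}\Vert$ from Equation (3). A very large soft-margin penalty approximates the hard-margin case described in this section.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (hypothetical data, for illustration only).
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],   # class -1
              [4.0, 4.5], [5.0, 5.0], [4.5, 6.0]])  # class +1
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin SVM of this section.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                  # the optimal normal vector w*
b = clf.intercept_[0]             # the optimal bias b*
margin = 2.0 / np.linalg.norm(w)  # Equation (3): margin = 2 / ||w||

print("w* =", w, " b* =", b)
print("margin width =", margin)
print("support vectors:\n", clf.support_vectors_)
```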
Finding the maximum-margin decision plane means solving:
$$\begin{equation} \min_\mathbf{w} \; \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} \end{equation},$$
with the constraints:
$$\begin{equation} y_i (\mathbf{w}^\intercal\mathbf{x}_i + b) \ge 1. \quad i = 1, 2, \ldots, N \end{equation}$$
Our goal is to minimize $E_{out}$, which normally means minimizing $E_{in}$ plus a regularization term; however, Equation (4) contains only the regularization term, so where is $E_{in}$? Here we consider a linearly separable case, so we are only interested in solutions where all training data points are correctly classified, i.e., $E_{in} = 0$! How do we make sure we only consider $\mathbf{w}$ and $b$ for which all data points are correctly classified? The answer is Equation (5): the constraints mean all data points are on the correct side of the margin, on or outside the boundaries. The factor $\frac{1}{2}$ is just for convenience. In summary, Equation (5) makes sure we only consider decision hyperplanes where all training data are on the correct side, and Equation (4) makes sure we choose, among such candidates, the one that maximizes the margin.
The above constrained problem can be solved by introducing Lagrange multipliers, i.e., the problem is equivalent to:
$$\begin{equation} \min_{\mathbf{w}, b} \; \max_{\alpha_i \ge 0} \; \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} + \sum_{i=1}^{N} { \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b)}. \end{equation}$$
Let us try to understand why the Lagrange multiplier method works. Suppose we consider an invalid solution candidate $(\mathbf{w}, b)$, i.e., there is at least one data point, say $i$, that falls inside the two boundaries, so $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \gt 0$. Then $\alpha_i$ would go to $+\infty$ in order to maximize the new function, and the function value would be $+\infty$ for such an invalid $(\mathbf{w}, b)$. Such invalid candidates therefore have no chance to win, as other solutions give finite function values, so invalid solutions are naturally excluded from the solution pool. The basic idea is to cast our original function into a new form so that any $(\mathbf{w}, b)$ violating the constraints has zero chance of being our final solution (Figure 5.4).
The winning solution to Equation (6) naturally satisfies the following constraint:
$$\begin{equation} \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) = 0. \end{equation}$$Why is that? We already proved that the solution to Equation (6) cannot violate Equation (5), so no data point can fall within the boundaries of our solution. For data points strictly outside the boundaries, $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \lt 0$, so any $\alpha_i \gt 0$ would decrease the value of the function in Equation (6); thus $\alpha_i = 0$ is the best-case scenario for the $\max_{\alpha_i}$ operator. Therefore, in our final solution, only those data points falling exactly on the boundaries can have non-zero $\alpha_i$ values, i.e., they are the ones supporting the decision plane, and are therefore called support vectors. All other data points must have $\alpha_i = 0$.
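Off-the-shelf solvers expose exactly this structure. A small sketch (again assuming scikit-learn and the same hypothetical toy data as above) showing that only the support vectors carry non-zero $\alpha_i$, as Equation (7) requires:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data, for illustration only.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# every training point not listed in clf.support_ has alpha_i = 0 (Equation 7).
alpha = np.abs(clf.dual_coef_[0])   # alpha_i = |alpha_i * y_i|, since alpha_i >= 0
print("indices of support vectors :", clf.support_)
print("alpha_i on support vectors :", alpha)
```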
To solve Equation (6), mathematicians have proven that we can swap the order of min and max; we therefore take partial derivatives with respect to $b$ and $\mathbf{w}$ and set them to zero:
$$\begin{equation} \sum_{i=1}^N \; \alpha_i y_i = 0. \end{equation} $$
$$\begin{equation} \mathbf{w} = \sum_{i=1}^N \; \alpha_i y_i \mathbf{x}_i \end{equation}$$
Equation (8) implies each positively-labeled support vector contributes a positive force $\alpha_i$ and each negatively-labeled support vector contributes a negative force; all forces add to zero, so they are in balance. Thus there must be at least two support vectors, one positive and one negative. Equation (9) implies the optimal normal vector $\mathbf{w}^*$ is a linear combination of the $N$ data vectors, which we already know from our discussion of linear models in Chapter 4, Equation (18). The difference is that we only count the contributions from the support vectors, as all other points have $\alpha_i$ equal to zero.
Plugging Equations (8) and (9) back into Equation (6), we obtain a constrained minimization problem in the variables $\alpha_i$:
$$\begin{equation} \min_{\alpha_i \ge 0} \; \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \mathbf{x}_i^\intercal\mathbf{x}_j - \sum_{i=1}^{N} { \alpha_i }, \\ \sum_{i=1}^N \; y_i \alpha_i = 0. \quad (i = 1, \ldots , N)\end{equation}$$
We will not go further into the math of how to solve for the exact values of the optimal solution $\alpha_i^*$; that is a Quadratic Programming (QP) problem already solved by great mathematical minds. What is important is that our analysis above shows the solution of the maximum-margin problem relies on identifying a few support vectors among the $N$ input data points. These support vectors determine the separating plane, as only they have non-zero $\alpha_i^*$; all other data points make no contribution to the solution. We use Equation (10) to find $\alpha_i^*$ and Equation (9) to obtain $\mathbf{w}^*$. $b^*$ can then be calculated using any support vector on the boundaries, say point $\mathbf{x}_s$, as:
$$\begin{equation} \begin{aligned} b^* &= y_s - \mathbf{w}^{*\intercal}\mathbf{x}_s \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \mathbf{x}_i^{\intercal}\mathbf{x}_s. \end{aligned} \end{equation} $$
The second step comes from Equation (9), and $y_s$ is either +1 or -1, depending on which boundary $\mathbf{x}_s$ lies on.
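To make the recovery of $\mathbf{w}^*$ and $b^*$ concrete, here is a minimal sketch (scikit-learn assumed, with the same hypothetical toy data) that applies Equations (9) and (11) to the dual coefficients returned by the solver and checks them against the library's own primal solution:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data, for illustration only.
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 6.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

sv = clf.support_vectors_      # the support vectors x_i with alpha_i > 0
alpha_y = clf.dual_coef_[0]    # alpha_i * y_i for those support vectors

# Equation (9): w* = sum_i alpha_i y_i x_i (only support vectors contribute).
w_star = alpha_y @ sv

# Equation (11): b* = y_s - w*^T x_s, for any support vector x_s on a boundary.
s = clf.support_[0]            # index of one support vector in the training set
b_star = y[s] - w_star @ X[s]

print("w* from Equation (9) :", w_star, " vs clf.coef_     :", clf.coef_[0])
print("b* from Equation (11):", b_star, " vs clf.intercept_:", clf.intercept_[0])
```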
Cross Validation of SVM
The observation that the SVM solution depends only on a few support vectors leads to a very interesting result regarding its accuracy. Let us estimate the accuracy of SVM using LOOCV: each time we remove one data point and use it as the validation set. Suppose there are $k$ support vectors. As long as the removed data point $\mathbf{x}_i$ is not a support vector, i.e., $\alpha_i = 0$ in the solution of the original problem, the solution will not change. We can also see this from Equation (10): the solution $\alpha^*$ of the original problem remains feasible for the new problem with point $\mathbf{x}_i$ removed. Conversely, if the solution of the new problem were not the solution of the original problem, $\mathbf{x}_i$ would have to fall within the new decision boundary, making $\mathbf{x}_i$ a support vector of the old problem (the exact proof is to be worked out). Therefore, the optimal solution $\alpha^*$ of the original problem must also be the optimal solution when a non-support-vector data point $\mathbf{x}_i$ is omitted. Since we are dealing with the separable case, $y_i$ is already correctly predicted by our decision plane, so the validation error for such points is zero. Only when the validation point is a support vector can the decision plane change and the validation point be misclassified. Therefore, the LOOCV error of a separable SVM model is bounded by:
$$\begin{equation} E_{cv} \le \frac{k}{N}. \end{equation}$$
In most problems, the number of support vectors $k \ll N$, so the performance of SVM is generally very impressive, even for high-dimensional inputs. Because even in high dimensions SVM minimizes $\Vert \mathbf w \Vert$ and therefore constrains the hypothesis search space to a small region, the resulting number of support vectors $k$ grows only slowly as the dimension increases; therefore, SVM is generally pretty robust.
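We can check the bound of Equation (12) empirically. A rough sketch (scikit-learn assumed; the well-separated synthetic blobs are hypothetical, chosen so the hard-margin analysis roughly applies) comparing the actual LOOCV error with $k/N$:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.datasets import make_blobs

# Hypothetical well-separated two-class data, so the separable analysis roughly applies.
X, y = make_blobs(n_samples=40, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1e6)
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

clf.fit(X, y)
k = clf.support_vectors_.shape[0]        # number of support vectors
print("LOOCV error :", 1.0 - loo_acc)
print("k / N bound :", k / X.shape[0])   # Equation (12): E_cv <= k / N
```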
The linearly separable case is called hard-margin SVM. In the most general case, data points are not linearly separable, and we have to allow points to fall into the sandwiched margin space; this is called the soft-margin SVM problem. The scoring function then includes a penalty term for these points. We will not go into the details, as the logic and the solution form are quite similar to the hard-margin problem, except that there is a penalty hyper-parameter. Points that fall inside the margin become bounded support vectors, and they contribute equally, with the same $\alpha$ value; points on the boundaries are still margin support vectors that contribute varying $\alpha$ values; points beyond the margin have zero contribution. In real applications we always use soft-margin SVM, never hard-margin SVM. Although we do not discuss soft-margin SVM in detail here, because all the conclusions about hard-margin SVM apply to it as well (it is not clear whether the LOOCV result applies, though), it is an important topic to study, for example using LFD supplementary Chapter 8 [2].
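In scikit-learn (an assumption of this sketch; the text itself does not prescribe a library), the penalty hyper-parameter of the soft-margin formulation is exposed as C: a small C tolerates more margin violations (stronger regularization) and produces more support vectors, while a large C approaches hard-margin behaviour.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Overlapping classes, so a hard margin does not exist (hypothetical data).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv = clf.support_vectors_.shape[0]
    print(f"C={C:<6}  training accuracy={clf.score(X, y):.2f}  support vectors={n_sv}")
```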
Non-linear SVM and Kernel
Following the same non-linear idea outlined in Chapter 5, we can transform the $X$ space into the $Z$ space using a non-linear transformation:
$$\begin{equation} \mathbf{z} = \phi(\mathbf{x}) \end{equation}.$$
Then, in the $Z$ space, we assume a maximum-margin linear model would work well, as all non-linearity has been absorbed by the transformation function $\phi(\mathbf{x})$. We then run the SVM solver in the $Z$ space and obtain the solution to the problem:
$$\begin{equation}
\min_{\alpha_i \ge 0} \; \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \phi(\mathbf{x}_i)^\intercal\phi(\mathbf{x}_j) - \sum_{i=1}^{N} { \alpha_i }, \\ \sum_{i=1}^N \; y_i \alpha_i = 0. \quad (i = 1, \ldots , N)\end{equation}.$$
The solution $\alpha^*$ enables us to find the linear separation plane in the $Z$ space as:
$$\begin{equation} \begin{aligned}
\mathbf{w}^* &= \sum_{i=1}^N \; \alpha_i^* y_i \mathbf{z}_i \\
&= \sum_{i=1}^N \; \alpha_i^* y_i \phi(\mathbf{x}_i) \\
b^* &= y_s - \mathbf{w}^{*\intercal}\phi(\mathbf{x}_s) \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \phi(\mathbf{x}_i)^{\intercal}\phi(\mathbf{x}_s).
\end{aligned} \end{equation}$$
Then for a given new data point $\mathbf{x}_{new}$, our prediction will be based on:
$$\begin{equation} \begin{aligned}
y &= {\rm sign } ( \mathbf{w}^{*\intercal}\phi(\mathbf{x}_{new}) + b^*) \\
&= {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new}) + y_s - \sum_{i=1}^N \; y_i \alpha_i \phi(\mathbf{x}_i)^{ \intercal } \phi(\mathbf{x}_s)\right).
\end{aligned} \end{equation}$$
We basically project all $\mathbf{x}$ and $\mathbf{x}_{new}$ into the $Z$ space, then take dot products in the $Z$ space to obtain the classification prediction. In typical cases, if we need to project $X$ into a high-order polynomial space, the dimensionality of the $Z$ space grows very quickly, and computing coordinates and dot products in $Z$ becomes very expensive. However, the reason we listed the long Equations (14) and (16) is not the math itself, but a very important observation: both Equation (14) for $\alpha$ and the final classifier in Equation (16) depend on $\mathbf{x}$ only through dot products: $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j)$, $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new})$ and $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_s)$. If somehow we are able to efficiently compute dot products among all pairs of $\phi(\mathbf{x})$, we can limit our SVM algorithm to a running time of $O(N^2)$ kernel evaluations regardless of the dimensionality of $Z$. If we choose the transformation $\phi$ in such a special way that its dot product in the high-dimensional $Z$ space can be calculated by some function, called a "kernel function", in the lower-dimensional $X$ space, we are in business, i.e.,
$$\begin{equation} K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j) \end{equation}.$$
Be aware that $\phi(\cdot)$ is a vector, not a scalar. Once we have $K(\cdot, \cdot)$ defined, we can use these dot products to solve for $\alpha$ and calculate $y_{new}$ for $\mathbf{x}_{new}$ according to Equation (18) below, without ever leaving the $X$ space. If you insist on carrying out the $X$-to-$Z$ transform, solving for $\alpha$, and then taking dot products in the $Z$ space, you would end up with exactly the same result, just wasting more memory to hold the higher-dimensional coordinates and more CPU cycles to calculate the dot products.
$$\begin{equation}
y = {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_{new}) + y_s - \sum_{i=1}^N \; y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}_s)\right).
\end{equation}$$
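To make the kernel form of the classifier concrete, here is a minimal sketch (scikit-learn and its `rbf_kernel` helper assumed; the data are hypothetical) that re-computes the decision value of Equation (18) by hand, using only kernel evaluations against the support vectors, and checks it against the library's own result:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical two-class data.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=2)
gamma = 0.5

clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = X[:5]                                              # pretend these are new points
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)   # K(x_i, x_new)

# Equation (18): sum_i alpha_i y_i K(x_i, x_new) + b*; only support vectors contribute.
scores = clf.dual_coef_[0] @ K + clf.intercept_[0]
pred_manual = clf.classes_[(scores > 0).astype(int)]

print("manual Equation (18) predictions:", pred_manual)
print("library predictions             :", clf.predict(X_new))
print("max |score difference| vs decision_function:",
      np.max(np.abs(scores - clf.decision_function(X_new))))
```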
The kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ is the dot product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ in the $Z$ space, so the kernel function carries the meaning of a similarity between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, and it should be defined to be both symmetric and positive (semi-)definite. It is theoretically guaranteed that as long as $K$ is symmetric and positive (semi-)definite, there always exists a mapping $\phi: X \to Z$ whose dot product in $Z$ is exactly what $K$ computes. We will not prove this here; some mathematical genius has already proven it, so we can pretty much build kernels around the similarity metrics we are using, such as Pearson, Manhattan, Tanimoto, etc. Next we will use the Gaussian kernel to do a deep dive into how the kernel trick lets us introduce non-linearity while remaining computationally efficient.
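scikit-learn (assumed here) lets you pass such a kernel as a plain Python callable. The sketch below defines a Gaussian-like similarity by hand, which is known to be positive semi-definite, and hands it to the SVM unchanged; any home-made similarity should likewise be checked (or known) to be symmetric and positive semi-definite before being used this way.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

def my_kernel(A, B, gamma=0.5):
    """Hand-written RBF similarity K(a, b) = exp(-gamma * ||a - b||^2).

    Returns the full Gram matrix between the rows of A and the rows of B,
    which is what scikit-learn expects from a kernel callable.
    """
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Hypothetical data, for illustration only.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=3)

clf = SVC(kernel=my_kernel).fit(X, y)          # kernel passed as a callable
print("training accuracy with custom kernel:", clf.score(X, y))

# Sanity check: it matches the built-in RBF kernel with the same gamma.
ref = SVC(kernel="rbf", gamma=0.5).fit(X, y)
print("built-in RBF training accuracy      :", ref.score(X, y))
```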
RBF Kernel and One Kernel for All
A very popular kernel is the Gaussian kernel, or Radial Basis Function (RBF) kernel, where:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(- \gamma {\Vert \mathbf{x} - \mathbf{x}^\prime \Vert }^2) \\
&= \exp(- \gamma {\Vert\mathbf{x}\Vert}^2)\exp(- \gamma {\Vert\mathbf{x}^{\prime}\Vert}^2) \exp(2 \gamma \mathbf{x}^\intercal\mathbf{x}^\prime)
\end{aligned} \end{equation},$$
where $\gamma$ is a hyper-parameter determining how sensitive the similarity score is to the distance between two points. Let us actually work out the mapping function $\phi$, as it will help us understand this kernel inside out. Although the math below is a bit messy, the idea is simple, so do not let the face of the equations intimidate you.
First, let us just focus on the simple 1-$d$ case for $\mathbf{x}$, as the math quickly gets really complex as dimensionality increases, but we will
explain later why the conclusions obtained from 1-$d$ hold for higher dimensions.
It is not hard to guess that $\phi$ should be of the form $\exp(- \gamma x^2) \cdot \psi(x)$, so that the dot product between $\phi(x)$ and $\phi(x^\prime)$ contains the factor $\exp(-\gamma x^2)\exp(- \gamma x^{\prime 2})$. $\psi(x)$ then needs to be chosen so that the sum of $\psi(x) \cdot \psi(x^{\prime})$ across all $Z$ dimensions equals $\exp(2 \gamma xx^\prime)$. How?
Remember the Taylor expansion:
$$\begin{equation} e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \ldots + \frac{x^n}{n!} + \ldots \end{equation}$$
We can expand:
$$\begin{equation} \exp(2 \gamma xx^\prime) = 1 \cdot 1 + \frac{(\sqrt{2 \gamma}x)(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x)}^2{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \frac{{(\sqrt{2 \gamma}x)}^3 {(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}\sqrt{3!}} \ldots \end{equation}$$
Now we should be able to write out $\phi$ as:
$$\begin{equation} \phi(x) = \exp(-\gamma x^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x)}^3}{\sqrt{3!}}, \ldots \right] \\ \phi(x^\prime) = \exp(-\gamma {x^\prime}^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}}, \ldots \right] \end{equation}$$
Therefore, the dot product of $\phi(x)$ and $\phi(x^\prime)$ from Equation (22), using Equation (21), produces the same result as the kernel function! The $Z$ space here is actually an infinite-dimensional space; in this infinite space, if you diligently carry out the dot product between $\phi(x)$ and $\phi(x^\prime)$, you end up with $K(x, x^\prime)$, which can be easily computed as $\exp\left(-\gamma {(x-x^\prime)}^2\right)$. So the $Z$ space is only a hypothetical space in which no actual computation has to take place; we can accomplish everything in the $X$ space, thanks to the kernel function.
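We can verify Equations (21) and (22) numerically. Here is a small sketch (plain NumPy; the infinite expansion is truncated after a handful of terms, and the 1-$d$ inputs are hypothetical) showing that the truncated dot product already matches $\exp(-\gamma (x-x^\prime)^2)$ to several digits for inputs in $[0, 1)$:

```python
import numpy as np
from math import factorial

gamma, x, x_prime = 1.0, 0.7, 0.2      # hypothetical 1-d inputs in [0, 1)

def phi(x, gamma, n_terms=12):
    """Truncated feature map of Equation (22): exp(-gamma x^2) * (sqrt(2 gamma) x)^n / sqrt(n!)."""
    return np.array([np.exp(-gamma * x**2) * (np.sqrt(2 * gamma) * x) ** n / np.sqrt(factorial(n))
                     for n in range(n_terms)])

dot_in_Z = phi(x, gamma) @ phi(x_prime, gamma)   # explicit dot product in the (truncated) Z space
kernel   = np.exp(-gamma * (x - x_prime) ** 2)   # direct kernel evaluation in the X space

print("truncated phi(x) . phi(x'):", dot_in_Z)
print("RBF kernel K(x, x')       :", kernel)
```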
Without loss of generality, let us assume our $x$ is within $[0, 1)$ (we can always achieve this through data normalization during the pre-processing step). Most of the coordinates in our infinite-dimensional expansion then quickly approach zero, as $x^n$ converges to zero. So effectively we only have a few lower-order dimensions; in most of the higher-order dimensions, all data points have coordinates near zero and are therefore not informative. Since a linear model can automatically do feature selection by weighting, those dimensions will receive weights of approximately zero. So we are not really doing classification in an infinite-dimensional space; the effective VC dimension remains finite and SVM remains robust.
With the kernel function, solving the SVM problem in Equation (14) costs $O(N^2)$ kernel evaluations. Then, to calculate $y$ for a data point to be predicted, we use Equation (16), which depends only on the number of support vectors, at most $O(N)$. Therefore, the kernel keeps the SVM computation bounded by $N$, not by the dimensionality. The effective VC dimension is approximately proxied by the number of model parameters $\alpha$, i.e., the number of support vectors. No matter how many features $X$ contains, SVM remains computationally bounded.
Now let us look at two extreme cases of $\gamma$ to understand the power of RBF kernel.
If $\gamma \to 0$, the $Z$ space described by Equation (22) looks like:
$$\begin{equation} \exp(-\gamma x^2) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x, O(\gamma) \right] \end{equation}$$
We omit the higher dimensions, as their coordinates are close to zero due to the small $\gamma$. Well, this is simply the original linear space $X$. Therefore, the RBF kernel with a very small $\gamma$ degenerates into a linear transformation, i.e., it is basically a linear SVM model in this extreme case.
Now consider the other extreme, where $\gamma$ is large. Since the coordinate in each successive dimension changes by a factor of $\sqrt{2 \gamma}x/\sqrt{n}$, the coordinates first increase and then eventually converge to zero. The dimension where they peak depends on the value of $\gamma$: the larger $\gamma$, the more higher-order $x$ terms are included. That is, we can control the polynomial order of the model by tuning $\gamma$.
By tuning $\gamma$ from very small values to larger values, the RBF kernel adjusts the hypothesis set $\mathcal H$ from essentially linear models to higher-order polynomial models. Therefore, we say the RBF kernel covers a broad spectrum of models; it is indeed one kernel for all! For any given learning problem, we treat $\gamma$ as a regularization hyper-parameter and use CV to obtain its optimal value (Chapter 3), i.e., we tune our hypothesis space to identify the best model that contains enough complexity (polynomial order), yet is not too complex. The RBF kernel is so powerful that it should be considered the default kernel for your SVM classifier [3].
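A typical way to carry out this tuning (scikit-learn assumed; the grid values and the toy data below are arbitrary illustrations) is a cross-validated grid search over $\gamma$, usually jointly with the soft-margin penalty C:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

# A non-linear toy problem (hypothetical data).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

param_grid = {
    "gamma": np.logspace(-3, 2, 6),   # from nearly-linear to highly non-linear
    "C": np.logspace(-2, 3, 6),       # soft-margin penalty
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best hyper-parameters:", search.best_params_)
print("best CV accuracy     :", round(search.best_score_, 3))
```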
RBF Kernel for Higher-dimensional Input
Just to have peace of mind, let us examine the case where the input $\mathbf{x}$ is 2-$d$ and make sure everything we discussed for the RBF kernel above still holds true. Since the complexity of the math expressions grows exponentially, we will just do enough to convince you that it is indeed the case, without carrying out all the details.
Following the same spirit of decomposing the kernel function using the Taylor expansion, the kernel for 2-$d$ input can be written as:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(-\gamma ({(x_1-x_1^\prime)}^2 + {(x_2-x_2^\prime)}^2)) \\
&= \exp(-\gamma (x_1^2+x_2^2)) \cdot \exp(-\gamma (x_1^{\prime 2}+x_2^{\prime 2})) \cdot \exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime)
\end{aligned} \end{equation}$$
We have learned that the trick to constructing $\phi$ is to decouple $\mathbf{x}$ and $\mathbf{x^\prime}$ in the kernel function above; the first two factors have already taken care of themselves, so we need to rewrite the last two factors as a dot product.
$$\begin{equation}
\exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime) = \\
\left[ 1 + \frac{(\sqrt{2 \gamma}x_1)(\sqrt{2 \gamma}x_1^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_1)}^2{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right] \cdot
\left[ 1 + \frac{(\sqrt{2 \gamma}x_2)(\sqrt{2 \gamma}x_2^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_2)}^2{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right]
\\ \end{equation}$$
Now we are ready to figure out $\phi$ by cross-multiplying the two bracketed series into a sum of products:
$$\begin{equation}
\phi(\mathbf{x}) = \exp(-\gamma (x_1^2+x_2^2)) \cdot
\left[
1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1,
\frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2,
\frac{{(\sqrt{2 \gamma}x_1)}^2}{\sqrt{2!}},
2 \gamma x_1x_2,
\frac{{(\sqrt{2 \gamma}x_2)}^2}{\sqrt{2!}},
o(x_1^2 + x_1x_2 + x_2^2) \right]
\\ \phi(\mathbf{x}^\prime) = \exp(-\gamma ({x_1^\prime}^2+{x_2^\prime}^2)) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1^\prime, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2^\prime, \frac{{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}}, 2 \gamma x_1^\prime x_2^\prime, \frac{{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}}, o({x_1^\prime}^2 + x_1^\prime x_2^\prime + {x_2^\prime}^2) \right] \end{equation}$$
As $\gamma$ approaches zero, the key $Z$ dimensions are $\left\{ 1, \sqrt{2 \gamma}x_1, \sqrt{2 \gamma}x_2 \right\}$, which is the original linear space. As $\gamma$ gets larger, higher-order terms $x_1^2$, $x_2^2$, $x_1x_2$, and even higher-order polynomial terms come into play, and $\phi$ becomes a polynomial mapping; this is the same conclusion as in the 1-$d$ case. Also notice the polynomial terms such as $x_1x_2$, where features of the input $X$ space are mingled together in most of the $Z$ dimensions. If there is a noise feature in the input space $X$, it will contaminate many other features in the $Z$ space. A linear model is able to do interpretable feature weighting and feature selection; however, this now happens in the $Z$ space, so the weights cannot be interpreted in the original $X$ space. Therefore, feature selection should be done in the $X$ space before we apply the kernel SVM. Otherwise, one has to rely on iteratively removing one feature at a time within $X$, repeating the SVM process, and discarding the features whose removal does not degrade performance. This feature-selection process can be a bit computationally expensive.
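For completeness, here is a rough sketch of that iterative procedure (scikit-learn assumed; greedy backward elimination on hypothetical synthetic data; a real application would need a more careful protocol and a held-out test set):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Hypothetical data: 10 features, only 4 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

features = list(range(X.shape[1]))
baseline = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=5).mean()

improved = True
while improved and len(features) > 1:
    improved = False
    for f in list(features):
        trial = [g for g in features if g != f]
        score = cross_val_score(SVC(kernel="rbf", gamma="scale"), X[:, trial], y, cv=5).mean()
        if score >= baseline:            # dropping f does not hurt: remove it and restart
            features, baseline, improved = trial, score, True
            break

print("kept features:", features, " CV accuracy:", round(baseline, 3))
```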
To summarize, SVM with the RBF kernel is not only a very robust classifier with built-in generalization capability, but also a very powerful classifier, as its hyper-parameter $\gamma$ allows us to continuously cover linear models, higher-order polynomial models, and models in between. This is a beautiful ML technique.
Classification with Probability
One limitation of SVM is that it only predicts the label $y$ without providing a probability estimate. We saw in Chapter 5 that probability can be estimated using a linear model called the logistic regression (LR) model, which minimizes a cross-entropy error plus a regularization term. The disadvantage of LR is that its solution depends on all $N$ input data points, while SVM depends on only a few support vectors; therefore, SVM has a much better generalization property. Can we estimate class probability while still taking advantage of the support vectors?
Yes. The idea is to run SVM first, then take $d_i = \mathbf{w}^\intercal \phi(\mathbf{x}_i) + b$, i.e., the (signed, unnormalized) distance of $\mathbf{x}_i$ to the decision hyperplane in the $Z$ space, and train a one-dimensional LR model on the pairs $(d_i, y_i)$:
$$\begin{equation} f(+1 \vert d) = \frac{e^{\alpha d + \beta}}{1+e^{\alpha d + \beta}}. \end{equation}$$
There are some nuances in exactly how to compute $d_i$ [4], e.g., using cross-validation to obtain $d_i$ in an unbiased manner, but the main idea is as outlined above. For points within and near the margin, the probability varies from $\sim 0$ to $\sim 1$; for points further away from the decision boundary, the probability quickly goes to zero (for the -1 class) or one (for the +1 class).
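scikit-learn (assumed here) wraps this recipe, including the cross-validated fitting of the sigmoid, behind probability=True, which is essentially the method of [4]. The sketch below also spells the idea out by hand by fitting a one-dimensional logistic regression on the decision values $d_i$; the hand-rolled version fits the sigmoid on the training decision values and is therefore biased, which is exactly the nuance mentioned above.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)   # hypothetical data

# Built-in route: Platt-style calibration behind probability=True.
clf = SVC(kernel="rbf", gamma=1.0, probability=True, random_state=0).fit(X, y)
print("built-in P(+1|x) for 3 points   :", clf.predict_proba(X[:3])[:, 1])

# Hand-rolled illustration of Equation (27): a 1-d logistic regression on d_i.
svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)
d = svm.decision_function(X).reshape(-1, 1)      # d_i = w^T phi(x_i) + b
platt = LogisticRegression().fit(d, y)           # f(+1|d) = sigmoid(alpha*d + beta)
print("hand-rolled P(+1|x) for 3 points:",
      platt.predict_proba(svm.decision_function(X[:3]).reshape(-1, 1))[:, 1])
```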
It is also worth mentioning that although most literature states that multi-class classification problems can be solved by SVM using either one-vs-all SVMs (one per class) or one-vs-one SVMs (one per pair of classes), the problem can also be solved directly by minimizing a multi-class hinge loss in the original $X$ space (where the kernel trick cannot be used) [5]. This enables linear SVM to be used in deep learning as an alternative to the multi-class logistic regression classifier.
References
1. KDS, pages 173-175.
2. LFD, e-Chapter 8, pages 40-45.
3. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, page 4.
4. http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
5. http://cs231n.github.io/linear-classify (see multi-class support vector machine loss)