Saturday, May 27, 2017

Chapter 6. The Path Forward

Review


To learn from data, a machine explores a set of hypotheses $\mathcal{H}$ provided by us and identifies the best hypothesis $g$ based on the training data set $\mathcal{D}$, which contains $N$ data records.  The upper confidence bound on the out-of-sample performance of the solution, $E_{out}$, differs from its within-sample performance $E_{in}$ (Figure 6.1):
$$\begin{equation} \begin{aligned}
E_{out} &= E_{in} + E_{reg} \\ &\approx E_{in} + \epsilon\left(\sqrt{\frac{\ln M}{N}} \right), \\
&\approx  E_{in} + E_{reg}(\lambda).
\end{aligned} \end{equation}$$
where $M$ is the effective number of independent hypotheses and $\lambda$ is employed to adjust this number.

Figure 6.1.  What we care about is $E_{out}$, i.e., the upper bound of the true performance at a high confidence level $1-\delta$, where $\delta$ is usually some small value.  To minimize $E_{out}$, we should minimize not only $E_{in}$ but also the confidence interval $\epsilon$, which depends on $N$ and $M$.  Increasing the sample size $N$ reduces $\epsilon$; shrinking the hypothesis space, i.e., reducing $d_{vc}$ and the effective hypothesis count $M$, also reduces $\epsilon$.  As the functional form of $\epsilon$ is generally unknown, we typically model $E_{reg}$ using a penalty term that continuously adjusts the effective size of our hypothesis space.

To minimize $E_{in}$, we need to employ sophisticated models that are close to the target function $f$, which enlarges the hypothesis search space.  To minimize $E_{reg}$, we need to either reduce the effective model count $M$ or increase the sample size $N$, while the latter is often not an option.  From the Bayesian viewpoint, $E_{in}$ is the evidence carried by the data set, $E_{reg}$ is our prior knowledge, and $\lambda$ tunes our stubbornness in learning from data.

To balance the two opposing requirements on model complexity, we modulate $M$ by adjusting a hyper-parameter $\lambda$.  The optimal value of $\lambda$, $\lambda^*$, can be determined empirically by a cross-validation protocol, where the estimated $E_{out}$ score is minimized.  $\lambda^*$ subsequently determines the model $g$, which is expected to achieve the optimal $E_{out}^*$.  This is the core principle of the learning theory introduced in Chapters 2 & 3.
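
To make this concrete, here is a minimal sketch of the cross-validation recipe for picking $\lambda^*$, using scikit-learn's ridge regression as a stand-in model on synthetic data; the $\lambda$ grid (called alpha in scikit-learn), the data, and the scoring choice are illustrative assumptions, not part of the original text.

```python
# A minimal sketch of picking lambda* by cross validation.
# The Ridge model, the synthetic data, and the lambda grid are
# illustrative stand-ins, not the specific model from the text.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                                 # N = 100 records, 10 features
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(100)       # noisy linear target

lambdas = np.logspace(-3, 3, 13)                       # candidate lambda values
search = GridSearchCV(Ridge(), {"alpha": lambdas},
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print("lambda* =", search.best_params_["alpha"])
print("estimated E_out (CV mean squared error) =", -search.best_score_)
```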

Standing on top of this learning theory, we showed how linear and non-linear models can be applied to both classification and regression problems, relying on a hyper-parameter $\gamma$ to modulate the complexity of the hypothesis space.  The resultant model space can be as simple as containing only linear models or as sophisticated as including higher-order polynomial models.  SVM relies on only a few support vectors and therefore has far fewer effective model parameters than other models, which gives it extremely robust generalization properties.

Bayesian and Graphical Models


We have covered a wide range of learning theory for ML in this note; nevertheless, there is an enormous amount of ML knowledge yet to be studied on our own.  For instance, we have not covered ensemble methods, where multiple ML models are pooled together to form a decision committee.  There are examples where a high-quality committee can be formed from rather crude individual models.  For this reason, people generally do not compare non-ensemble methods to ensemble methods directly [1].  We should read about additional ML topics, and one good resource is the excellent Machine Learning Techniques lecture [2] (in Chinese though).
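
As a small illustration of the committee idea, the sketch below pools three crude scikit-learn models with a majority vote on synthetic data; the member models and the data set are arbitrary choices for demonstration.

```python
# A sketch of a decision committee: pooling several crude models
# often beats each individual member.  Models and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

members = [("lr", LogisticRegression(max_iter=1000)),
           ("nb", GaussianNB()),
           ("tree", DecisionTreeClassifier(max_depth=3, random_state=0))]
committee = VotingClassifier(members, voting="hard")   # majority vote

for name, model in members + [("committee", committee)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:10s} CV accuracy = {score:.3f}")
```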

We also did not cover graphical models, or cases where the input is not a well-structured feature matrix $X$.  For example, when ${\rm Xbox\ Kinect}^{TM}$ classifies a player’s movement into either kick or jump, the input is a series of images (or poses), with a varying number of frames serving as the input.  Using a hidden Markov model to describe the transition between poses is more effective [3].  For problems based on complex inputs such as computer vision, voice recognition, and natural language processing, techniques such as SVM no longer work well.  So one future direction is to tackle the learning problem from a Bayesian or graphical-model approach.  In the regularization discussion, we mentioned how the Bayesian approach sidesteps the over-fitting topic and replaces it with a prior belief: a ridge regularization is basically a Gaussian prior and a Lasso regularization is basically a Laplace prior.  To proceed along this direction of Bayesian learning, a good book is "Pattern Recognition and Machine Learning" by Christopher Bishop.  To study graphical models, a good but challenging online course is the “Probabilistic Graphical Models” Coursera series by Daphne Koller (although I do not recommend the accompanying book; use Bishop's book instead).

Figure 6.2.  An action consists of a series of states described by a hidden Markov model (HMM).  In this example, a low-kick action consists of three state transitions: from stand to foot-lifted, to kick, and then back to stand.  At each time point, the body position $X(t)$ is observed as a sample from its underlying state $S(t)$.  The goal is to predict the action label $A$ (low-kick, high-kick, sit, jump, etc.) based on a time series of observations $X(t)$.  One would need to maximize the likelihood: $p(X|A=k) = \sum_{S_k(t_0), S_k(t_1), \ldots} \; \left\{ p(S_k(t_0))\,p(X(t_0)|S_k(t_0)) \prod_{i=1}^{N} \; p(S_k(t_i)|S_k(t_{i-1}))\,p(X(t_i)|S_k(t_i)) \right\}$.
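
The sum over hidden state paths in the caption above is exactly what the forward algorithm computes efficiently.  Below is a small numpy sketch for one action model; the three states (stand, foot-lifted, kick), the discrete pose symbols, and all probabilities are made up for illustration.

```python
# A sketch of evaluating p(X | A=k) for one action model with the
# forward algorithm, which sums over all hidden state paths
# S_k(t_0), S_k(t_1), ... without enumerating them.
import numpy as np

pi = np.array([0.9, 0.1, 0.0])              # p(S(t_0)): start mostly in "stand"
A = np.array([[0.6, 0.4, 0.0],              # p(S(t_i) | S(t_{i-1})), rows sum to 1
              [0.1, 0.5, 0.4],
              [0.5, 0.0, 0.5]])
B = np.array([[0.8, 0.1, 0.1],              # p(X(t) | S(t)) over 3 discrete poses
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

observations = [0, 1, 2, 0]                 # observed pose indices X(t_0), ..., X(t_3)

alpha = pi * B[:, observations[0]]          # forward variable at t_0
for obs in observations[1:]:
    alpha = (alpha @ A) * B[:, obs]         # recursion: sum over the previous state

print("p(X | A=k) =", alpha.sum())          # likelihood under this action's HMM
```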


Deep Learning for Feature Learning


The learning theory above also heavily depends on the quality of the features in the $X$ matrix.  Oftentimes, informative features are very hard to obtain.  In cellular imaging, one typically has to carry out a challenging segmentation task (a computer vision task) to identify individual cells, then apply programs such as CellProfiler [4] to compute several hundred features for each cell.  These are custom-designed features describing the size, intensity, texture, and morphology of cells.  Some nice image features, such as the Scale-Invariant Feature Transform (SIFT) [5], required decades of hard research and may still not be very effective.  Having machines do feature learning, i.e., automatically constructing informative feature vectors, is therefore a very useful area of research.  Feature learning can be framed as the machine finding a mapping function $f: X \to Z$.  In this regard, Deep Learning (DL) has a great advantage; in fact, some experts consider DL equivalent to feature learning, because its Convolutional Neural Network (CNN) is particularly adept at extracting a hierarchical set of features ranging from rudimentary edges and corners to complex elements like wheels and faces.  So another future path is Deep Learning.  To that end, a good book is the DLB, and a good course is probably “Neural Networks for Machine Learning” by the DL godfather Geoffrey Hinton on Coursera.
Figure 6.3. Deep Learning can automatically learn features, without experts going through a tedious manual feature construction process.  Here a DL network extracts lower-level features such as edges and corners, then gradually builds up higher-level features representing eyes, noses, and finally faces. [https://s3.amazonaws.com/datarobotblog/images/deepLearningIntro/013.png]
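
As a toy illustration of this hierarchy, here is a minimal PyTorch sketch in which stacked convolution layers turn a raw image into a learned feature vector; the architecture, layer sizes, and random input are arbitrary assumptions, not a model from the text.

```python
# A minimal sketch of CNN feature learning: early convolution layers tend
# to respond to edges/corners, later layers to more complex patterns.
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # low-level features
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # mid-level features
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # higher-level features
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                   # 64-d learned feature vector
)

x = torch.randn(1, 3, 32, 32)      # a fake 32x32 RGB image
z = features(x)                    # the machine-learned mapping f: X -> Z
print(z.shape)                     # torch.Size([1, 64])
```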

Data Engineering: Big Data and Cloud


The learning theory we covered here obviously has a very strong emphasis on over-fitting: “the ability to deal with over-fitting is what separates professionals from amateurs in the field of learning from data,” according to LFD [6].  Based on the learning curve and VC theory, over-fitting is no longer a concern if one has lots of data.  When $N$ is 100 or 1000 times larger than the number of model parameters, $E_{in}$ and $E_{out}$ will be very close.  However, keep in mind that although we live in a big data era, only some problems have the luxury of big training data.  The majority of the problems we deal with as data scientists outside the Internet-data companies (such as Facebook, LinkedIn, Google, Amazon, etc.) are still very much under the constraint of limited $N$, and all the theories we have discussed in this note are crucial.  The word “big” in big data can be very misleading.  In biology, next-generation sequencing (NGS) data can be a few GB per sample, but it is not big due to its lack of complexity.  Such a data file can be post-processed into sequence counts for the twenty thousand human genes.  If we treat each NGS sample as a record and each of the twenty thousand genes as a feature, one thousand NGS samples (terabytes of raw data) still do not qualify as a big data problem.  You can image a cat at enormous resolution and thereby generate a huge picture file, but the size of the file does not make this big data.  Big data is not just about the volume of data.

Nevertheless, let us pretend for a moment that we are data scientists at Google trying to solve a real big data problem with basically infinite $N$ in our hands; our worry is no longer over-fitting.  We have to worry about how to store our training data (a big data storage problem), how to carry out the computation to find the optimal hypothesis in $\mathcal H$ (a big data computation problem), and how to retrieve particular future records efficiently for prediction (a big data transaction problem).  ML then becomes a real engineering challenge, and we start to rely more and more on data engineering technologies and less and less on data science knowledge.  Infinite $N$ means the variance term in $E_{out}$ is no longer the issue; the performance bottleneck is the bias term.  To reduce bias one should introduce more sophisticated models.  This is why DL, with tens of thousands, millions, or billions of model parameters, is only possible when, first, there is a nearly infinite amount of training data and, second, there is a vast computational infrastructure, enabled by both big data technology and GPU hardware.  Therefore, we see that big data has strong ties to Deep Learning and to data engineering.  Please bear in mind that when DL is applied to small data sets, jargon such as regularization and over-fitting is mentioned again and again in the DLB; so all we have learned in this note will not be wasted in your DL journey.

Data engineers study the three big data problems raised above: storage, computation, and transaction.  Due to the engineering nature of these challenges, the result is not a set of new theories but a set of technical solutions, or more precisely, many alternative technical systems, where the industry standard is practically determined by whichever has the dominant user base.  The answer to big data storage originated from the Hadoop Distributed File System (HDFS), an open-source implementation of the Google File System (GFS).  The answer to big data computation was also first proposed by Google: its MapReduce (MR) framework.  The answer to big data transaction is the numerous NoSQL systems (Google's Bigtable is a pioneer).  In the early days, one had to establish a local big data infrastructure using these components, most of which fortunately are open source and free.  Nowadays, we are more likely to focus on their utility and let cloud providers such as Amazon Web Services (AWS) take care of setting up these infrastructures for us.  In this respect, big data and the Cloud do go hand-in-hand.  Amazon’s solution to big storage is S3, to big computation is EC2 clusters or EMR, to big transaction is DynamoDB, etc.  So our third future development path is data engineering/big data engineering/cloud engineering; I consider them near synonyms due to their tight connections.  They emphasize programming and automation, compared to data science’s emphasis on statistical skill sets.  In reality, people tend to confuse data science with data engineering; hopefully there will be more clarity as the fields develop further.
Figure 6.4.  As $N$ grows to $\infty$, over-fitting is no longer a concern and $E_{out}$ basically equals $E_{in}$.  Our focus then switches from the science topics to practical engineering challenges: big data storage, computation, and transaction.  Currently Amazon Web Services (AWS) is the most popular big-data platform in the Cloud, so we do not have to set up such an environment on our own.

Summary


In this note, we focused on how a machine can select the best hypothesis from a hypothesis set $\mathcal H$, given a limited number $N$ of input training records $X$.  We need to take into account both the within-sample error $E_{in}$ and the generalization error (determined by the VC dimension) to avoid under-fitting and, more importantly, over-fitting.  An effective way to determine the optimal balance of the two error terms is hyper-parameter selection based on cross validation.  Support Vector Machines are a huge success of machine learning theory, combining all these concepts.  The Gaussian kernel provides a continuous hypothesis space ranging from linear to high-order polynomial hypotheses, while still preserving decent generalization capability.  Once we have a solid theoretical foundation, we are ready to understand many machine learning techniques.  An excellent online course is “Machine Learning Techniques” on YouTube [2].


Reference

  1. DLB, Page 251: Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.
  2. Machine Learning Techniques video lectures (in Chinese) by Hsuan-Tien Lin.
    https://www.youtube.com/watch?v=A-GxGCCAIrg&index=1&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
  3. Probabilistic Graphical Models, Coursera online course by Daphne Koller.  Part III. PGM Programming Assignment: Learning with Incomplete Data.
  4. http://cellprofiler.org
  5. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  6. LFD, page 119

Friday, May 26, 2017

Chapter 5. Support Vector Machines

We talked about VC dimension theory and explained how to improve the regularization property by reducing the hypothesis space $\mathcal{H}$.  We also talked about linear models and how to take advantage of the linear model framework to learn with non-linear models.  Building on the foundation of all the theories we have covered so far, a very impressive machine learning technique called Support Vector Machines (SVM) was invented.  For problems of limited data set size, SVM has routinely been shown to outperform other learning algorithms, so if you can only choose one machine learning model, you generally will not regret going with SVM.  Only in recent years has SVM been surpassed by deep learning models, and only when large data sets are available.  As this note only considers small data sets, SVM, for us, is the king of the castle.

Maximum Margin Linear Classifier


SVM has a built-in regularization property!  It relies on the intuitive idea of a maximum classification margin.  Let us only consider the case where the $N$ input data points in $\mathcal{D}$ are linearly separable, i.e., there exists a hyperplane that can classify all $N$ points without error.  Although there are many solutions that fit the bill (Figure 5.1a), SVM says we would like to go with the one that has the maximum margin; we say it is the fattest decision plane, with two margin planes called boundaries (Figure 5.1b), as this one is most robust against measurement uncertainty.  Why is that?  Each data point $\mathbf{x}$ in the real physical world carries some degree of uncertainty.  When we measure the height of a person, uncertainty comes from the tool we used for the measurement, from the height itself varying during different times of the day, from whether the measuring protocol has the person's shoes on or off, etc.  So it is best to assume each $\mathbf{x}$ is only an estimate of its true value; in other words, its true value lies within a halo around $\mathbf{x}$.  If we let the halo expand, eventually the true value will cross the linear separating plane, which means the point may be misclassified.  The plane with the maximum margin can handle the largest halo size and therefore accommodates the largest degree of uncertainty in $\mathbf{x}$.  Figure 5.1c & d compare the error tolerance of a small-margin classifier and the maximum-margin classifier; the latter is more robust against error in $\mathbf{x}$.
Figure 5.1. The blue decision plane is the fattest one we can find for this example (a & b).  It can tolerate the largest amount of uncertainty in input data points $\mathbf{x}$ (c & d).

The maximum margin concept is also supported by the VC theory.  Given $N$ data points, suppose we try to place a separating plane with a certain margin size.  We can see in Figure 5.2 that the larger the margin, the smaller the possible space the separator can occupy, i.e., the smaller the hypothesis space, and therefore the smaller the effective VC dimension.  If a few margins are sorted in ascending order, $m_1 \subset m_2 \subset \ldots \subset m_n$, then their VC dimensions satisfy $d_1^{vc} \ge d_2^{vc} \ge \ldots \ge d_n^{vc}$ [1].  Among all the linear separating planes, the one with the maximum margin has the best regularization property.
Figure 5.2. A smaller-margin decision plane, outlined in green, can occupy a larger hypothesis space, compared to the maximum-margin decision plane outlined in blue.


SVM Engine


We here only consider the case where $N$ data points are linearly separable.  A linear separating plane is the model:
$$\begin{equation} y = {\rm sign}(\mathbf{w} \cdot \mathbf{x} + b). \end{equation}$$
For $\mathbf{x}$ exactly on the separating plane, $\mathbf{w} \cdot \mathbf{x} + b = 0$.  For other points, $y$ is either $+1$ or $-1$.  Notice that here $\mathbf{x}$ does not contain the constant element 1 in its first dimension; we explicitly write $w_0$ as $b$, so the math is simpler this way.  If our boundaries give the maximum margin, they must touch at least two data points, one on each side; otherwise, the margin cannot be the maximum.  Since we can always scale $b$ and $\mathbf{w}$ by a positive factor without changing the classification results given by Equation (1), we can choose the scaling factor such that $\mathbf{w} \cdot \mathbf{x} + b = -1$ and $+1$ for the data points touching the two boundaries.

In Figure 5.3, $\mathbf{x}^+$ is on the $+1$ boundary plane and $\mathbf{x}^-$ is on the $-1$ boundary plane, so we have:

$$\begin{equation} \left\{ \begin{aligned} \mathbf{w} \cdot \mathbf{x}^+ + b & = 1, \\ \mathbf{w} \cdot \mathbf{x}^- + b & = -1. \end{aligned} \right. \end{equation}$$
Figure 5.3.  Illustration on how to calculate the size of a margin.

The thickness of the margin equals the projection of $\mathbf{x}^+ - \mathbf{x}^-$ (Figure 5.3, green arrow) along the normal direction of the planes ($\mathbf{w}$ in Figure 5.3):
$$\begin{equation} \begin{aligned} {\rm margin} &= (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\Vert{\mathbf{w}}\Vert} \\ &= \frac{2}{\Vert{\mathbf{w}}\Vert} \end{aligned} \end{equation}. $$
The second step uses Equation (2).  Therefore, maximizing the margin is equivalent to minimizing $\mathbf{w}^\intercal\mathbf{w}$.   Recall that in linear models, $\mathbf{w}^\intercal\mathbf{w}$ has been used as a regularization term to describe the complexity of the linear hypothesis (with $w_0$ already excluded); therefore, we say SVM has built-in regularization, as it prefers simpler hypotheses.  Be aware that having anti-over-fitting built in does not mean SVM is free from over-fitting; it will over-fit if we run a 100-dimensional SVM on 10 data points!  It is just that SVM is over-fitting-aware and much more robust against over-fitting than other ML techniques.

To solve the maximum-margin decision plane problem is to solve:
$$\begin{equation} \min_\mathbf{w} \; \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} \end{equation},$$
with the constraints:
$$\begin{equation} y_i (\mathbf{w}^\intercal\mathbf{x}_i + b) \ge 1. \quad i = 1, 2, \ldots, N \end{equation}$$
We normally minimize $E_{out}$; however, Equation (4) only contains the regularization term, so where is $E_{in}$?  Here we consider a linearly separable case; therefore, we are only interested in solutions where all training data points are correctly classified, i.e., $E_{in} = 0$!  How do we make sure we only consider $\mathbf{w}$ and $b$ where all data points are correctly classified?  The answer is Equation (5), as the constraints mean all data points are on the correct side of the margin and all on or outside the boundaries.  The factor $\frac{1}{2}$ is just for convenience.  In summary, Equation (5) makes sure we only consider decision hyperplanes where all training data are on the correct side, and Equation (4) makes sure we choose, among such candidates, the one that maximizes the margin.

The above constrained problem can be solved by introducing Lagrangian multipliers, i.e., the problem is equivalent to:
$$\begin{equation} \min_{\mathbf{w}, b} \; \max_{\alpha_i \ge 0} \;  \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} + \sum_{i=1}^{N} { \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b)}.  \end{equation}$$
Let us try to understand why the Lagrangian multiplier method works.  Suppose we consider an invalid solution candidate $(\mathbf{w}, b)$, i.e., at least one data point, say $i$, falls inside the two boundaries, so $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \gt 0$; then $\alpha_i$ would go to $+\infty$ in order to maximize the new function.  The function value would then be $+\infty$ for such an invalid $(\mathbf{w}, b)$, so invalid solution candidates have no chance to win, as other solutions give finite function values; invalid solutions are thus naturally excluded from the solution pool.  So the basic idea is to cast our original function into a new form, so that any $(\mathbf{w}, b)$ violating the constraints has zero chance of being the final solution (Figure 5.4).
Figure 5.4.  The basic idea of the Lagrangian multiplier is to convert the original problem, minimizing the function on the left with a constraint $\mathbf{x} \in [lb, ub]$, into an unconstrained minimization problem for a new function on the right.  The introduction of $\alpha$ helps construct a new function whose value is $+\infty$ everywhere $\mathbf{x}$ is outside the constrained region; therefore, there is no chance for a bad $\mathbf{x}$ to be the final solution.  In our example, the new function lives in a higher-dimensional space, including the additional $\alpha$ dimensions.

The winning solution to Equation (6) naturally satisfies the following constraint:
$$\begin{equation} \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) = 0. \end{equation}$$
Why is that?  We already proved that the solution to Equation (6) cannot violate Equation (5), so data points can never fall within the boundaries of our solution.  For data points outside the boundaries, $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \lt 0$; therefore, any $\alpha_i \gt 0$ would decrease the value of the function in Equation (6), so $\alpha_i = 0$ is the best-case scenario for the $\max_{\alpha_i}$ operator.  Therefore, in our final solution, only those data points falling exactly on the decision boundaries can have non-zero $\alpha_i$ values; they are the ones supporting the decision plane and are therefore called support vectors.  All other, non-support-vector data points must have $\alpha_i = 0$.

To solve Equation (6), mathematicians have proven that we can swap the order of min and max; therefore, we take partial derivatives with respect to $b$ and $\mathbf{w}$ and set them to zero:
$$\begin{equation} \sum_{i=1}^N \; \alpha_i y_i = 0. \end{equation} $$
$$\begin{equation} \mathbf{w} = \sum_{i=1}^N \; \alpha_i y_i \mathbf{x}_i \end{equation}$$
Equation (8) implies that each positively-labelled support vector contributes a positive force $\alpha_i$ and each negatively-labelled support vector contributes a negative force; all forces add to zero and are in balance.  So there must be at least two support vectors, one positive and one negative.  Equation (9) implies the optimal normal vector $\mathbf{w}^*$ is basically a linear combination of the $N$ data vectors, which we already saw for linear models in Chapter 4, Equation (18).  The difference is we only consider the contributions from the support vectors, as all other points have $\alpha_i$ equal to zero.

Plugging Equations (8) and (9) back into Equation (6), we obtain a constrained minimization problem for the variables $\alpha_i$:
$$\begin{equation} \min_{\alpha_i \ge 0} \;  \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \mathbf{x}_i^\intercal\mathbf{x}_j - \sum_{i=1}^{N} { \alpha_i }, \\   \sum_{i=1}^N \; y_i \alpha_i = 0. \quad (i = 1, \ldots , N)\end{equation}$$
We will not go further into the math of how to solve for the exact values of the optimal solution $\alpha_i^*$; that is a Quadratic Programming (QP) problem already solved by great mathematical minds.  What is important is that our analysis above shows the solution of the maximum margin problem must rely on the identification of a few support vectors among the $N$ input data points.  These support vectors determine the separation boundary, as only they have non-zero $\alpha_i^*$; all other data points have no contribution to the solution.  We use Equation (10) to find $\alpha_i^*$ and Equation (9) to solve for $\mathbf{w}^*$.  $b^*$ can be calculated using any support vector on the boundaries, say point $\mathbf{x}_s$, as:
$$\begin{equation} \begin{aligned} b^* &= y_s - \mathbf{w}^{*\intercal}\mathbf{x}_s \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \mathbf{x}_i^{\intercal}\mathbf{x}_s. \end{aligned} \end{equation} $$
The second step comes from Equation (9); $y_s$ is either $+1$ or $-1$, depending on which boundary $\mathbf{x}_s$ lies on.
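
As a sanity check of Equations (9)-(11), here is a small sketch using scikit-learn's SVC with a very large penalty C to approximate the hard-margin problem on a synthetic, linearly separable toy set; the data and the choice of C are illustrative.

```python
# A sketch of the hard-margin SVM on a linearly separable toy set,
# approximated with scikit-learn's SVC and a very large penalty C.
# We recover w* and b* from the dual coefficients (Equations (9) and (11)).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=6)
y = 2 * y - 1                                  # relabel to {-1, +1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # huge C ~ hard margin

# dual_coef_ stores alpha_i * y_i, for the support vectors only.
alpha_times_y = clf.dual_coef_.ravel()
sv = clf.support_vectors_
w_star = alpha_times_y @ sv                    # Equation (9)
x_s, y_s = sv[0], y[clf.support_[0]]           # any support vector on a boundary
b_star = y_s - w_star @ x_s                    # Equation (11)

print("support vectors:", len(sv), "out of", len(X))
print("w* =", w_star, " vs sklearn coef_:", clf.coef_.ravel())
print("b* =", b_star, " vs sklearn intercept_:", clf.intercept_[0])
```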

Cross Validation of SVM


The observation that the solution of SVM only depends on a few support vectors leads to a very interesting result regarding its accuracy.  Let us estimate the accuracy of SVM using LOOCV: each time we remove one data point and use it as the validation set.  If there are $k$ support vectors, then as long as the removed data point $\mathbf{x}_i$ is not a support vector, i.e., $\alpha_i = 0$ in the solution of the original problem, the solution will not change.  We can also see this by observing Equation (10): the solution $\alpha$ of the original problem must lie in the solution space of the new problem with point $\mathbf{x}_i$ removed.  Conversely, if the solution of the new problem is not the solution of the original problem, $\mathbf{x}_i$ must fall within the new decision boundary, so $\mathbf{x}_i$ should have been a support vector of the old problem (the exact proof is left to be worked out).  Therefore, the optimal solution $\alpha^*$ of the original problem must also be the optimal solution when a non-support-vector data point $\mathbf{x}_i$ is omitted.  Since we are dealing with the separable case, $y_i$ is already correctly predicted by our decision plane, so the validation error for such cases is zero.  Only when the validation point is a support vector might the decision plane change and the validation point be misclassified.  Therefore, the LOOCV error for a separable SVM model is bounded by:
$$\begin{equation} E_{cv} \le \frac{k}{N}. \end{equation}$$
In most problems, the number of support vectors $k \ll N$, so the performance of SVM is generally very impressive, even for high-dimensional data inputs.  Even in high dimensions, SVM minimizes $\Vert \mathbf w \Vert$ and therefore constrains the hypothesis search space to a small region; the resultant number of support vectors $k$ only grows slowly as the dimension increases, so SVM is generally quite robust.
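
A quick empirical check of the bound in Equation (12), again with a large C standing in for the hard margin on a synthetic separable toy set; the data are illustrative.

```python
# Checking the LOOCV bound E_cv <= k/N on a separable toy set.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.6, random_state=6)

clf = SVC(kernel="linear", C=1e6)              # huge C ~ hard margin
k = clf.fit(X, y).n_support_.sum()             # number of support vectors
e_cv = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"bound k/N = {k / len(X):.3f},  actual LOOCV error = {e_cv:.3f}")
```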

The linearly separable case is called hard-margin SVM.  In the most general case, data points are not linearly separable, and we have to allow points to fall into the sandwiched margin space; this is called the soft-margin SVM problem.  The scoring function includes a penalty term for these points.  We will not go into the details, as the logic and the solution form are quite similar to the hard-margin SVM problem, except there is a penalty hyper-parameter.  All points falling inside the margins become bounded support vectors, and they contribute equally, with the same $\alpha$ value.  Points on the boundaries are still margin support vectors that contribute varying $\alpha$ values.  Points beyond the margins have zero contribution.  In real applications, we always use the soft-margin SVM, never the hard-margin SVM.  Although we do not discuss soft-margin SVM in detail, because all the conclusions on hard-margin SVM apply to it as well (it is not clear whether the LOOCV result applies though), it is an important topic we should study, for example using LFD supplementary Chapter 8 [2].

Non-linear SVM and Kernel


Following the same non-linear idea outlined in Chapter 4, we can transform the $X$ space into the $Z$ space using a non-linear transformation:
$$\begin{equation} \mathbf{z} = \phi(\mathbf{x}) \end{equation}.$$
Then in the $Z$ space, we assume a maximum margin linear model would work well, as all non-linearity has been absorbed by the $\phi(\mathbf{x})$ transformation function.

We then run the SVM solution in $Z$ space and obtain our solution for the problem:
$$\begin{equation}
\min_{\alpha_i \ge 0} \;  \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \phi(\mathbf{x}_i)^\intercal\phi(\mathbf{x}_j) - \sum_{i=1}^{N} { \alpha_i }, \\  \sum_{i=1}^N \; y_i \alpha_i = 0. \quad (i = 1, \ldots , N)\end{equation}.$$
The solution $\alpha^*$ enables us to find the linear separation plane in the $Z$ space as:
$$\begin{equation} \begin{aligned}
\mathbf{w}^* &= \sum_{i=1}^N \; \alpha_i^* y_i \mathbf{z}_i \\
&= \sum_{i=1}^N \; \alpha_i^* y_i \phi(\mathbf{x}_i)  \\
b^* &= y_s - \mathbf{w}^{*\intercal}\phi(\mathbf{x}_s) \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \phi(\mathbf{x}_i)^{\intercal}\phi(\mathbf{x}_s).
\end{aligned} \end{equation}$$
Then for a given new data point $\mathbf{x}_{new}$, our prediction will be based on:
$$\begin{equation} \begin{aligned}
y &= {\rm sign } ( \mathbf{w}^{*\intercal}\phi(\mathbf{x}_{new}) + b^*) \\
&= {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new}) + y_s \; - \sum_{i=1}^N \; y_i \alpha_i \phi(\mathbf{x}_i)^{ \intercal } \phi(\mathbf{x}_s)\right).
\end{aligned} \end{equation}$$
We basically project all $\mathbf{x}$ and $\mathbf{x}_{new}$ into the $Z$ space, then do the dot products in the $Z$ space to obtain the classification prediction.  In typical cases, if we need to project $X$ to a high-order polynomial space, the dimensionality of the $Z$ space grows very quickly.  Computing coordinates and dot products in the $Z$ space is then very expensive.  However, the reason we listed the long Equations (14) and (16) is not the math itself, but rather a very important observation: both Equation (14) for $\alpha$ and the final classifier in Equation (16) depend on $\mathbf{x}$ only through dot products: $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j)$, $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new})$ and $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_s)$.  If somehow we are able to efficiently compute dot products among all pairs of $\phi(\mathbf{x})$, we can limit our SVM algorithm to a running time of $O(N^2)$ regardless of the dimensionality of $Z$.   If we choose the transformation $\phi$ in such a special way that its dot product in the high-dimensional $Z$ space can be calculated by some function, called a "kernel function", in the lower-dimensional $X$ space, we are in business, i.e.,
$$\begin{equation} K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j) \end{equation}.$$
Be aware that $\phi(\cdot)$ is a vector, not a scalar.  Once we have $K(\cdot, \cdot)$ defined, we can use the dot products to solve for $\alpha$ and calculate $y_{new}$ for $\mathbf{x}_{new}$ according to Equation (18) below, without leaving the $X$ space.  If you insist on carrying out the $X$ to $Z$ transformation, solving for $\alpha$, and then doing dot products in the $Z$ space, you would end up with exactly the same results, just wasting more computer memory to hold the higher-dimensional coordinates and more CPU cycles to calculate the dot products.
$$\begin{equation}
y = {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_{new}) + y_s \; - \sum_{i=1}^N \; y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}_s)\right).
\end{equation}$$
The kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ is the dot product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ in the $Z$ space, so the kernel function carries the meaning of the similarity between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, and it should be symmetric and positive definite.  It is theoretically guaranteed that as long as $K$ is symmetric and positive definite, there always exists a mapping $\phi: X \to Z$ where the dot product in $Z$ is the same as what is computed from $K$.  We do not know how to prove this, just that some math genius has proven it, so we can pretty much turn any similarity metric we are using, such as Pearson, Manhattan or Tanimoto, into a kernel, as long as it satisfies these conditions.  We will use the Gaussian kernel next to do a deep dive into how kernel tricks enable us to introduce non-linearity while remaining computationally efficient.
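
To make the point that only the kernel values matter, the sketch below feeds a precomputed Gram matrix to scikit-learn's SVC and checks that it predicts the same labels as the built-in RBF kernel; the data set and the $\gamma$ value are arbitrary.

```python
# The SVM only ever needs K(x_i, x_j), never Z-space coordinates:
# a precomputed Gram matrix reproduces the built-in RBF kernel's predictions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)
gamma = 1.0

built_in = SVC(kernel="rbf", gamma=gamma).fit(X, y)

K_train = rbf_kernel(X, X, gamma=gamma)            # N x N Gram matrix
precomputed = SVC(kernel="precomputed").fit(K_train, y)

X_new = np.array([[0.0, 0.5], [1.0, -0.5]])
K_new = rbf_kernel(X_new, X, gamma=gamma)          # kernel against training points
print(built_in.predict(X_new), precomputed.predict(K_new))   # identical labels
```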

RBF Kernel and One Kernel for All


A very popular kernel is the Gaussian kernel, or Radial Basis Function (RBF) kernel:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(- \gamma {\Vert \mathbf{x} - \mathbf{x}^\prime \Vert }^2) \\
&= \exp(- \gamma {\Vert \mathbf{x} \Vert}^2)\exp(- \gamma {\Vert \mathbf{x}^{\prime} \Vert}^2) \exp(2 \gamma \mathbf{x}^\intercal\mathbf{x}^\prime)
\end{aligned} \end{equation},$$
where $\gamma$ is a hyper-parameter determining how sensitive the similarity score is to the distance between two points.  Let us actually work out the mapping function $\phi$, as it will help us understand this kernel inside out.  Although the math below is a bit messy, the idea is simple, so do not let the face of the equations intimidate you.

First, let us just focus on the simple 1-$d$ case for $x$, as the math quickly gets really complex as the dimensionality increases; we will explain later why the conclusions obtained in 1-$d$ hold for higher dimensions.

It is not hard to guess that $\phi$ should take the form $\exp(- \gamma x^2) \cdot \psi(x)$, so that the dot product between $\phi(x)$ and $\phi(x^\prime)$ contains the factor $\exp(-\gamma x^2)\exp(- \gamma x^{\prime 2})$.  $\psi(x)$ then needs to be chosen so that the sum of $\psi(x) \cdot \psi(x^{\prime})$ across all $Z$ dimensions equals $\exp(2 \gamma xx^\prime)$.  How?

Remember the Taylor expansion:
$$\begin{equation} e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \ldots + \frac{x^n}{n!} + \ldots \end{equation}$$
We can expand:
$$\begin{equation} \exp(2 \gamma xx^\prime) = 1 \cdot 1 + \frac{(\sqrt{2 \gamma}x)(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x)}^2{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \frac{{(\sqrt{2 \gamma}x)}^3 {(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}\sqrt{3!}} \ldots \end{equation}$$
Now we should be able to write out $\phi$ as:
$$\begin{equation} \phi(x) = \exp(-\gamma x^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x)}^3}{\sqrt{3!}}, \ldots \right] \\ \phi(x^\prime) = \exp(-\gamma {x^\prime}^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}}, \ldots \right] \end{equation}$$
Therefore, $\phi(x)\phi(x^\prime)$ computed using Equations (22) and (21) produces the same result as the kernel function!  The $Z$ space here is actually an infinite-dimensional space.  In this infinite space, if you diligently carried out the dot product between $\phi(x)$ and $\phi(x^\prime)$, you would end up with $K(x, x^\prime)$, which can easily be computed as $\exp\left(-\gamma {(x-x^\prime)}^2\right)$.  So the $Z$ space is only a hypothetical space in which no actual computation has to take place; we can accomplish everything in the $X$ space, thanks to the kernel function.
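
A quick numerical check of this claim in 1-$d$: a truncated version of the infinite feature map in Equation (22) already reproduces the Gaussian kernel value to machine precision; the truncation length and the test points are arbitrary.

```python
# Verify that the truncated feature map of Equation (22) reproduces the
# RBF kernel value exp(-gamma (x - x')^2) in 1-d.
import numpy as np
from math import factorial

def phi(x, gamma, n_terms=20):
    # n-th coordinate: exp(-gamma x^2) * (sqrt(2 gamma) x)^n / sqrt(n!)
    n = np.arange(n_terms)
    return np.exp(-gamma * x**2) * (2 * gamma) ** (n / 2) * x**n \
        / np.sqrt([factorial(k) for k in n])

x, x_prime, gamma = 0.3, 0.8, 2.0
dot = phi(x, gamma) @ phi(x_prime, gamma)          # dot product in (truncated) Z space
kernel = np.exp(-gamma * (x - x_prime) ** 2)       # kernel computed in X space
print(dot, kernel)                                 # the two values agree
```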

Without losing generality, let us assume our $x$ is within $[0, 1)$ (we can always achieve this through data normalization during the pre-processing step); most of the coordinates in our infinite dimensions quickly approach zero as $x^n$ converges to zero.  So effectively we only have a few lower-order dimensions; in most of the higher-order dimensions, all data points have coordinates near zero and are therefore not informative.  Since a linear model can automatically do feature selection by weighting, those dimensions will have weights of approximately zero.  So we are not really doing classification in an infinite-dimensional space; the effective VC dimension remains finite and SVM remains robust.

With the kernel function, solving the SVM problem in Equation (14) costs $O(N^2)$.  Then, to calculate $y$ for a data point to be predicted, we use Equation (16), which only depends on the number of support vectors, at most $O(N)$.  Therefore, the kernel enables SVM computation to be bounded by $N$, not by the dimensionality.  The effective VC dimension is approximately proxied by the number of model parameters $\alpha$, i.e., the number of support vectors.  No matter how many features $X$ contains, SVM is computationally bounded.

Now let us look at two extreme cases of $\gamma$ to understand the power of RBF kernel.

If $\gamma \to 0$, the $Z$ space described by Equation (22) looks like:
$$\begin{equation} \exp(-\gamma x^2) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x, O(\gamma) \right] \end{equation}$$
We omit the higher dimensions, as the data points lie close to the origin in those dimensions due to the small $\gamma$.  Well, this is simply the original linear space $X$.  Therefore, the RBF kernel under very small $\gamma$ degenerates into a linear transformation, i.e., it is basically a linear SVM model in this extreme case.

For the other extreme, when $\gamma$ is large: since the coordinate changes from one dimension to the next by a factor of $\sqrt{2 \gamma}x/\sqrt{n}$, the coordinates first increase, then eventually converge to zero.  The dimension where they peak depends on the value of $\gamma$: the larger $\gamma$ is, the more higher-order $x$ terms are included.  That is, we can control the polynomial order of the model by tuning $\gamma$.

By tuning $\gamma$ from very small values to larger values, our RBF kernel adjusts the hypothesis set $\mathcal H$ from linear models to higher-order polynomial models.  Therefore, we say the RBF kernel covers a broad spectrum of models; it is indeed one kernel for all!  For any given learning problem, we treat $\gamma$ as a regularization hyper-parameter and use CV to obtain its optimal value (Chapter 3), i.e., we tune our hypothesis space to identify the best model that contains enough complexity (a polynomial model) but is not too complex.  The RBF kernel is so powerful that it should be considered the default kernel for your SVM classifier [3].
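
A minimal sketch of this recipe with scikit-learn: grid-search $\gamma$ (together with the soft-margin penalty C used in practice) by cross validation; the data set and the grids are illustrative.

```python
# Tuning the RBF hyper-parameter gamma (and the soft-margin penalty C)
# by cross validation, as described in the text.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

grid = {"gamma": np.logspace(-3, 3, 7), "C": np.logspace(-2, 2, 5)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)

print("best gamma =", search.best_params_["gamma"],
      " best C =", search.best_params_["C"],
      " CV accuracy =", round(search.best_score_, 3))
```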

RBF Kernel for Higher-dimensional Input


Just to have peace of mind, let us examine the case where the input $\mathbf{x}$ is 2-$d$ and make sure everything we discussed for the RBF kernel above still holds true.  Since the complexity of the math expressions grows exponentially, we will just do enough to convince you it is indeed the case, without carrying out all the details.

Following the similar spirit of decomposing the kernel function using Taylor expansion, our kernel for 2-$d$ input can be written as:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(-\gamma ({(x_1-x_1^\prime)}^2 + {(x_2-x_2^\prime)}^2)) \\
&= \exp(-\gamma (x_1^2+x_2^2)) \cdot \exp(-\gamma (x_1^{\prime 2}+x_2^{\prime 2})) \cdot \exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime)
\end{aligned} \end{equation}$$
We have learned that the trick to constructing $\phi$ is to decouple $\mathbf{x}$ and $\mathbf{x^\prime}$ in the kernel function above; the first two factors have already taken care of themselves, and we need to rewrite the last two factors as a dot product.
$$\begin{equation}
\exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime) = \\
\left[ 1 + \frac{(\sqrt{2 \gamma}x_1)(\sqrt{2 \gamma}x_1^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_1)}^2{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right] \cdot
\left[ 1 + \frac{(\sqrt{2 \gamma}x_2)(\sqrt{2 \gamma}x_2^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_2)}^2{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right]
\\ \end{equation}$$
Now we are ready to figure out $\phi$ by cross-multiplying the two series into a sum of products:
$$\begin{equation}
\phi(\mathbf{x}) = \exp(-\gamma (x_1^2+x_2^2)) \cdot
\left[
1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1,
\frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2,
\frac{{(\sqrt{2 \gamma}x_1)}^2}{\sqrt{2!}},
2 \gamma x_1x_2,
\frac{{(\sqrt{2 \gamma}x_2)}^2}{\sqrt{2!}},
o(x_1^2 + x_1x_2 + x_2^2) \right]
\\ \phi(\mathbf{x}^\prime) = \exp(-\gamma ({x_1^\prime}^2+{x_2^\prime}^2)) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1^\prime, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2^\prime, \frac{{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}}, 2 \gamma x_1^\prime x_2^\prime, \frac{{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}}, o({x_1^\prime}^2 + x_1^\prime x_2^\prime + {x_2^\prime}^2) \right] \end{equation}$$
As $\gamma$ approaches zero, the key $Z$ dimensions are $\left\{ 1, \sqrt{2 \gamma}x_1, \sqrt{2 \gamma}x_2 \right\}$, which is the original linear space.  As $\gamma$ gets larger, higher-order terms $x_1^2$, $x_2^2$, $x_1x_2$, and even higher-order polynomial terms come into play, and $\phi$ becomes a polynomial mapping.  The same conclusion as in the 1-$d$ case.  Also notice the polynomial terms such as $x_1x_2$: the features of the input $X$ space are mingled in most of the $Z$ dimensions.  If there is a feature in the input space $X$ that is noise, it will contaminate many other features in the $Z$ space.  A linear model is able to do interpretable feature weighting and feature selection; however, this now happens in the $Z$ space, and the weights cannot be interpreted in the original $X$ space.  Therefore, we should do feature selection in the $X$ space before we use the kernel SVM.  Otherwise, one would have to rely on iteratively removing one feature at a time from $X$, repeating the SVM process, and discarding the features that do not degrade the performance.  This feature selection process can be a bit computationally expensive.

To summarize, we have explained that SVM with the RBF kernel is not only a very robust classifier with built-in generalization capability, but also a very powerful one, as its hyper-parameter $\gamma$ allows us to continuously cover linear models, higher-order polynomial models, and everything in between.  This is a beautiful ML technique.

Classification with Probability


One limitation of SVM is that it only predicts the label of $y$ without providing a probabilistic estimate.  We saw in Chapter 4 that probability can be estimated using a linear model called the logistic regression (LR) model, which minimizes a cross-entropy error plus a regularization error.  The disadvantage of LR is that the solution depends on all $N$ input data points, while SVM depends on only a few support vectors.  Therefore, SVM has a much better generalization property.  Can we estimate class probability while still taking advantage of the support vectors?

Yes.  The idea is to run SVM first, then take $d_i = \mathbf{w}^\intercal \phi(\mathbf{x}_i) + b$, i.e., the (signed) distance of $\mathbf{x}_i$ to the decision hyperplane in the $Z$ space, and use the pairs $(d_i, y_i)$ to train a one-dimensional LR model:
$$\begin{equation} f(+1 \vert d) = \frac{e^{\alpha d + \beta}}{1+e^{\alpha d + \beta}}. \end{equation}$$
There are some nuances in exactly how to compute $d_i$ [4], e.g., using cross-validation to obtain $d_i$ in an unbiased manner, but the main idea is outlined above.  For points within and near the margin, their probabilities vary from $\sim 0$ to $\sim 1$; for most points further away from the decision boundary, the probability quickly goes to zero (for the $-1$ class) or one (for the $+1$ class).
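
A minimal sketch of this two-step recipe (skipping the cross-validated computation of $d_i$ mentioned above); note that scikit-learn's SVC(probability=True) performs a cross-validated variant of this internally (Platt scaling).  Data are synthetic.

```python
# Step 1: fit an SVM and take its decision value d_i as a 1-d feature.
# Step 2: fit a one-dimensional logistic model f(+1 | d) on (d_i, y_i).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)
d = svm.decision_function(X).reshape(-1, 1)     # signed distance-like score d_i

platt = LogisticRegression().fit(d, y)          # 1-d logistic model
d_new = svm.decision_function(X[:5]).reshape(-1, 1)
print(platt.predict_proba(d_new)[:, 1])         # estimated p(+1 | x) for 5 points
```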

It is also worth mentioning that although most literature states that multi-class classification problems can be solved by SVM using either $K$ one-vs-all SVMs or $\frac{K(K-1)}{2}$ one-vs-one SVMs (for $K$ classes), the problem can also be solved directly by minimizing a hinge loss function in the original $X$ space (which cannot use a kernel) [5].  This enables the linear SVM to be used in Deep Learning as an alternative to the multi-class logistic regression classifier.
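
For reference, here is a small numpy sketch of the multi-class hinge loss described in [5]; the shapes, weights, and data are made up for illustration.

```python
# Multi-class hinge loss: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1),
# averaged over the data set.
import numpy as np

def multiclass_hinge_loss(W, X, y, delta=1.0):
    """W: (d, K) weights, X: (N, d) inputs, y: (N,) integer class labels."""
    scores = X @ W                                    # (N, K) class scores
    correct = scores[np.arange(len(y)), y][:, None]   # score of the true class
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0                 # do not count j == y_i
    return margins.sum(axis=1).mean()

rng = np.random.RandomState(0)
X = rng.randn(5, 4)                # 5 samples, 4 features
y = np.array([0, 2, 1, 2, 0])      # 3 classes
W = rng.randn(4, 3)
print(multiclass_hinge_loss(W, X, y))
```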

Reference


  1. KDS, page 173-175
  2. LFD, eChapter 8, page 40-45.
  3. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, Page 4.
  4. http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
  5. http://cs231n.github.io/linear-classify (see multi-class support vector machine loss)

Thursday, May 25, 2017

Chapter 4. Linear Models

We have already seen linear models in previous chapters.  Here we are going to study linear models more systematically, as they are the foundation for understanding more complex models.  In many cases, non-linear models can be reduced to linear models, and we will see how that works at the end of the chapter.

Linear Classification – Accuracy


Consider a classification problem where the inputs are $X$ and $\mathbf{y}$: $X$ is an $N \times (d+1)$ matrix and $\mathbf{y}$ is an $N \times 1$ vector with elements $\in \left\{-1, +1\right\}$.  Notice we add the element 1 as the first dimension of each input vector, so that the math later can be written more concisely.  Please see the end of Chapter 1 for $X$ and $\mathbf{y}$ spelled out in matrix form (Chapter 1, Equations (1)-(4)).

A linear classifier uses model:
$$\begin{equation} \mathbf{y} = {\rm sign}(X \mathbf{w}) \end{equation},$$
where $\mathbf{w}$ is the $(d + 1)$-element weight vector $[ w_0, w_1, w_2, \ldots, w_d ]^\intercal$.  We incorporated the extra first dimension into $X$ so that $w_0$ can be part of $\mathbf{w}$.

A natural choice of the cost function to maximize accuracy is:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert \end{equation}. $$
This is not quite right, as we are now ML-educated, we should take into account the regularization error:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert + \lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}}, \end{equation}$$
where $\tilde{\mathbf{w}} = [ w_1, w_2, \ldots, w_d ]^\intercal$.  We use ridge regularization here for simplicity, but it can certainly be another regularization form you prefer.  Because the ${\rm sign}(\cdot)$ function has no analytical derivative, we can only solve this equation numerically and iteratively.

At a certain step $n$, our guess of the answer is $\mathbf{w}_n$.  Let us assume point $\mathbf{x}_1$ is classified correctly and $\mathbf{x}_2$ is misclassified.  We consider making the next guess $\mathbf{w}_{n+1}$ around a small neighborhood of $\mathbf{w}_n$. Let us assume the change in the prediction for $\mathbf{x}_1$ is small and ignore that part for now, as $\mathbf{w}_n$ is already good from its perspective.  We need to make $\mathbf{w}_{n+1}$ a better predictor for $\mathbf{x}_2$.  If $y_2$ is currently $+1$, it means $\mathbf{w}_n^\intercal \mathbf{x}_2$ is currently negative (as it is misclassified), so it makes sense to have $\mathbf{w}_{n+1}$ to incorporate a positive component, i.e., $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 > 0$.  Otherwise, if $y_2$ is -1, we would like $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 < 0$, so that we have a better chance of correctly predicting $\mathbf{x}_2$ in the next iteration.

Considering both cases, we can design:
$$\begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_n + \eta y_2 \mathbf{x}_2, \end{equation}$$
so that $\mathbf{w}_{n+1}^\intercal \mathbf{x}_2$ has an additional term $\eta y_2 \mathbf{x}_2^\intercal \mathbf{x}_2$, which always changes $\mathbf{w}$ in a way that makes the classification outcome for $\mathbf{x}_2$ more favorable.  $\eta$ is a positive parameter called the learning rate.  $\eta$ should be large enough that we can explore the $\mathbf{w}$ space quickly, while also small enough that we do not jump over and miss the minimum.  Empirically, let us assume $\eta$ is 0.1 without going off topic [1].

So the learning algorithm goes like the following: it starts with $\mathbf{w}_0$; if there are points that are misclassified, it randomly picks one of them and uses it to alter $\mathbf{w}$ according to Equation (4), then repeats this process.  If the data points are indeed linearly separable and there is no regularization error term, it can be proven that the algorithm will eventually converge with all data points $100\%$ correctly classified, so the minimization algorithm works.  In reality, the algorithm may not stop, as not all data points can be correctly classified; we can memorize the solution associated with the minimum cost identified so far, and use that solution as our final answer $\mathbf{w}^*$.
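
Here is a numpy sketch of this loop, a pocket-style variant of the update in Equation (4); for brevity it starts from $\mathbf{w} = 0$ rather than the initialization of Equation (5) below and omits the regularization term, and the learning rate and iteration budget are arbitrary.

```python
# The update of Equation (4) plus the "pocket" trick of remembering the
# best w found so far (regularization omitted for brevity).
import numpy as np

def pocket_perceptron(X, y, eta=0.1, max_iter=1000, seed=0):
    rng = np.random.RandomState(seed)
    X1 = np.hstack([np.ones((len(X), 1)), X])          # prepend the constant 1
    w = np.zeros(X1.shape[1])
    best_w, best_err = w.copy(), np.inf
    for _ in range(max_iter):
        mis = np.where(np.sign(X1 @ w) != y)[0]        # misclassified points
        if mis.size == 0:
            return w                                   # separable: E_in = 0
        if mis.size < best_err:
            best_w, best_err = w.copy(), mis.size      # keep the best w so far
        i = rng.choice(mis)                            # pick one at random
        w = w + eta * y[i] * X1[i]                     # Equation (4)
    return best_w

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.hstack([np.ones(50), -np.ones(50)])
print(pocket_perceptron(X, y))
```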

What would be a good starting guess for $\mathbf{w}_0$?  If the data are reasonably separable, we expect the data points to form two clusters: points are expected to be closer to points with the same label and further away from points of the opposite label (Figure 4.1).  Therefore, a reasonable decision plane has its normal vector $\mathbf{w}$ pointing from the center of the $-1$ cluster to the center of the $+1$ cluster.  A good initial guess is:
$$\begin{equation} \mathbf{w}_0 = \frac{1}{N^{+1}} \sum \mathbf{x} \vert_{y = +1} - \frac{1}{N^{-1}} \sum \mathbf{x} \vert_{ y = -1}. \end{equation}$$
$\mathbf{w}_0$ can certainly be scaled to adjust the initial regularization cost.  The optimal hyper-parameter $\lambda$ will be determined through a cross-validation routine described in the previous chapter.
Figure 4.1.  A reasonable choice of $\mathbf{w}_0$ is the one that points from $\mu_{-}$ to $\mu_{+}$.

Logistic Regression – Cross Entropy Lost Function


The truth is not always black and white in classification problems.  Applicants with the same salary, age, and other feature values could behave differently: some turn out to be money makers for the bank and some money losers.  Among those who are credit-worthy, for some we are extremely confident they are money makers, and the bank could offer more attractive terms and start with a generous credit limit; others are marginal and should be either deprioritized or approved under somewhat more stringent terms, including a lower credit limit.  Therefore, it would be nice to model the probability of a data point being $+1$ or $-1$.  The probability prediction can be handy, e.g., for a cell phone company trying to retain customers: those predicted to churn with high probability should be prioritized for the retention department to hammer on.  Even if we do not care about probability, minimizing a probability cost function can lead to better classification results, as it encourages classes to be pushed further apart and results in a more robust classifier [2].

This probabilistic classification problem can be formulated as follows: the truth is $p(y \vert \mathbf{x})$.  We are given $N$ records sampled from the distribution $p(\mathbf{x}, y)$ and we would like to model $p(y = +1 \vert \mathbf{x})$ with a family of hypotheses $f(\mathbf{x})$.   The probability of the $-1$ class is then $p(y = -1 \vert \mathbf{x}) = 1-f(\mathbf{x})$.  The optimal hypothesis $g$ is the one that gives the maximum likelihood based on the given data, subject to the constraint of the regularization error.  We can use a linear model, called Logistic Regression (LR), to solve this problem.

Our LR hypothesis space for modeling $p(+1 |\mathbf{x})$ is:
$$\begin{equation}
f(s) = {e^s}/(1+e^s),
\end{equation}$$
where
$$\begin{equation}
s = \mathbf{w}^\intercal \mathbf{x}.
\end{equation}$$
The logistic function is convenient, because the probability of obtaining $-1$ label is:
$$\begin{equation} \begin{aligned}
1-f(s) &= 1/(1+e^s) \\ & = f(-s)
\end{aligned} \end{equation}.$$
In another words, the probability of a data point with label $y$ is:
$$\begin{equation}
p(y_i \vert \mathbf{x}_i) = f(y_i \mathbf{w}^\intercal \mathbf{x}_i).
\end{equation}$$

That is, we first calculate a linear score $s$ and hope that the more positive the score $s$ is, the more likely the label is $+1$; the more negative the score, the more likely the label is $-1$.  We then convert this linear score from the range $(-\infty, + \infty)$ into the range $(0, 1)$ using Equation (6), so that it has a chance to be interpreted as a probability score.

Let us use the Bayesian approach to find $\mathbf{w}^*$:
$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\max_{\mathbf{w}} \; p(\mathbf{w} \vert X, \mathbf{y}) \\
&= \arg\max_{\mathbf{w}} \; p(\mathbf{w}) \prod_{i=1}^N \; \frac{p(y_i \vert \mathbf{x}_i, \mathbf{w})}{p(y_i|\mathbf{x}_i)p(\mathbf{x}_i)}. \end{aligned} \end{equation}$$
The denominator is independent of $\mathbf{w}$ and can be ignored.  We can then minimize the negative log likelihood of the numerator:
$$\begin{equation}
\mathbf{w}^* =  \arg\min_{\mathbf{w}} \; - \sum_{y_i=+1}\log(f(\mathbf{w}^\intercal \mathbf{x}_i)) - \sum_{y_i=-1}\log(1-f(\mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
Using Equation (9), it can be simplified into:
$$\begin{equation}
\mathbf{w}^* = \arg\min_{\mathbf{w}} \; - \sum_{i=1}^N \; \log(f(y_i \mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
The first term is the so-called cross-entropy loss function, which is an important error measure when we would like to model a probability distribution.  This term is basically the sum of the negative log of the correct-class probabilities of all data points, i.e., we want the probability of each data point belonging to its true class to be maximized.  The second term basically gives a regularization error term, containing a parameter $\lambda$, which we have already discussed in Chapter 3.  In addition, $\mathbf{w}$ should be replaced by $\tilde{\mathbf{w}}$ in the second term.

Equation (12) has no analytical solution; we have to solve it numerically.  Let us introduce the idea of gradient descent.  To minimize a general function $f(\mathbf{x})$, we start with an initial guess $\mathbf{x}_0$.  The gradient at $\mathbf{x}_0$ is $\nabla f(\mathbf{x}_0)$, pointing in the direction where $f(\mathbf{x})$ increases fastest.  Therefore, we should search in the opposite direction.  Our next guess would therefore be:
$$\begin{equation} \mathbf{x}_1 = \mathbf{x}_0 \; - \; \eta \nabla f(\mathbf{x}_0) \end{equation}.$$
A Taylor expansion shows $f(\mathbf{x}_1) - f(\mathbf{x}_0)$ would be negative (for a sufficiently small $\eta$).  We can iterate this process to find $\mathbf{x}_2$, $\mathbf{x}_3$, …, until the process converges or we use up our computational quota.  Searching for smaller $f$ values along the negative gradient direction explains the name gradient descent (GD).  Among the many optimization algorithms out there, GD is popular in machine learning because a variant called Stochastic Gradient Descent (SGD) enables us to scale the algorithm to large training data sets [3].  It has been proven by math geniuses that Equation (12) has only one minimum, which makes it very straightforward to solve, and we will not get into it.
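
Below is a numpy sketch of solving Equation (12) by plain gradient descent with a ridge prior: the averaged gradient of the cross-entropy term is $-\frac{1}{N}\sum_i y_i \mathbf{x}_i f(-y_i \mathbf{w}^\intercal \mathbf{x}_i)$, and the prior contributes $2\lambda\tilde{\mathbf{w}}$.  The data, learning rate, and iteration count are illustrative.

```python
# Logistic regression by gradient descent on the cross-entropy error of
# Equation (12), with a ridge prior on w~ (the bias w_0 is not regularized).
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, lam=0.01, eta=0.1, n_iter=5000):
    X1 = np.hstack([np.ones((len(X), 1)), X])          # prepend the constant 1
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        s = y * (X1 @ w)
        grad = -(X1 * (y * sigmoid(-s))[:, None]).mean(axis=0)  # cross-entropy part
        grad[1:] += 2 * lam * w[1:]                    # ridge prior on w~, not w_0
        w -= eta * grad                                # gradient descent step
    return w

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 1.5, rng.randn(50, 2) - 1.5])
y = np.hstack([np.ones(50), -np.ones(50)])
w = logistic_regression_gd(X, y)
print("w* =", w)
print("p(+1|x) for the first 3 points:", sigmoid(X[:3] @ w[1:] + w[0]))
```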

A question we may have: we simply coin a function $e^s/(1+e^s)$ and do parameter fitting on $\mathbf{w}$, then we argue the value of this function has the meaning of the probability $p(+1|\mathbf{x})$.  It is a huge leap of faith to associate a function with range $(0, 1)$ with a probability interpretation.  The range $(0, 1)$ is only a necessary condition for a function to carry a probability meaning, not a sufficient condition.  At least this is something that bothered me for quite a while; indeed, some authors argue the $p$ values given by logistic regression should not be treated as probabilities, but I now believe they should be.  We can agree $f(\mathbf{w}^\intercal \mathbf{x})$ can be used for ranking data points, but why is its value an estimate of probability?  The answer lies in the fact that the first two terms in Equation (11) are cross-entropy terms.

Imagine there are plenty of data points within the neighborhood of $\mathbf{x}_i$, of which $n_{+}$ and $n_{-}$ are generated from the two underlying classes, respectively, i.e., the probability of being positive is $p_{+} = \frac{n_{+}}{n_{+}+n_{-}}$.  For this subset of data points, the first two terms in Equation (11) become:

$$\begin{equation} - n_{+} \log f(\mathbf{x}_i) - n_{-}  \log(1-f(\mathbf{x}_i)) \end{equation}, $$

It is straightforward to show this cross-entropy term reaches its minimum when $f = p_{+}$; therefore, our optimal solution $f$ is molded to be as close to $p_{+}$ as possible.  The logistic model is one naive guess to mimic $p_{+}$, but if someone comes up with a better guess $f^\prime$, that model would be scored more favorably.  So by minimizing our error function in Equation (11), $f$, the best winning model, approaches $p_{+}$ and can therefore be interpreted as a proxy for $p_{+}$.  Certainly $f$ might not be an accurate proxy if the true shape of $p_{+}$ is far from a logistic shape.
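For completeness, the one-line calculus behind this claim: setting the derivative of the term in Equation (14) with respect to $f$ to zero gives
$$\begin{aligned} \frac{\partial}{\partial f}\left[-n_{+}\log f - n_{-}\log(1-f)\right] &= -\frac{n_{+}}{f} + \frac{n_{-}}{1-f} = 0, \\ \Rightarrow \quad f &= \frac{n_{+}}{n_{+}+n_{-}} = p_{+}, \end{aligned}$$
and the second derivative is positive there, so this is indeed the minimum.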

Linear Regression – Maximum Likelihood


The linear regression problem is to model an unknown real-valued target function $y = f(\mathbf{x})$ using a linear model.  That is, we aim to search within the hypothesis set containing hyperplanes described by $y = \mathbf{w}^\intercal \mathbf{x}$ for a hypothesis $g$ such that $g(\mathbf{x}) \approx f(\mathbf{x})$.

The quality of $g$ can be measured by least square error (discussed in Chapter 2):
$$\begin{equation} \begin{aligned}
e(\mathbf{w}) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} - \mathbf{y})^\intercal (X\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w}.
\end{aligned} \end{equation}$$
Again, the second term is a regularization term; although we use the ridge regression form here, feel free to use Lasso or other forms.  The ridge form, however, gives a beautiful analytical solution, obtained by setting the derivative with respect to $\mathbf{w}$ to zero in Equation (15):
$$\begin{equation} (X^\intercal X + \lambda I)\mathbf{w} = X^\intercal \mathbf{y} \end{equation},$$
$$\begin{equation} \mathbf{w}^* = {(X^\intercal X + \lambda I)}^{-1}X^\intercal \mathbf{y} \end{equation}.$$
Notice:
$$\begin{equation} X^\intercal \mathbf{y} = \sum_i \; y_i \mathbf{x}_i \end{equation},$$
the final solution is a linear combination of all input vectors, i.e., all data points contribute.

In practice, we do not use Equation (17).  This is because the regularization term should actually be $\lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}}$; we either have to use the GD method, or need to center the input vectors in order for the analytical solution to work [4].  Let us resort to the form where $w_0$ is taken out of the vector $\mathbf{w}$ and written as $b$:
$$\begin{equation} \begin{aligned}
e(\mathbf{w}, b) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i + b - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} + b - \mathbf{y})^\intercal (X\mathbf{w} + b - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w}.
\end{aligned} \end{equation}$$
To minimize $e$, we set the partial derivatives with respect to $\mathbf{w}$ and $b$ to zero and obtain the following: $$\begin{equation} \begin{aligned} b &= \frac{1}{N} \sum(y_i - \mathbf{w}^\intercal \mathbf{x}_i) \\ &= \frac{1}{N} \sum y_i \end{aligned} \end{equation}$$ $$\begin{equation} \begin{aligned} (X^\intercal X + \lambda I)\mathbf{w} &= X^\intercal(\mathbf{y} - b). \end{aligned} \end{equation}$$ The second equality in Equation (20) takes advantage of the assumption that the $\mathbf{x}_i$ have been zero-centered. Equation (21) is basically what we saw before, with $\mathbf{y}$ replaced by $\mathbf{y} - b$. Therefore, linear regression can still be solved analytically with the regularization term.
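A minimal numpy sketch of this recipe: center the inputs, solve Equation (21) in the centered coordinates, and recover the intercept from Equation (20).  The random data and $\lambda$ value here are made up for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression with an unregularized intercept b.
    Rows of X are data points; the features are centered internally."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                        # zero-center the features
    d = X.shape[1]
    # Equation (21): (Xc^T Xc + lam I) w = Xc^T (y - b); with centered Xc,
    # Xc^T (y - b) = Xc^T y, so b drops out of the linear system.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)
    b = y.mean() - x_mean @ w              # Equation (20) gives b = mean(y) in the
                                           # centered coordinates; shift back to x
    return w, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=50)
w, b = ridge_fit(X, y, lam=1.0)
print(w, b)
```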
One interesting question: since we can find the optimal hypothesis $g$ in the above analytical form without evaluating all the other infinitely many linear hypotheses, does it mean our hypothesis set $\mathcal H$ only contains the one hypothesis $g$, so we do not have to worry about over-fitting?  That is obviously too good to be true.

Imagine there are three algorithms: the first one painstakingly goes through a rather tedious grid search in the parameter space, explores a vast set of hypotheses in $\mathcal H$, and eventually finds $\mathbf{w}^*$.  The second one uses the GD method in Equation (13) and identifies $\mathbf{w}^*$ after multiple iterations.  The third one takes advantage of the analytical solution (Equation (21)) and lands on $\mathbf{w}^*$ in a single step.  They basically explore the same hypothesis space, except the third one is the smartest in avoiding the searches it knows will not come out favorably (or consider that the third one has explored all hypotheses virtually).  Our hypothesis set still contains an infinite number of linear hypotheses; the reason we are not testing them individually is that we happen to be able to foresee that their performance is not going to be better than the analytical solution.  This "super vision" only comes after we see $\mathcal{D}$.  In other words, all linear hypotheses in $\mathcal H$ have a chance to be our analytical solution; the best hypothesis $g$ is not known beforehand, but only after $\mathcal D$ is given.  Therefore, what determines the VC dimension here is not how we reach the final answer, but the effective total hypothesis candidate space.

For more complex models, we do not have an analytical solution and rely on the GD algorithm to explore the hypothesis space like the second algorithm.  We are not just evaluating one hypothesis per iteration; we have effectively explored many hypotheses in the vicinity of the trial hypothesis under evaluation, it is just that we have the knowledge to believe those hypotheses not along the negative gradient direction are unlikely to perform better than the next hypothesis along that direction.  So even if our solution is found after $n$ iteration steps, our hypothesis space is much larger than $n$.  In those cases, the effective hypothesis space explored by GD is indeed smaller than the total hypothesis space, as there are probably other independent local-minimum hypotheses missed by GD.  In the linear regression example, the space that GD does not explore happens to contain no local minimum.  The bottom line is that having an analytical solution does not eliminate the need for regularization.
Figure 4.2.  The effective hypothesis space explored by the gradient descent algorithm is larger than the individual red hypothesis points.

Feature Selection Capability of Linear Models


Often only a subset of features is informative; the rest are irrelevant to the learning problem.  Many algorithms cannot distinguish informative features from noisy ones and are confused by that.  For example, if only a small portion of the features are informative and we use the $k$-nearest neighbor algorithm with Euclidean distance as the similarity metric, where all features are weighted equally, then the distance would be dominated by noise and the performance would be poor.  The goal of feature selection is to identify and eliminate a subset of features without sacrificing too much $E_{in}$.  By using fewer features, our model requires fewer parameters, therefore a smaller VC complexity, therefore a smaller $E_{reg}$, eventually leading to an overall smaller $E_{out}$.

A linear model is interesting in that it automatically does feature prioritization.  Consider the two-dimensional two-class example: the weight vector $\mathbf{w}^*$ approximately points from the center of the $-1$ cluster to the center of the $+1$ cluster (Figure 4.1).  In this example, $x_1$ is more informative than $x_2$, and we have $w_1 \gg w_2$, because:
$$\begin{equation} w_i = \frac{\partial{y}}{\partial{x_i}} \end{equation}.$$
So the linear plane points in the direction where the gradient of $y$ tends to be the largest, and the weight vector, the normal of the hyperplane, basically describes how important each feature dimension is and weights them accordingly.
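A quick numerical illustration of this point on synthetic data (so the specific numbers carry no meaning): $x_1$ drives the label while $x_2$ is pure noise, and the fitted weight on $x_1$ comes out much larger in magnitude.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)          # informative feature
x2 = rng.normal(size=n)          # pure noise feature
X = np.column_stack([x1, x2])
y = np.where(x1 + 0.3 * rng.normal(size=n) > 0, 1, -1)   # label depends on x1 only

clf = LogisticRegression().fit(X, y)
print(clf.coef_)                 # |w_1| should dwarf |w_2|
```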

Non-linear Classification


In Figure 4.3, the boundary between the two classes is obviously non-linear.  One way is to use a hypothesis set including non-linear boundary models.  Another way, which we prefer, is to transform this problem into a linear problem!

The general idea is to introduce a mapping function from the original $X$ space into a new space $Z$:
$$\begin{equation} X \to Z: Z = \phi(X).\end{equation}$$
The mapping is non-linear, e.g., $Z$ is a second-order polynomial space, $Z = \left[ 1, x_1, x_2, x_1^2, x_1x_2, x_2^2 \right]$.  Here we project a three-dimensional data point $[1, x_1, x_2]$ into six dimensions.  Hopefully, the non-linear transformation has absorbed the non-linearity in the data, so that when we examine data points $[z_0, z_1, z_2, \ldots , z_m]$ in the $Z$ space (with dimensionality $m$), a linear model works well there:
$$\begin{equation} \begin{aligned} y &= {\rm sign} (\mathbf{w}^\intercal \mathbf{z}) \\ &= {\rm sign} (\mathbf{w}^\intercal \phi(\mathbf{x})). \end{aligned} \end{equation}$$
In $Z$ space, we use the linear solvers discussed above to find the optimal weights $\mathbf{w}^* = \left[ w_0, w_1, \ldots,  w_m \right]$.  Let us assume in our particular example, the optimal solution is $\mathbf{w}^* = (1, 0, 0, -1, 0, -1)$. The decision function in the $X$ space will be non-linear:
$$\begin{equation} \begin{aligned} y &= {\rm sign}(\mathbf{w}^{*\intercal}\mathbf{z}) \\ &={\rm sign}(1-z_3-z_5) \\ &= {\rm sign}(1-x_1^2 - x_2^2). \end{aligned} \end{equation}$$
We have shown how to obtain a good non-linear model, where the core computation of our learning algorithm takes place in a linear space $Z$.  We will discuss more on how to automatically include non-linearity using a Support Vector Machine classifier in the next chapter.  Please read the corresponding LFD chapters for more examples [5].
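Here is a small sketch of the $X \to Z$ recipe using scikit-learn's PolynomialFeatures for the second-order mapping above, with a plain linear classifier trained in the $Z$ space; the circular toy data is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # non-linear (circular) boundary

# X -> Z: z = [1, x1, x2, x1^2, x1*x2, x2^2]
phi = PolynomialFeatures(degree=2, include_bias=True)
Z = phi.fit_transform(X)

# a linear classifier in Z space yields a non-linear boundary in X space
clf = LogisticRegression(fit_intercept=False, max_iter=1000).fit(Z, y)
print(clf.score(Z, y))           # in-sample accuracy; should be close to 1
```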

Figure 4.3. Relying on a non-linear mapping function, a linear hyperplane works well in the $Z$ space.  The decision plane corresponds to a non-linear decision boundary in the original $X$ space.



Regularization Again


A puzzling question regarding the regularization term is why models with larger $\mathbf{w}$ are considered more complex.  If our model is non-linear, or our linear model lives in the transformed $Z$ space, some weight parameters represent higher-order $\mathbf{x}$ terms, i.e., more complex hypotheses; reducing those weights does reduce the model complexity, which is not hard to understand.  The challenge is: if we are in the original linear $X$ space, why is a hyperplane with larger $\mathbf{w}$ considered more complex?

For a linear regression problem, if both $\mathbf{x}$ and $\mathbf{y}$ are normalized so that they are centered around zero, we could appreciate that lines with $\mathbf{w}$ closer to zero might be more likely, as they are rather straightforward hypotheses.  A better way to think about this is: given two hypotheses of similar $E_{in}$, i.e., their predictions are about the same, $y\approx  \mathbf{w_1}^\intercal \mathbf{x} \approx \mathbf{w_2}^\intercal \mathbf{x}$, since the gradient $\nabla_{\mathbf{x}} y = \mathbf{w}$, the hypothesis with smaller $\mathbf{w}$ has a smaller gradient with respect to $\mathbf{x}$.  This means we can tolerate larger input noise $\sigma_{\mathbf{x}}$ while the prediction remains essentially unchanged; such hypotheses should be more generalizable.  Also notice this argument naturally excludes $w_0$ from regularization, as it is not present in the gradient expression.

For a linear classification problem, one can scale $\mathbf{w}$ by a positive value and retain the same accuracy, as $y = \rm{sign}(\mathbf{w}^\intercal \mathbf{x})$ remains unchanged for all $c\cdot \mathbf{w}$.  On one hand, this means our hypothesis space is redundant, therefore reducing the redundancy by introducing regularization makes the hypothesis search easier.  On the other hand, similar to the regression case, a smaller $\mathbf{w}$ means the transition from $-1$ to $+1$ as we move from one class of $\mathbf{x}$ to the opposite class is smoother.  When the data points are linearly separable in the logistic regression case, the optimal solution without regularization is $\mathbf{w} \to \infty$, where the probability function is a sharp step function; this basically is an overfit [6].  Constraining $\mathbf{w}$ to be small essentially excludes such step functions.  A small $\mathbf{w}$ means $y$ changes slowly, which corresponds to maximizing the margin of the decision boundary, as we will see in the next chapter.

Extra - Multiclass Classifier

What we described here are binary classifiers; in reality, many problems contain multiple target labels, such as identifying the object in a given picture, where the object can be one of 1000 candidates.  Popular approaches for multiclass classifiers are described in another blog post.

Reference


  1. LFD, page 94-95
  2. DLB, page 270
  3. LFD, page 97-98
  4. The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani, Jerome Friedman. Second Edition. page 63-64.
  5. LFD, page 99-104
  6. Christopher M. Bishop. Pattern Recognition and Machine Learning. p.206.
    It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $\mathbf{w}^\intercal \mathbf{x} = 0$, separates the two classes and the magnitude of $\mathbf{w}$ goes to infinity. $\cdots$ Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable.  The singularity can be avoided by inclusion of a prior and finding a MAP solution for $\mathbf{w}$, or equivalently by adding a regularization term to the error function.

Wednesday, May 24, 2017

Chapter 3. Regularization & Cross Validation

Bayesian Interpretation of Regularization


Although the main conclusion from Chapter 2 is simple, i.e., we need to minimize the sum of $E_{in}$ and $E_{reg}$, the derivation process, also known as statistical learning theory, is somewhat complicated, with many assumptions.  It relies on the concepts of the Hoeffding Inequality, Bonferroni correction, VC dimension, etc.  It also depends on idealized experiments, where virtual data sets $\mathcal D$ of size $N$ are sampled again and again in order to study the connection between $E_{in}$ and $E_{out}$; this is a frequentist's way of thinking.   In addition, we only showed the learning theory using a classification problem and did not prove the same conclusion works for regression problems.  Let us reconstruct the regularization problem from a Bayesian point of view, where the whole VC argument can be avoided.  The Bayesian camp is proud that they do not have to struggle with the over-fitting concept; however, we know there is no free lunch, and they have to defend something else instead: the prior.  The Bayesian approach is important as it is a more popular way of deriving the same conclusion, as seen in many ML books; it also offers a much cleaner explanation of the regularization error and gives us confidence on when we need to use a customized $E_{reg}$.

When a Bayesian person looks at our model, he has a prior belief about $\mathbf{w}$ without seeing any data points.  For most real-life problems, we do have some common sense, e.g., common sense tells us that many models, especially those with extremely large $\mathbf{w}$ components on high-order polynomial terms, are not worth evaluating.  For a classification problem, a large $\mathbf{w}$ can be scaled down without changing the classification results.  For a regression problem, a large $\mathbf{w}$ is unlikely to be the optimal solution, as $y$ tends to be finite.

Now let us consider a general case, where our prior belief is that $\mathbf{w}$ follows a prior distribution $p(\mathbf{w})$. The best $\mathbf{w}^*$ is the one that maximizes the conditional probability according to Bayes' theorem: $$\begin{equation} p(\mathbf{w} \vert \mathcal{D}) = \frac{p(\mathcal{D} \vert \mathbf{w})p(\mathbf{w})}{p(\mathcal{D})}. \end{equation}$$ The probability of observing our data set of $N$ data points is: $$\begin{equation} \begin{aligned} p(\mathcal{D}) &= \prod_{i=1}^N \; p(\mathbf{x}_i, y_i), \\ &= \int \left[ \prod_{i=1}^N \;  p(\mathbf{x}_i, y_i \vert \mathbf{w}) \right] p(\mathbf{w})d\mathbf{w}, \\ &= \int \left[ \prod_{i=1}^N \; p(y_i \vert \mathbf{x}_i, \mathbf{w})p(\mathbf{x}_i) \right] p(\mathbf{w}) d\mathbf{w}, \end{aligned} \end{equation}$$ where $p(\mathcal{D})$ integrates over all possible values of the model parameter $\mathbf{w}$ according to our prior belief. While frequentists think about the distribution of the data set $\mathcal{D}$, Bayesians think about the distribution of $\mathbf{w}$. $p(\mathcal{D})$ is independent of $\mathbf{w}$; therefore, maximizing Equation (1) is equivalent to maximizing the following:
$$\begin{equation} \begin{aligned} p(\mathcal{D} \vert \mathbf{w})p(\mathbf{w}), \end{aligned} \end{equation}$$ and thus minimizing its negative log transformation:
$$\begin{equation} -\log (p(\mathcal{D} \vert \mathbf{w})) - \log(p(\mathbf{w})). \end{equation}$$
The first term describes how well our model explains the training data points.  Since the VC theory was derived based on a classification problem, we here use a regression problem to reproduce its conclusion, assuming the noise follows a Gaussian distribution:
$$\begin{equation} p(\mathcal{D} \vert \mathbf{w}) = \prod_{i=1}^N \; \frac{1}{\sqrt{2 \pi} \sigma}\exp\left(-\frac{{(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2}{2 \sigma^2}\right)p(\mathbf{x}_i). \end{equation}$$
$$\begin{equation} -\log(p(\mathcal{D} \vert \mathbf{w})) \sim \sum_{i=1}^N \; {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 \end{equation},$$
which is the familiar least square error cost function.

If the data points we expect to see are within a certain radius of the origin in a regression problem, there is no need to test separating planes far outside that radius, right?  Let us translate this into a soft constraint, i.e., our belief is that $\mathbf{w}$ should be reasonably close to the origin:
$$\begin{equation} p(\mathbf{w}) = \frac{1}{\sqrt{2 \pi} \sigma} \exp( - \frac{{\Vert \mathbf{w} \Vert}^2}{2 \sigma^2}) \end{equation}.$$

$\sigma$ is a parameter reflecting how strong our prior belief is.  Then the log prior term becomes:
$$\begin{equation} -\log p(\mathbf{w}) \sim \lambda {\Vert \mathbf{w} \Vert}^2 \end{equation}.$$
That is, if our prior belief on $p(\mathbf{w})$ is a Gaussian prior, the regularization term is the familiar ridge regression term, with $\lambda$ a convenient parameter that replaces $\sigma$.

If $p(\mathbf{w})$ is a Laplace prior:
$$\begin{equation} p(\mathbf{w}) = \frac{1}{2 \sigma} \exp( - \frac{{\Vert \mathbf{w} \Vert}_1}{\sigma}) \end{equation}.$$
The negative log prior gives the Lasso regression form  $\lambda \sum_{i} \; \vert w_i \vert$.

So the Bayesian camp reaches Rome through a different route.  This further strengthens our argument that the regularization term is not a hack; there is a strong reason to have it as part of our cost function, so that the cost function can lead to a better hypothesis with smaller $E_{out}$.  This is true either from the Bayesian prior belief or from the previous VC theory based on multiple tests.  Please be aware that for linear models with zero-centered features, the $\mathbf{w}$ in the regularization error should be replaced by $\tilde{\mathbf{w}}$, i.e., $w_0$ should be excluded in most cases, as discussed in Chapter 2 Equation (19).

Bias and Variance


Yet another interesting way to look at the difference between $E_{in}$ and $E_{out}$ is through a regression problem.  Consider the closeness between our picked hypothesis $g^\mathcal{D}$ and the target function $f$, averaged over all possible virtual data sets $\mathcal{D}$:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {\Vert f(\mathbf{x})-g^\mathcal{D}(\mathbf{x})\Vert}^2 & = \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {\Vert (f(\mathbf{x})-\hat{g}(\mathbf{x})) + (\hat{g}(\mathbf{x})-g^\mathcal{D}(\mathbf{x}))\Vert}^2 \end{aligned} \end{equation},$$
where
$$\begin{equation} \hat{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\; g^\mathcal{D}(\mathbf{x}). \end{equation}$$
$\hat{g}$ is a hypothetical hypothesis, obtained by averaging the individual $g^\mathcal{D}$ over all imaginary possible input sets $\mathcal{D}$.  $\hat{g}$ does not have to be in the original hypothesis space (the average of hypotheses in $\mathcal{H}$ need not itself be in $\mathcal{H}$), but it does represent an expected typical $g$, with the randomness introduced by the training set $\mathcal{D}$ averaged out.

Equation (10) can be written as:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathbf{x}} \; {(f(\mathbf{x})-\hat{g}(\mathbf{x}))}^2 + \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {(\hat{g}(\mathbf{x})-g^{\mathcal{D}}(\mathbf{x}))}^2 \end{aligned} \end{equation}.$$
Notice a cross-product term has been removed due to Equation (11).  In Equation (12), the first term is how close our data-set-agnostic hypothesis $\hat{g}$ is to the target function $f$, which reflects the capability of our hypothesis set in mimicking $f$.  This is called the bias term, as it is a systematic error, i.e., for a given $\mathbf{x}$, $\hat{g}(\mathbf{x})$ is either consistently lower or higher than the target value.  The bias will not go away, even if we have infinite $N$.  The second term is the statistical fluctuation from using a particular $\mathcal{D}$ to pick $g$.  If we had encountered a different data set $\mathcal{D}^\prime$, $g^{\mathcal{D}^\prime}$ would have been somewhat different.  Hypothetically, given enough $\mathcal{D}$, our picks will converge to a stable average model $\hat{g}$.  So the second term is a variance term, reflecting the uncertainty of our model due to the randomness of seeing a particular training set.  If the $N$ of each individual training set is sufficiently large, the variance can be expected to be small, as each $g$ is a fit on $N$ points.

When $\mathcal{H}$ contains only simple hypotheses, the variance is very small and the bias term dominates: our simple hypothesis $\hat{g}$ is far away from $f$, and we underfit.  On the other hand, when $\mathcal{H}$ only contains complex hypotheses, the mean hypothesis $\hat{g}$ is probably closer to $f$, therefore the bias is small.  But due to limited $N$, each time we are given a new data set, our learning algorithm picks a different $g$ that is far away from $\hat{g}$ and from the other picks, i.e., the variance is large.  If $N$ is small, the variance error typically dominates and we are better off using a simpler model, as the error originating from bias is often smaller than the error originating from variance.  So this analysis gives the same conclusion as the VC theory.  Figure 3.1 summarizes what we discussed here [1].
Figure 3.1. For a simple model, bias is large and variance is small.  For a complex model, bias is small but variance is large; for most given $\mathcal{D}$, we are unlikely to land somewhere close to $\hat{g}$.
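The decomposition can be made tangible with a small simulation: repeatedly draw data sets $\mathcal{D}$ from a fixed target (a sine curve with noise is used here purely as an example), fit a simple and a more complex polynomial model to each, and estimate bias$^2$ and variance by averaging the fits over a fixed evaluation grid.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                               # target function
x_grid = np.linspace(0, 2 * np.pi, 100)  # points where bias/variance are evaluated
N, n_datasets = 10, 500

for degree in (1, 5):                    # simple vs more complex polynomial model
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x = rng.uniform(0, 2 * np.pi, N)              # a fresh data set D
        y = f(x) + 0.2 * rng.normal(size=N)
        coef = np.polyfit(x, y, degree)               # g^D
        preds[d] = np.polyval(coef, x_grid)
    g_bar = preds.mean(axis=0)                        # \hat{g}, the average hypothesis
    bias2 = np.mean((f(x_grid) - g_bar) ** 2)
    variance = np.mean((preds - g_bar) ** 2)
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```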


Learning Theory Review


We have covered a lot of ground in learning theory.  If you did not follow all the math, do not worry, as those equations are not used to solve real machine learning problems.  What will be used are the ML guidelines derived from this math.

First, the VC theory tells us that due to the limited sample size $N$ and the large number of hypotheses in our search space $\mathcal{H}$, the real performance $E_{out}$ is expected to be $E_{in} + E_{reg}$, where $E_{reg}$ can be reduced by having a larger $N$ and using a smaller hypothesis space.  The guideline is that we cannot simply minimize $E_{in}$, which would lead to over-fitting.  By tuning the size of our hypothesis space, we could reach optimal learning.

Second, the Bayesian framework tells us our prior knowledge of the model parameters is important.  The particular training data set we see could be noisy, so this data set should only be used to refine our prior knowledge, not to override it.  For example, we believe the fundamental laws of dynamics should be simple, so when we receive a set of experimental data recording the speed $s$ and time $t$ of a moving object, we should stick to simpler hypotheses such as $s \sim w_0+w_1t$ or $s \sim w_0+w_1t + w_2t^2$, instead of some ${20}^{th}$-order polynomial hypothesis, which would over-interpret the noise in our measurement.  So ML needs to weigh both the information carried by our data set, $E_{in}$, and our prior knowledge, $E_{reg}$.  How heavily should our prior knowledge be weighted?  Too much, and we are too stubborn to learn from data; too little, and we lose basic principles and are easily misled.  The optimal balance is probably something called "experience".  Personally, I like the Bayesian explanation, as it is simple and intuitive and it applies to both classification problems (see logistic regression in the next chapter) and regression problems.

Third, the effect of model complexity can also be explained by the bias-variance model.  A model that is too complex minimizes bias; however, it is misled by the noise in the particular data set we see, so the variance is huge.  A model that is too simple is robust against noise in the data set, but unable to depict the truth in enough detail.  Unlike the other two, this explanation does not offer a mathematically workable scoring function, as we only have one data set.  A technique called bagging could be used to generate multiple virtual data sets from $\mathcal{D}$, and therefore an approximate form of $\hat{g}$, to reduce variance.  We will not get into bagging in this note.

To implement ML in practice, we have to turn the guidelines into an empirical form of cost function.  Let us use the Bayesian approach to derive a popular form:
$$\begin{equation} g = \arg\min_{h \in \mathcal{H}} \; E_{in} + \lambda \cdot L(\mathbf{w}) \end{equation}, $$
where $L(\mathbf{w})$ can be thought of as our prior belief without seeing the data.  $\lambda$ tunes how strong our belief is.  $E_{in}$ is the evidence observed from the particular training data set.  The second term is $E_{reg}$.  The best hypothesis can be found by minimizing the sum of these two terms.  One may wonder why $E_{reg}$ is a function of $N$ according to Equation (16) in Chapter 2, while here $\lambda$, originating from $\sigma$, is independent of $N$.  This is because $E_{in}$ here is the sum over all $N$ data points and is therefore proportional to $N$, thus $E_{in}$ increases as $N$ increases.  The regularization term therefore becomes relatively less important as $E_{in}$ increases.  In other words, if we normalize $E_{in}$ so that it represents the average error per data point, the log-prior term would need to be divided by $N$.  If $E_{in}$ is written in normalized form, the optimal $\lambda$ would actually decrease as $N$ increases.


Unbiased $E_{out}$ Estimation with Leave-One-Out Cross Validation


We now understand the necessity of including the regularization error term in the learning algorithm.  Our regularization error contains a parameter $\lambda$.  Notice that $\lambda$ is not a model parameter; our model does not contain $\lambda$!  $\lambda$ here is called a hyper-parameter: its role is to guide our learning algorithm to find a better $g$, but it is not a parameter of $g$, rather a parameter of the learning algorithm.  A question carried over from the previous chapter is: since we cannot derive $\lambda$ from first-principles model analysis, how do we choose the right $\lambda$?

The ultimate goal for $\lambda$ is to produce a model $g$ that gives the best $E_{out}$.  Therefore, if somehow we can get our hands on a test data set $\mathcal{D}^\prime$, we can directly estimate $E_{out}$ and then pick the $\lambda$ that produces the $g$ corresponding to the smallest $E_{out}$.  Given only a data set of $N$ training records, the only option is to carve out $K$ records for this test purpose, call them $\mathcal{D}^\prime$, and use the remaining $N-K$ records as our new training set $\mathcal{D}$.  Thus for a given $\lambda$, our $E_{out}$ estimation routine goes like this (using a regression problem as an example):

$\text{split $N$ records into $(N-K)$ training records $\mathbf{x}_{train}$ and $K$ validation records $\mathbf{x}_{validate}$} \\$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{\mathbf{x}_{train}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda {\Vert \mathbf{w} \Vert}^2 \\$
$E_{out} \vert_{\lambda} =\sum_{\mathbf{x}_{validate}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2$

Basically we fit the $N-K$ training records to find the optimal $\mathbf{w}$, and evaluate its performance $E_{out}$ using the remaining $K$ records withheld for validation.  Although we use ridge regression in the above procedure, that is just for convenience; the idea applies to any model and any regularization form.
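A direct numpy translation of this routine for ridge regression, reporting the average squared error per validation record; the split size, $\lambda$, and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, lam = 100, 25, 1.0
X = rng.normal(size=(N, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 * rng.normal(size=N)

perm = rng.permutation(N)                     # random split into train / validation
train, validate = perm[K:], perm[:K]

# fit ridge regression on the N-K training records
Xtr, ytr = X[train], y[train]
w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]), Xtr.T @ ytr)

# estimate E_out on the K withheld validation records
Xva, yva = X[validate], y[validate]
E_out = np.mean((Xva @ w - yva) ** 2)
print(E_out)
```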
The catch is that previously our final hypothesis for a given $\lambda$ was trained using all $N$ available records: $g^N$.  However, in order to estimate $E_{out}(g^N)$, we reserve $K$ records as our validation set, so we now have $K$ fewer records for training.  The model whose performance we estimate is therefore actually $g^{N-K}$, and we hope:

$$\begin{equation} E_{out}(g^N) \approx E_{out}(g^{N-K}) \approx E_{out}^{validate}(g^{N-K}) \end{equation}.$$
That is, to estimate $E_{out}$ for our final hypothesis $g^N$, we first need the hypothesis obtained from the training set, $g^{N-K}$, to be close to it, which means $K$ should be small, i.e., the first part of Equation (14).  Then we need to estimate $E_{out}(g^{N-K})$ using the $K$ validation records, so $K$ has to be large to obtain an accurate estimation (Hoeffding Inequality), i.e., the second part of Equation (14).  These are two contradictory requirements.  If $K$ is small, the hypothesis $g^{N-K}$ is closer to our final hypothesis, but its $E_{out}$ estimate is not reliable.  If $K$ is large, the $E_{out}$ estimate is reliable, but it is for a very different hypothesis based on much less training data, $N-K$.

In the extreme case, consider keeping only one record for training; our resulting hypothesis will be very noisy.  Its performance can be measured by the $N-1$ validation records pretty accurately, but the measurement will simply state, with high confidence, that our hypothesis is a poor one.  On the other extreme, if we keep $N-1$ records for training, we get a decent hypothesis, but then the $E_{out}$ based on one validation record is highly unreliable and does not reflect the true $E_{out}$ of our hypothesis; in this case, we have a hypothesis with unknown performance.  There is actually a clever way out of this dilemma.

We are actually going to adopt one of the extreme scenarios described above: keep $N-1$ records for training and 1 for validation.  How do we deal with the small validation set?  There are $N$ ways of splitting the records into training and validation sets, so we are going to loop through all $N$ splits and train the model on $N-1$ data points each time, resulting in $N$ hypotheses, where $g_i^-$ is the hypothesis obtained with the $i^{th}$ point excluded.  We then measure the performance of $g_i^-$ on the $i^{th}$ record.  Since the $N$ hypotheses share the same algorithm and the same hyper-parameters, and are just obtained on different sets of $N-1$ data points, we can assume their performance is about the same and very close to the performance of the final hypothesis that will be trained on the full $N$ records.  Now we can average over the $N$ $E_{out}$ scores and obtain an estimation of our hypothesis performance.  Each $g_i^-$ has never seen the $i^{th}$ record, therefore the $E_{out}$ based on the $i^{th}$ record is an unbiased estimation (neither systematically higher nor lower).  Given the $N$ measurements of the $g^-$ performance, their average is a decent estimate.  A more formal proof of Leave-One-Out Cross Validation (LOOCV) is outlined in LFD [2].  Basically, the expected error of the hypothesis $g_i^-$, over multiple virtual data sets $\mathcal{D}^\prime$ of size $N$, is:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathcal{D}^\prime} \; \mathbb{E}_{i \in \mathcal{D}^\prime} \;{\rm error} (g_i^-(\mathbf{x}_i), y_i) &= \mathbb{E}_{\mathcal{D}^\prime} \; {\rm error}(g_\mathcal{D^\prime}^-), \\ &= E_{out} (N-1). \end{aligned} \end{equation}$$

Imagine we repeatedly measure the performance of $g_i^-$ on records that the model has never seen; even though it is only one record at a time, we are effectively testing the hypothesis on a true test data set.  Therefore, our estimation of hypothesis performance is unbiased.  The LOOCV workflow for $E_{out}$ estimation works as follows:

$\text{for each record $j$ in $N$ records} \\$
$\quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {(y_k-f(\mathbf{x}_k, \mathbf{w}))}^2 + \lambda {\Vert \mathbf{w} \Vert}^2 \\$
$\quad E_{out}(j) \vert_{ \lambda} =  {(y_j-f(\mathbf{x}_j, \mathbf{w}))}^2 \\$
$E_{out}\vert_{ \lambda} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{\lambda} \\$
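The same routine in numpy form for a fixed $\lambda$, looping over all $N$ leave-one-out splits; the synthetic data and the value of $\lambda$ are placeholders for illustration.

```python
import numpy as np

def loocv_error(X, y, lam):
    """LOOCV estimate of E_out for ridge regression with a fixed lambda."""
    N, d = X.shape
    errors = np.empty(N)
    for j in range(N):
        keep = np.arange(N) != j                       # train on all records but j
        Xj, yj = X[keep], y[keep]
        w = np.linalg.solve(Xj.T @ Xj + lam * np.eye(d), Xj.T @ yj)
        errors[j] = (X[j] @ w - y[j]) ** 2             # test on the held-out record j
    return errors.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 * rng.normal(size=50)
print(loocv_error(X, y, lam=1.0))
```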

Once we know how to estimate $E_{out}$, we can find the optimal $\lambda$ value by looping over candidate values of $\lambda$; the complete $\lambda$ selection routines are outlined below.

For the K-validation approach:

$\text{for each $\lambda_i$ in candidates } \{\lambda_1, \lambda_2, \ldots, \lambda_z\} \\$
$\quad \text{split $N$ records into $(N-K)$ training records $\mathbf{x}_{train}$ and $K$ validation records $\mathbf{x}_{validate}$} \\$
$\quad \mathbf{w}_i^* = \arg\min_{\mathbf{w}} \sum_{\mathbf{x}_{train}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda_i {\Vert \mathbf{w} \Vert}^2 \\$
$\quad E_{out} \vert_{\lambda_i} =\sum_{\mathbf{x}_{validate}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 \\$
$\lambda^* = \arg\min_{\lambda \in \{\lambda_i\}} E_{out}\vert_{\lambda_i}$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^N {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda^* {\Vert \mathbf{w} \Vert}^2.$
Final model: $y = f(\mathbf{x}, \mathbf{w}^{*})$.
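The K-validation routine above, written out with scikit-learn pieces (the $\lambda$ grid, split size, and data are again placeholders): a single train/validation split picks $\lambda^*$, then the final model is refit on all $N$ records.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 * rng.normal(size=100)

# split N records into N-K training and K validation records
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=25, random_state=0)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
E_out = {lam: mean_squared_error(y_va, Ridge(alpha=lam).fit(X_tr, y_tr).predict(X_va))
         for lam in lambdas}
lam_star = min(E_out, key=E_out.get)           # lambda with smallest validation error

final_model = Ridge(alpha=lam_star).fit(X, y)  # retrain on all N records
print(lam_star, E_out[lam_star])
```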

For the LOOCV approach:

$\text{for each $\lambda_i$ in candidates } \{\lambda_1, \lambda_2, \ldots, \lambda_z\} \\$
$\quad \text{for each record $j$ in $N$ records} \\$
$\qquad \mathbf{w}_i^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {(y_k-f(\mathbf{x}_k, \mathbf{w}))}^2 + \lambda_i {\Vert \mathbf{w} \Vert}^2 \\$
$\qquad E_{out}(j) \vert_{ \lambda_i} =  {(y_j-f(\mathbf{x}_j, \mathbf{w}_i))}^2 \\$
$\quad E_{out}\vert_{ \lambda_i} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{\lambda_i} \\$
$\quad \lambda^* = \arg\min_{\lambda \in \{\lambda_i\}} E_{out}\vert_{\lambda_i}$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^N {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda^* {\Vert \mathbf{w} \Vert}^2.$
Final model: $y = f(\mathbf{x}, \mathbf{w}^{*})$.
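The LOOCV version of the same selection can be expressed with scikit-learn's GridSearchCV and a LeaveOneOut splitter (again a sketch with made-up data and an arbitrary $\lambda$ grid); refit=True performs the final training on all $N$ records automatically.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + 0.3 * rng.normal(size=60)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",   # sklearn maximizes scores, so errors are negated
    refit=True,                         # retrain the best lambda on all N records
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```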

The last two lines of both workflows state that once we obtain $\lambda^*$, we do a final training using all $N$ records and hopefully are guided to our best estimation of $\mathbf{w}^*$.  The reason we retrain with all $N$ records, instead of using the model associated with the $\lambda^*$ obtained from $N-K$ records within the loop, is the learning curve theory: a model trained on a larger data set conceptually has a better $E_{out}$, so the model trained using all $N$ records is expected to give a $g$ that performs better.
$$\begin{equation} E_{out}^N \lt E_{out}^{N-K} \end{equation}.$$
In the LOOCV procedure, we actually do not evaluate the performance of a particular hypothesis, but rather the average performance of $N$ different hypotheses, where each hypothesis is evaluated on only one data point.  The assumption is that these $N$ hypotheses are obtained using the same learning process, just with slightly different input data sets, therefore the average reflects the performance of the learning process, or the performance of an average hypothesis obtainable by this learning process.  You can either consider that we are roughly evaluating $\mathbb{E}_{\mathcal{D}}\; g^\mathcal{D}$, or, better, simply our learning process itself.

LOOCV allows us to use as many as $N-1$ records for model training while still obtaining an estimation of the performance of our final hypothesis; the trade-off is that we need to compute the model $N$ times instead of once.  When $N$ is a large number, instead of doing leave-one-out, we could split the data set into $k$ folds and loop through them, using one fold for validation at a time, in much the same way we do LOO.  This is called $k$-fold cross validation, which estimates the performance of the hypothesis $g^{N\frac{k-1}{k}}$, a slight under-estimation of our final model $g^{N}$, but we reduce the computational cost to a fraction $\frac{k}{N}$ of LOOCV.

Just as with the Hoeffding Inequality, we have a multiple-test problem here in our $\lambda$ selection workflow.  For each given $\lambda$ candidate, the training-validation split provides an unbiased estimation of its $E_{out}$; however, if we test $z$ values of $\lambda$ and pick the one that gives the least $E_{out}^*$, we are more likely to obtain a low $E_{out}^*$ by chance.  Recall the coin tossing example in Chapter 2: the best $E_{out}^*$ achieved in this procedure, $\min(E_{out})$, is an optimistically "biased" estimation of its true $E_{out}$, where biased means it is systematically smaller than its true value.  The worst case, $\max(E_{out})$, is probably pessimistically biased towards the other end.  The true performance $E_{out}$ should be (using the $K$-validation set as an example):
$$\begin{equation} E_{out}^{*N-K} = \min_{\lambda_i}(E_{out}^{\lambda_i,N-K}) \le E_{out}^{N-K} \le \max_{\lambda_i}(E_{out}^{\lambda_i,N-K}). \end{equation}$$

Combining Equations (16) and (17), the exact relationship between our final hypothesis's $E_{out}^N$ and our estimate $E_{out}^{*N-K}$ is uncertain.  In general, we probably expect the multiple-test effect to be more significant, so $E_{out}^{*N-K}$ is probably an optimistic (too small) estimate of our final hypothesis's true error.  The true performance will have to wait for test sets in future real applications, or be estimated using the nested cross-validation procedure discussed in the next section.

The idea of using a validation set to estimate $E_{out}$ and then provide feedback to the selection of the hyper-parameter $\lambda$ is a very powerful one.  We can also generalize this idea to select other model hyper-parameters or even models themselves.  The models here refer to linear models, polynomial models, SVM, random forest, etc.  Each model has its own hyper-parameters (including $\lambda$); e.g., a polynomial model can have its highest order as a hyper-parameter, and an SVM with a Gaussian kernel can have $\gamma$ as a hyper-parameter (Chapter 5).  We can consider the most general case, where each unique model-hyper-parameter combination is a new $\lambda$, and go through the above $\lambda$ selection process.  The outcome is the selection of a specific model-hyper-parameter combination that gives $\min(E_{out}(\lambda))$.  We will see how this works in practical cases next.


Hyper-parameter Selection and Model Selection


The cross-validation analysis above is extremely important for assessing different learning designs.  We can come up with creative architectures and be able to evaluate whether they are legitimate.  Let us look at some learning architectures we have probably debated in our past ML practice and see how they work in the light of the cross validation theory we just learned.


Say we participate in an ML competition for a one-million-dollar award, so we have every motivation to make sure our final hypothesis performs well on the secret set of test records locked in a safe by the competition host.  We are given $N$ records and there is no way for us to get any more.

Learning Architecture A is the architecture we use to produce our submission hypothesis.
$\text{for each (Model, hyper-parameter) in  } \{(M_1, p_1^1), (M_1, p_2^1), \ldots, (M_n, p_1^n), (M_n, p_2^n), \ldots, (M_n, p_m^n) \} \\$ $\quad \text{pick the } (M^*, p^*) \text{ gives $\min(E_{out}(N-1))$ under LOOCV (in gray)}$ $\color{gray}{ \qquad \text{for each record $j$ in $N$ records} \\ \quad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \quad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \qquad E_{out}\vert_{ M,p} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$ $\text{Final Hypothesis: retrain } (M^*, p^*) \text{ on $N$ records and get } M^*(\mathbf{w}^*) \vert_{p^*}.$
We basically use LOOCV for model and hyper-parameter selection, then train the selected combination on all $N$ records to obtain the parameters (e.g., $\mathbf{w}$) for the best possible $E_{out}$ performance.
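Architecture A maps fairly directly onto scikit-learn's GridSearchCV when the (Model, hyper-parameter) candidates are expressed as a pipeline grid; the candidate pool below (ridge regression and an RBF support vector regressor with a few hyper-parameter values) and the toy data are just illustrative choices, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

# the (Model, hyper-parameter) candidate pool
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [SVR(kernel="rbf")], "model__C": [1.0, 10.0], "model__gamma": [0.1, 1.0]},
]

arch_a = GridSearchCV(
    Pipeline([("model", Ridge())]),       # placeholder step, swapped in by the grid
    param_grid=param_grid,
    cv=LeaveOneOut(),                     # the inner LOOCV selection loop
    scoring="neg_mean_squared_error",
    refit=True,                           # final hypothesis retrained on all N records
)
arch_a.fit(X, y)
print(arch_a.best_params_)
```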

Due to the multiple-test problem, the true performance of our final hypothesis, $E_{out}(N)$, is most likely worse than the $E_{out}(N-1)$ achieved by $(M^*, p^*)$ within the selection loop, because we intentionally pick the model/hyper-parameter that gives the best $E_{out}$, which is biased.  The running time is determined by the total number of model-hyper-parameter combinations, $O(M \times P \times N)$.  The true performance of our final hypothesis can only be obtained by the host opening up his safe and testing our hypothesis on the secret test set.  Assuming we do not want to wait for the notification email from our host, let us see how we can estimate $E_{out}$ ourselves using Architecture B:


$\text{save $K$ records as test set} \\$
$\quad \text{run Architecture A on the remaining $(N-K)$ records in gray } \\$
$\color{gray}{ \qquad \text {loop through all $(M, P)$ candidates} \\ \quad \qquad \text{for each record $j$ in $N-K$ records} \\ \qquad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \qquad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \quad \qquad E_{out}\vert_{ M,p} = \frac{1}{N-K} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$
$\quad \text{obtain best model $M^*(\mathbf{w}^*) \vert_{p^*}$ by training on $N-K$ records}\\$
$\text{calculate $E_{out}(M^*(\mathbf{w}^*) \vert_{p^*})$ using $K$ test records } \\$

Here, we basically carve out $K$ records and pretend we have a test set, so that we know $E_{out}$.  However, the model we are evaluating was trained on $(N-K)$ records, $g^{N-K}$, which is conceptually less accurate compared to the Architecture A hypothesis $g^N$.  So we should still stick to Architecture A's output as our submission hypothesis.  The $E_{out}$ here is an unbiased estimation for the hypothesis $g^{N-K}$, which is a conservative estimation for $g^N$ according to learning curve theory.  Therefore our submission hypothesis is conceptually better than our $E_{out}$ estimation suggests.  Again, Architecture B is just for $E_{out}$ estimation; it should not be used for the final model submission.  The computation time for Architecture B is similar to A, i.e., $O(M \times P \times (N-K))$.  The disadvantage of Architecture B is what we talked about before: $K$ needs to be large in order to evaluate $E_{out}$, but a large $K$ eats away our training data set.  Okay, we know the solution to this problem, that is LOOCV, which leads us to Architecture C.
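Architecture B in the same scikit-learn vocabulary: hold out $K$ records as a pretend test set, run the Architecture A search on the remainder, and score the resulting model once on the held-out records.  This is a sketch with a shortened, made-up candidate grid and synthetic data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, LeaveOneOut, train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

# save K records as a test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, random_state=0)

# run Architecture A on the remaining N-K records
arch_a = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                      cv=LeaveOneOut(), scoring="neg_mean_squared_error")
arch_a.fit(X_tr, y_tr)

# unbiased E_out estimate of g^{N-K}, evaluated once on the K held-out records
print(mean_squared_error(y_te, arch_a.predict(X_te)))
```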

$\text{for each $i$ in $N$ records, keep $i$ as test set } \\$
$\quad \text{run Architecture A on the remaining $(N-1)$ records in gray } \\$
$\color{gray}{ \qquad \text {loop through all $(M, P)$ candidates} \\ \quad \qquad \text{for each record $j$ in $N-1$ records} \\ \qquad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \qquad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \quad \qquad E_{out}\vert_{ M,p} = \frac{1}{N-1} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$
$\quad \text{obtain best model $M_i^*(\mathbf{w}^*) \vert_{p^*}$ }\\$
$\quad \text{calculate $E_{out}^i(M_i^*(\mathbf{w}^*) \vert_{p^*})$ using record $i$ } \\$
$E_{out}(N-1) = \frac{1}{N}\sum_i \; E_{out}^i(M_i^*(\mathbf{w}^*) \vert_{p^*}).$

Here, our $E_{out}$ estimation is unbiased and probably very close to that of the final model $g^N$.  The computational cost is huge though, $O(M \times P \times N^2)$.  We have $N$ different picked hypotheses within the loop running Architecture A (lines 2-4); which one do we choose as our final answer?  The one with the best estimated $E_{out}$?  No, because that could well be a hypothesis that happens to achieve good performance due to multiple-test luck (multiple tests on model+hyper-parameter combinations).  Actually, we do not pick any candidate from Architecture C, but only use it to estimate hypothesis performance!  Our final submission hypothesis is still the one obtained from Architecture A.

Architecture C is known as the nested cross validation scheme, as it runs CV twice: the outer CV is for an unbiased $E_{out}$ estimation of the whole learning process described in Architecture A (i.e., the learning process includes the model and hyper-parameter selection logic), and the inner CV is for model and hyper-parameter selection.  This is the most computationally expensive design, but it gives the best $E_{out}$ estimation.  The multiple model+hyper-parameter combinations selected within each fold of Architecture C are useful for understanding the stability of a model+hyper-parameter being the best choice.  If we see that the model+hyper-parameter identified by Architecture A is consistently ranked lower compared to other choices in Architecture C, we know there probably is over-fitting in the model selection process (i.e., our pool of candidate model+hyper-parameters is too large).  Notice that since we evaluate $E_{out}$ for only one hypothesis in the outermost CV loop of both Architecture B and C, there is no multiple-test problem to worry about, and the $E_{out}$ obtained is unbiased.  Also bear in mind that we are evaluating the whole learning process described as Architecture A, not any of the hypotheses that appear inside Architecture A.  When $N$ is too large, we can use $k$-fold validation to replace the LOOCV loop to reduce computational cost.
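Architecture C, in this vocabulary, is cross_val_score (outer loop) wrapped around GridSearchCV (inner loop); $k$-fold splitters are used in this sketch instead of full LOOCV simply to keep it cheap, as the last sentence above allows, and the grid and data are again invented.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

# inner CV: model / hyper-parameter selection (Architecture A)
inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5), scoring="neg_mean_squared_error")

# outer CV: each fold re-runs the whole selection procedure on the remaining
# records and tests its output once on the held-out fold
outer_scores = cross_val_score(inner, X, y, cv=KFold(n_splits=10),
                               scoring="neg_mean_squared_error")
print(-outer_scores.mean())     # estimate of the whole learning process's E_out

# the submission hypothesis still comes from running the inner search on all N records
inner.fit(X, y)
print(inner.best_params_)
```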

To summarize, we construct our submission hypothesis using Architecture A.  Its $E_{out}$ should be estimated using Architecture C; however, the computational cost there is $O(N^2)$.  Our theoretical knowledge of cross validation and the learning curve enables us to analyze these three different architectures and understand their advantages and disadvantages.  This is a key skill that probably distinguishes data scientists from casual tool users.

There are a few links useful for further understanding model selection; we collect them here [3][4][5].

Reference

  1. LFD, page 64.
  2. LFD, page 147.

  3. The nested CV (splitting data into train, validation, and test) is used to estimate the performance of the whole model-selection procedure, while the normal CV is used to find the best model (hyper-parameter determination).
  4. We do not choose the final model from nested CV.