Tuesday, May 23, 2017

Chapter 2. Learning Theory

Evaluating a Single Hypothesis - Hoeffding Inequality


Before we start, let us define two concepts: hypothesis and model.  A hypothesis is a specific function with no unknown parameters.  In the previous credit card approval example, approval based on income using $y = {\rm sign(salary \gt 30000)}$ is a specific hypothesis, where we approve applicants whose annual salary is above 30k.  $y = {\rm sign(salary \gt 50000)}$ is yet another hypothesis.  A hypothesis is fully specified before we even see any training data.  These two hypotheses are instances of the model $y = {\rm sign(salary} \gt w)$, where $w$ is a model parameter.  The model can be a somewhat nebulous concept: e.g., we can say a hypothesis set is a union of two models, $y = {\rm sign(salary} \gt w_1)$ and $y = {\rm sign(saving} \gt w_2)$, or we can say there is only one complex model $y = \delta \cdot {\rm sign(salary} \gt w_1) + (1-\delta) \cdot {\rm sign(saving} \gt w_2), \delta \in \{0, 1\}$, with parameters $\delta$, $w_1$ and $w_2$.   The learning process, as described in Chapter 1, is to select the best hypothesis, i.e., to determine the best values of the model parameters, after seeing the training data set $\mathcal{D}$.
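To make the distinction concrete, here is a minimal Python sketch (the function names are made up for illustration): a hypothesis is a fully specified function, while a model is a parameterized family from which learning picks one member.

```python
# A hypothesis: a fully specified function with no free parameters.
def h_30k(salary):
    return +1 if salary > 30000 else -1

def h_50k(salary):
    return +1 if salary > 50000 else -1

# A model: a parameterized family of hypotheses; fixing w yields one hypothesis.
def make_threshold_hypothesis(w):
    def h(salary):
        return +1 if salary > w else -1
    return h

# "Learning" here means choosing w (i.e., choosing a hypothesis) after seeing data.
h = make_threshold_hypothesis(w=40000)
print(h(35000))   # -1
```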

Let us consider a classification problem $y \in \{-1, +1\}$.  Suppose our hypothesis set $\mathcal{H}$ contains only one hypothesis $h$, i.e., we already have the model and its parameter values in mind based on prior knowledge, before seeing the training data set $\mathcal{D}$.  In this case, the knowledge $h$ is given by the programmer, so no learning happens.  Given $N$ training data points, we can measure the error of the hypothesis within the given samples as $E_{in}$ ("in" stands for within-sample error).  From statistics, we know that the number of errors the same hypothesis makes on a sample follows a binomial distribution, so $E_{in}$ tells us something about $E_{out}$, the error measured by applying the exact same hypothesis to future, never-seen, out-of-sample data.  The true $E_{out}$ would be measured on the large number of records to be encountered in the future; in practice, it is always estimated from a finite set.  For convenience, let us assume we will use another $N$ new records to estimate $E_{out}$ when the need arises later.

With a reasonably large $N$, the true $E_{out}$ approximately follows a Gaussian distribution centered on $E_{in}$, with a standard deviation $\sigma$ derived from the binomial distribution:
$$\begin{equation} \sigma = \sqrt{E_{in}(1-E_{in})/N}. \end{equation}$$
The probability distribution of $E_{out}$ is:
$$\begin{equation} p(E_{out}) = \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{{(E_{out}-E_{in})}^2}{2 \sigma^2}\right). \end{equation}$$
As $N$ grows, $\sigma$ shrinks, and our estimate of $E_{out}$ gets closer and closer to $E_{in}$.  We are all familiar with this from election polls, where $E_{in}$ is the candidate's support rate among the $N$ people who participated in our poll, and $E_{out}$ is the true support rate in the population we try to predict.   When $N$ is large and thus $\sigma$ is small, $E_{in} \approx E_{out}$.
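As a quick numerical illustration of Equation (1), here is a toy poll example (the numbers are made up):

```python
import math

def poll_sigma(e_in, n):
    """Standard deviation of the out-of-sample estimate per Equation (1)."""
    return math.sqrt(e_in * (1.0 - e_in) / n)

# A poll of 1,000 people with 52% support: sigma is about 1.6 percentage points,
# so E_out is very likely within a few points of E_in.
print(poll_sigma(0.52, 1000))    # ~0.0158
print(poll_sigma(0.52, 100000))  # ~0.0016, shrinking as N grows
```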

For the convenience of later discussion, we nevertheless need to go one step further and derive a mathematical form describing how close the two terms are.  Let us calculate the probability of the undesirable case ("undesirable" here means the true $E_{out}$ is far worse than the in-sample estimate $E_{in}$), i.e., $E_{out} - E_{in} \ge \epsilon$, where $\epsilon$ is a positive tolerance value.  According to Equation (2), we have:
$$\begin{equation} \begin{aligned} p(E_{out} - E_{in} \ge \epsilon) &= \int_{x = \epsilon}^{+\infty} \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{x^2}{2 \sigma^2}\right) dx, \\ &= \frac{1}{\sqrt{2 \pi}\sigma} {\left[ \int_{y = \epsilon}^{+\infty} \int_{x = \epsilon}^{+\infty} \exp\left(-\frac{x^2+y^2}{2 \sigma^2}\right) dx\; dy \right]}^{1/2}, \\ &\le \frac{1}{\sqrt{2 \pi}\sigma} {\left[ \frac{\pi}{2}\int_{r = \sqrt{2}\epsilon}^{+\infty} \exp\left(-\frac{r^2}{2 \sigma^2}\right) r\;dr \right]}^{1/2}, \\ &= \frac{1}{2}\exp\left(-\frac{\epsilon^2}{2 \sigma^2}\right), \\ &\le \frac{1}{2}\exp(-2\epsilon^2N). \end{aligned} \end{equation}$$
The last step follows from Equation (1): $\sigma$ reaches its maximum at $E_{in} = \frac{1}{2}$, so $\sigma^2 \le \frac{1}{4N}$.  This is the Hoeffding Inequality:
$$\begin{equation} p(E_{out} - E_{in} \ge \epsilon) \le \frac{1}{2}\exp(-2\epsilon^2N), \end{equation}$$ which states that $E_{in}$ and $E_{out}$ of a hypothesis become exponentially close as $N$ increases.  Therefore, if a given hypothesis works well on the training data set $\mathcal{D}$ with sufficient $N$, it is likely to do well on the test set too, and a hypothesis performing poorly on $\mathcal{D}$ is likely to perform poorly on the test set.  This all makes sense; just remember this conclusion if the math in this section is challenging to follow.  Equation (4) applies to a single hypothesis; does it also apply to a model or a whole hypothesis set?
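The bound can be sanity-checked by simulation.  The sketch below assumes a single fixed hypothesis whose true error rate is 0.3 and estimates the left-hand side of Equation (4) empirically; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_check(e_out=0.3, n=200, eps=0.05, trials=100_000):
    """Empirically estimate p(E_out - E_in >= eps) for a fixed hypothesis
    with true error rate e_out, and compare with the bound in Equation (4)."""
    # Each sample point is misclassified with probability e_out;
    # E_in is the observed error rate on n points.
    errors = rng.random((trials, n)) < e_out
    e_in = errors.mean(axis=1)
    empirical = np.mean(e_out - e_in >= eps)
    bound = 0.5 * np.exp(-2 * eps**2 * n)
    return empirical, bound

print(hoeffding_check())  # the empirical tail probability sits below the bound
```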

Hypothesis Set and Learning Algorithm


In reality, the hypothesis set always contains multiple hypotheses; if we had only one hypothesis, there would be nothing to learn.  Let us consider a hypothesis set of $M$ independent hypotheses $\{h_1, h_2, \ldots, h_M\}$ and pretend they all have roughly similar performance $E_{out}$.  Given $N$ training samples, $E_{in}$ and $E_{out}$ for each individual hypothesis follow the Hoeffding Inequality, i.e.:
$$\begin{equation} p(E_{out}(h_i) - E_{in}(h_i) \ge \epsilon) \le \frac{1}{2}\exp(-2\epsilon^2N). \end{equation}$$
Since we evaluate all $M$ hypotheses on the training set and pick the one with the minimum $E_{in}$ as our final answer $g$:
$$\begin{equation}
E_{in}(g) = \min_{h \in \mathcal{H}}(E_{in}(h)),
\end{equation}$$
the probability of obtaining an impressively small $E_{in}(g)$ that is more than $\epsilon$ below its true $E_{out}(g)$ increases by up to a factor of $M$!  In other words, $E_{in}(g)$ tends to be an over-optimistic estimate of $E_{out}(g)$.

This is also known as the multiple testing problem in statistics and can be explained with a coin-tossing experiment.  Take $M$ = 1024 fair coins and throw each of them $N$ = 10 times to estimate how often it comes up heads.  Although the true probability of heads is always 50%, we expect that on average one of the $M$ coins will give a ten-heads-in-a-row result.  In fact, the chance that we observe at least one ten-heads-in-a-row is $1-{\left(\frac{1023}{1024}\right)}^{1024} \approx 63\%$.  So among the $M = 1024$ coin tossers, very likely one person finds a 10-head coin.  If he concludes that this coin is likely to give more heads in the future, he makes a mistake, since every coin in this experiment is fair.  Even though ten heads in a row happens with the small probability $1/1024$ when we test a single coin, the probability of seeing it somewhere grows roughly by a factor of 1024 when we test 1024 coins.
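A small simulation of this coin experiment (parameter values as in the text, the code itself is just an illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def coin_experiment(m=1024, n=10, repeats=5000):
    """Fraction of experiments in which at least one of m fair coins
    comes up heads n times in a row."""
    heads = rng.binomial(n, 0.5, size=(repeats, m))   # heads count per coin
    return np.mean((heads == n).any(axis=1))

print(coin_experiment())            # roughly 0.63
print(1 - (1023 / 1024) ** 1024)    # analytic value, ~0.63
```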

Applying the coin-tossing analogy to our case: although each of our $M$ hypotheses has only a small probability that its $E_{in}$ and $E_{out}$ are too far apart ($BAD$) according to the Hoeffding Inequality in Equation (5), once we search over a large enough $M$, the likelihood of encountering such a $BAD$ case is multiplied by up to $M$:
$$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \le \frac{M}{2}\cdot\exp(-2\epsilon^2N). \end{equation}$$
If $M$ is large enough to offset the exponential term in Equation (7), then $E_{in}(g)$ and $E_{out}(g)$ can be way off and learning becomes infeasible.  Just as we cannot pick the coin with the most heads out of 1024 trial series and claim with high confidence that it is head-biased, we cannot ignore the selection bias introduced by $M$.  The Hoeffding Inequality only works for one pre-determined hypothesis; it does not work for the best-performing hypothesis picked out of a big hypothesis set.  The practice of multiplying by $M$ to correct the multiple-testing probability in Equation (7) is called the Bonferroni correction, which, as we will see later, is a very conservative correction.
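The selection effect can be demonstrated with a sketch like the one below.  It assumes, for simplicity, $M$ hypotheses that make independent errors and all have true error 0.5 (no real skill); the in-sample error of the selected $g$ comes out systematically over-optimistic.

```python
import numpy as np

rng = np.random.default_rng(2)

def selection_bias(m=1000, n=50, trials=2000):
    """Each of m hypotheses has true error 0.5.  We pick the one with the
    smallest in-sample error on an n-point training set; its E_in is
    systematically below its E_out of 0.5 because of the selection."""
    e_in = rng.binomial(n, 0.5, size=(trials, m)) / n   # in-sample error per hypothesis
    best = e_in.min(axis=1)                             # E_in(g) in each simulated run
    return float(best.mean())

print(selection_bias())   # well below 0.5, even though every E_out is exactly 0.5
```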

The necessity of the Bonferroni correction in Equation (7) can be explained in another very clear manner [1].  Imagine we repeatedly sample many $N$-record data sets from the source, $\{\mathcal{D}_1, \mathcal{D}_2, \ldots \}$, and we apply each hypothesis to each of these data sets individually.  In Figure 2.1, each row of the table represents a particular hypothesis and each column represents a sampled data set.  If a training set $\mathcal{D}_1$ gives hypothesis $h_1$ an over-optimistic $E_{in}$ (i.e., its $E_{in}$ is no longer a good estimate of its $E_{out}$), we flag the cell corresponding to $h_1$ and $\mathcal{D}_1$ as $BAD$ (Figure 2.1).  The Hoeffding Inequality tells us that for each given hypothesis in $\{h_1, h_2, \ldots, h_M\}$, i.e., for each row in Figure 2.1 except the last tally row, the portion of $BAD$ data sets is exponentially small, as described by Equation (5).  $GOOD/BAD$ here does not refer to whether a hypothesis is accurate; it only refers to whether $E_{out} \approx E_{in}$.  A random-guessing model is always $GOOD$, because its performance on the in-sample data set and on out-of-sample data sets is basically the same.
 
| hypothesis | $\mathcal{D}_1$ | $\mathcal{D}_2$ | $\mathcal{D}_3$ | $\ldots$ | $\mathcal{D}_n$ | $\ldots$ |
|---|---|---|---|---|---|---|
| $h_1$ | $BAD$ | | | $\ldots$ | | $\ldots$ |
| $h_2$ | | $BAD$ | | $\ldots$ | | $\ldots$ |
| $\ldots$ | $\ldots$ | $\ldots$ | $\ldots$ | $\ldots$ | $\ldots$ | $\ldots$ |
| $h_M$ | | $BAD$ | | $\ldots$ | $BAD$ | $\ldots$ |
| $\text{any } h$ | $BAD$ | $BAD$ | | $\ldots$ | $BAD$ | $\ldots$ |
Figure 2.1. Each hypothesis (one per row) has exponentially few $BAD$ cells according to Equation (5).  The last row is the union of the $BAD$ cells of all previous $M$ rows, so it can have up to $M$ times as many $BAD$ cells, as described by Equation (7).  Since the final $g$ may be any one of the $M$ hypotheses, the bottom row is what matters: its number of $BAD$ columns is at most $M$ times the expected number of $BAD$ cells in a single row.

In reality we are given only one training data set, and we will pick one of the $M$ hypotheses as the final hypothesis $g$.  If the data set we are given happens to be a column in Figure 2.1 containing at least one cell flagged as $BAD$, then there is at least one hypothesis whose $E_{in}$ does not represent its $E_{out}$ well, and there is always a non-zero chance that this is the very hypothesis picked by our learning algorithm as $g$.  When this happens, the in-sample performance of our $g$ is too far from its real performance.  Only when the given $\mathcal{D}$ happens to be a column containing zero $BAD$ cells is it safe for us to pick any hypothesis as our answer $g$, because $E_{out}$ is then always predictable from $E_{in}$.  How often does the given data set $\mathcal{D}$ fall into a column containing zero $BAD$ cells?  In the worst-case scenario, each $BAD$ cell occupies a different column, so the last tally row contains the maximum possible number of $BAD$ columns described by Equation (7); that is the maximum probability of encountering an unlucky $\mathcal{D}$.  This is where the Bonferroni correction to the Hoeffding Inequality comes from.
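A rough simulation of Figure 2.1 (hypothetical parameter values; the hypotheses are assumed to make independent errors, which makes the union bound nearly tight):

```python
import numpy as np

rng = np.random.default_rng(3)

def bad_table(m=100, n_datasets=20_000, n=100, eps=0.15, e_out=0.3):
    """Schematic version of Figure 2.1: rows are hypotheses (all with true
    error e_out), columns are sampled data sets.  A cell is BAD when
    E_out - E_in >= eps for that hypothesis on that data set."""
    e_in = rng.binomial(n, e_out, size=(m, n_datasets)) / n
    bad = (e_out - e_in) >= eps
    per_row = bad.mean()                    # BAD fraction within a single row
    bad_columns = bad.any(axis=0).mean()    # columns containing >= 1 BAD cell
    union_bound = m * per_row               # the Bonferroni / union bound
    return per_row, bad_columns, union_bound

# With independent hypotheses the fraction of BAD columns is close to the
# union bound; heavily correlated hypotheses would make the bound loose.
print(bad_table())
```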

With Equation (7) in hand, we are heading for an immediate disaster.  Think about a popular model, the linear classifier, where ${\rm sign}(\mathbf{w}^\intercal \mathbf{x})$ determines whether we predict the $y$ label of a data point to be $+1$ or $-1$.  The weight vector $\mathbf{w} \in \mathbb{R}^{d+1}$, so there is an infinite number of possible $\mathbf{w}$, i.e., our hypothesis set $\mathcal{H}$ is a continuous hypothesis space and $M \to \infty$.  Then $E_{in}$ and $E_{out}$ are no longer close according to Equation (7).  In other words, the performance of the best linear hypothesis on the training data set no longer reflects its performance on future new samples.  Machines cannot learn from data using linear models!  However, our gut feeling is that linear models work well in many cases, so what went wrong in our reasoning?

There are a number of arguments against $M$ being infinite.  Many hypotheses are extremely close or nearly equivalent in a continuous hypothesis space (Figure 2.2).  If an infinite number of linear decision planes are so close that they almost always give identical classification results on most $N$-point data sets, they should not be counted as independent hypotheses, as their $BAD$ cell patterns in Figure 2.1 are basically identical.  Thus we can expect the "effective" number of independent hypotheses for continuous models to be finite, and there is hope that $\frac{M}{2} \cdot \exp(-2\epsilon^2N)$ actually converges to zero as $N$ grows.  This effective hypothesis count is worked out by the VC dimension theory.
Figure 2.2. Two linear hypotheses $h_A$ and $h_B$ are very close in practice, as they give the same classification results for the majority of data sets.  The $BAD$ patterns for these two hypotheses in Figure 2.1 would be largely identical.  If many $BAD$ cells occupy the same columns, the last tally row will not have $M$ times more $BAD$ cells than a single row.


VC Dimension


Let us consider a two-dimensional linear model for a two-class classification problem.  Given $N = 3$ data points (in general position), we can always find a line that correctly separates the $+$ and $-$ points, no matter how we label them; we say the model can shatter $N = 3$ points (Figure 2.3).  Given $N$ = 4 data points, this is no longer the case; Figure 2.3 shows an example where no linear hypothesis can reproduce the labels of all four points.  So for $N \ge 4$, even if we exhaustively test all possible linear hypotheses in $\mathcal{H}$, we still cannot generate all $2^N$ possible labelings.  This implies that, viewed through the lens of $N$ data points, the total effective number of hypotheses in $\mathcal{H}$ is less than $2^N$.  For example, in Figure 2.2, we really do not care about the difference between the two close hypotheses $h_A$ and $h_B$, as they produce nearly the same output for all data points in the problem.

Figure 2.3. The linear hypothesis set can generate all $2^3$ possible labelings for 3 given data points, but it cannot generate all $2^4$ possible labelings for 4 given data points; shown here is an example of 4 data points that no linear hypothesis can separate.  This means some linear hypotheses must produce identical labels across the $N$ data points as $N$ grows, and this redundancy heavily trims down the effective number of hypotheses.

When we count effective hypotheses through the lens of $N$ data points, we can see, without a rigorous proof, why a linear model can only produce approximately $N^2$ independent hypotheses.  Roughly, we can pick any two data points and draw a line (a decision plane) through them; there are about $\frac{N(N-1)}{2}$ such planes that can serve as the bases of independent hypotheses (label patterns).  We do not need to be exact here; the key is to get a sense that all decision planes can only produce $\sim O(N^2)$ distinct labelings for an $N$-record data set.  Even if we consider the $N$ samples used for model training plus another independent $N$ test samples for estimating $E_{out}$, there are at most on the order of $(2N)^2$ distinct label assignments over the $2N$ data points.  Any two hypotheses that share the same label output are effectively counted as one hypothesis in the view of our training and test data sets, as they produce the same $BAD$ pattern and therefore should contribute only once to the Bonferroni correction.  As long as the effective number of hypotheses in $\mathcal{H}$ is polynomial in $N$ rather than exponential in $N$, the Bonferroni-corrected Hoeffding Inequality in Equation (7) will eventually converge at large enough $N$.
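A sampling-based sketch of this counting argument: fix $N$ random points in the plane, draw many random lines, and count how many distinct labelings actually appear.  The count is an underestimate (random sampling may miss rare patterns), but it is enough to see growth far slower than $2^N$.

```python
import numpy as np

rng = np.random.default_rng(4)

def count_dichotomies(n_points, n_lines=200_000):
    """Roughly count the distinct labelings random lines can produce on
    n_points fixed points in the plane (a sampling-based underestimate)."""
    pts = rng.standard_normal((n_points, 2))
    w = rng.standard_normal((n_lines, 2))
    b = rng.standard_normal(n_lines)
    labels = pts @ w.T + b > 0                  # shape (n_points, n_lines)
    patterns = {tuple(col) for col in labels.T} # one label pattern per line
    return len(patterns)

for n in (3, 4, 6, 10):
    print(n, count_dichotomies(n), 2 ** n)
# The number of distinct labelings grows roughly like N^2, far slower than 2^N.
```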

When $N$ is small and the model can shatter $N$ records, the effective hypothesis count can in principle reach $2^N$, which offsets the exponential term in Equation (7), and the gap between $E_{in}$ and $E_{out}$ is not bounded. But as $N$ grows further, the effective count becomes a polynomial term in $N$, the exponential term eventually dominates, and the problem becomes learnable. The maximum $N$ our model can shatter is called the VC (Vapnik–Chervonenkis) dimension of the model, $d_{vc}$. Our new bound on $E_{out}$ is: $$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \le \frac{N^{d_{vc}}}{2} \exp(-2\epsilon^2N) \end{equation}$$
Let us work through an exercise to determine the VC dimension of the linear model space, where $\mathbf{w} = ( w_0, w_1, \ldots,  w_d)$, consisting of a total of $d+1$ model parameters.  Given $N=d+1$ training data points with arbitrary $\{ +1, -1\}$ labels, we are guaranteed (ignoring singular cases) to find a $\mathbf{w}$ vector that shatters them.  A $\mathbf{w}$ that correctly reproduces any possible label assignment can be found by solving the following linear system:
$$\begin{equation} X\mathbf{w} = \mathbf{y}. \end{equation}$$
$$\begin{equation} \mathbf{w} = (X^\intercal X)^{-1}X^\intercal \mathbf{y}. \end{equation}$$
So a linear model with $d+1$ parameters can shatter any $d+1$ points (in general position).
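A small numerical check of this argument, using the least-squares solution of Equation (10) (exact when $N = d+1$; for $N > d+1$ it is only a heuristic check, so treat the second number as illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)

def shatter_rate(d, n):
    """Fraction of all +/-1 labelings of n random points in d dimensions that a
    linear model with d+1 parameters reproduces via the least-squares solution
    of Equation (10).  The solve is exact when n = d+1."""
    X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])  # bias column first
    hits = 0
    for labels in itertools.product([-1.0, 1.0], repeat=n):
        y = np.array(labels)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        hits += np.all(np.sign(X @ w) == y)
    return hits / 2 ** n

print(shatter_rate(d=2, n=3))   # 1.0: every labeling of 3 points is realized
print(shatter_rate(d=2, n=4))   # < 1.0: some labelings of 4 points are not realized
```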

If we are given $d+2$ points, expanding $X\mathbf{w} = \mathbf{y}$ gives $d+2$ equations for only $d+1$ unknown weight components.  We may first solve for $\mathbf{w}$ using the first $d+1$ equations.  Suppose this solution predicts the $(d+2)$-th data point as $+1$; then the model cannot shatter a data set in which the label of the $(d+2)$-th point is actually $-1$, and vice versa.  (A more rigorous argument notes that any $d+2$ points are linearly dependent in the $(d+1)$-dimensional feature space, which forces some labeling to be unrealizable.)  This shows the VC dimension of a linear classification model with $d+1$ parameters is $d+1$.  The bound for the linear model therefore is:
$$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \le \frac{N^{d+1}}{2} \exp(-2\epsilon^2N). \end{equation}$$
Unlike the linear model above, the exact VC dimension of most machine learning models is very hard to find.  However, all we need to know is that as long as the VC dimension of the model is not $N$ itself but a finite value $d_{vc}$, the effective hypothesis count is a polynomial term on the order of $N^{d_{vc}}$, and the learning problem is feasible given a sufficiently large $N$.  The conclusion from the linear model nevertheless serves as a rule of thumb: the VC dimension is roughly the degrees of freedom, i.e., the number of independent model parameters.  This is an empirical observation, not a theorem.

Are there learning models with an exponential number $2^N$ of effective hypotheses, i.e., models that can shatter any given $N$ data points? Certainly.  My favorite example is a cheating polynomial classification/regression model.  Given $N$ data points $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$, we can write down a hypothesis that exactly reproduces every $y$ value as follows:
$$\begin{equation} \begin{aligned} y(\mathbf{x}) = &y_1\frac{|\mathbf{x}-\mathbf{x}_2|\cdot|\mathbf{x}-\mathbf{x}_3| \cdots |\mathbf{x}-\mathbf{x}_N |}{|\mathbf{x}_1-\mathbf{x}_2 | \cdot |\mathbf{x}_1-\mathbf{x}_3 |\cdots |\mathbf{x}_1-\mathbf{x}_N|}+ \\ & y_2\frac{|\mathbf{x}-\mathbf{x}_1|\cdot|\mathbf{x}-\mathbf{x}_3|\cdots |\mathbf{x}-\mathbf{x}_N|}{|\mathbf{x}_2-\mathbf{x}_1|\cdot|\mathbf{x}_2-\mathbf{x}_3|\cdots |\mathbf{x}_2-\mathbf{x}_N|}+ \\ & \cdots \\ & y_N\frac{|\mathbf{x}-\mathbf{x}_1|\cdot|\mathbf{x}-\mathbf{x}_2|\cdots |\mathbf{x}-\mathbf{x}_{N-1}|}{|\mathbf{x}_N-\mathbf{x}_1|\cdot|\mathbf{x}_N-\mathbf{x}_2|\cdots |\mathbf{x}_N-\mathbf{x}_{N-1}|}. \end{aligned} \end{equation}$$
This model fits any $N$ input data points exactly, no matter how we assign their labels/values, because it basically memorizes the answers for the training data.  Therefore the VC dimension of this model is unbounded (it grows with $N$), and there is no guarantee that the adjusted Hoeffding Inequality in Equation (7) converges to zero at large $N$.  In fact we should expect that this model cannot learn from data and has no predictive power.  Some readers may recognize that this model is essentially the exact solution of $\mathbf{y} = X\mathbf{w}$ for an $(N-1)^{th}$ order polynomial model (the same solution as Equation (10)), except that we keep increasing the model complexity as the number of input data points grows.
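A one-dimensional sketch of Equation (12) (scalar $x$, with a made-up target function and noise level): the in-sample error is exactly zero, while predictions away from the training points can be wildly off.

```python
import numpy as np

rng = np.random.default_rng(6)

def cheat_predict(x, X_train, y_train):
    """The memorizing model of Equation (12) for scalar inputs: each training
    point contributes a term that equals y_i at x_i and vanishes at the others."""
    pred = 0.0
    for i, (xi, yi) in enumerate(zip(X_train, y_train)):
        others = np.delete(X_train, i)
        pred += yi * np.prod(np.abs(x - others)) / np.prod(np.abs(xi - others))
    return pred

X_train = rng.uniform(-1, 1, size=8)
y_train = np.sin(3 * X_train) + 0.1 * rng.standard_normal(8)   # noisy toy target

# E_in is exactly zero: the model reproduces every training label.
print(max(abs(cheat_predict(x, X_train, y_train) - y) for x, y in zip(X_train, y_train)))

# But predictions between and beyond training points can be far from the target.
print(cheat_predict(0.95, X_train, y_train), np.sin(3 * 0.95))
```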

For most learnable models, where the VC dimension is bounded, we know with good confidence:
$$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \approx \frac{N^{d_{vc}}}{2} \exp(-2\epsilon^2N). \end{equation}$$
Setting
$$\begin{equation} \delta = p(E_{out}(g) - E_{in}(g) \ge \epsilon), \end{equation}$$
we get:
$$\begin{equation} \epsilon = \sqrt{\frac{1}{2N}\ln \left(\frac{N^{d_{vc}}}{2\delta}\right)}. \end{equation}$$
This can be simplified into:
$$\begin{equation}\begin{aligned} E_{out} &\approx E_{in} + \sqrt{\frac{1}{2N}\ln \left(\frac{M}{2\delta}\right)}\\
&\approx E_{in} + \sqrt{\frac{1}{2N}\ln \left(\frac{N^{d_{vc}}}{2\delta}\right)}\\
&\approx E_{in} + O\left(\sqrt{d_{vc}\frac{\ln N}{N}}\right). \end{aligned}\end{equation}$$
Our ultimate goal is a model that works well in terms of $E_{out}$; therefore, instead of finding the $g$ in $\mathcal{H}$ that gives the minimum $E_{in}$, we should look for the $g$ that gives the overall minimum $E_{out}$ in Equation (16).  The first term in the equation is the in-sample error $E_{in}$; the second term is a generalization term related to the VC dimension, which we will call $E_{reg}$.  If we ignore $E_{reg}$ and only focus on minimizing $E_{in}$, the model containing more parameters, i.e., a larger effective $M$, typically has the advantage of exploring a larger hypothesis space and can therefore find a smaller $E_{in}$.  A model such as the cheating polynomial above would be the winner (its $E_{in} = 0$), but its marvelous success in $E_{in}$ cannot be extrapolated to $E_{out}$: the model looks misleadingly good on the sample data set, yet fails miserably on future test sets.
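For intuition, here is a small helper that evaluates the generalization gap $\epsilon$ from this chapter's simplified bound (Equation (15)); keep in mind the true VC bound has different constants, so the absolute numbers are only indicative.

```python
import numpy as np

def vc_epsilon(n, d_vc, delta=0.1):
    """Generalization gap from the simplified bound of Equation (15);
    the full VC bound has different constants."""
    return np.sqrt(np.log(n ** d_vc / (2 * delta)) / (2 * n))

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(float(vc_epsilon(n, d_vc=3)), 3))
# The gap shrinks roughly like sqrt(d_vc * ln(N) / N) as N grows.
```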

Figure 2.4 is probably the most important figure in ML [2].  It says that if our model is too simple, it does not have enough power to describe the training data well, so $E_{in}$ (orange arrow in Figure 2.4a) is high even though $\epsilon$ is low (Equation (16)); $E_{out}$ is then high and the model is not useful.  However, if we adopt a rather complex model with many degrees of freedom, we can fit the training data really well and $E_{in}$ is very small (blue arrow in Figure 2.4a), but we lose generalization capability ($\epsilon$ is huge): the second term in Equation (16) blows up and $E_{in}$ ends up far from $E_{out}$ due to "over-fitting" (we have tested so many hypotheses that $g$ is very likely one that happens to have a good $E_{in}$ by chance but performs poorly in $E_{out}$).  $E_{out}$ is depicted as the upper bound of the confidence interval of the model error rate in Figure 2.4a.  This conclusion is also summarized in Figure 2.4b: we expect there to be an optimal model complexity for a given learning problem, neither too simple nor too complex.  The ideal model is the one minimizing Equation (16), not the one minimizing $E_{in}$ alone.

Figure 2.4.  The out-of-sample error $E_{out}$ consists of the in-sample error $E_{in}$ and a regularization error $E_{reg}$.  For over-simplified models, $E_{in}$ is large and $E_{reg}$ is small; this is called under-fitting.  For over-complicated models, $E_{in}$ is small but $E_{reg}$ is large; this is called over-fitting.  Neither is good; there is an optimal model complexity that balances the two terms and gives the optimal $E_{out}$.


There are indeed textbook examples that directly minimize Equation (16) when $d_{vc}$ is known, although I cannot recall the reference at this point.  In practice, we do not use Equation (16) as our cost function, because $d_{vc}$ is often unknown.  More importantly, VC theory only gives a loose bound, i.e., the generalization term in Equation (16) is far from accurate.  The VC bound is extremely loose because we made huge approximations to reach this simple mathematical form; the real effective hypothesis count is typically far less than $N^{d_{vc}}$.  For example, according to the VC bound, for a 2-$d$ linear model with $d_{vc}=3$, if we want the generalization error $\epsilon$ to be at most 0.1 with 90% confidence (i.e., $\delta \le 0.1$), $N$ would need to be about 30,000 [3].  In practice, experience shows "requiring $N$ be at least $10 \times d_{vc}$ to get decent generalization is a popular rule of thumb" [4].  Another example showing that the VC bound is only an upper bound is the 1-nearest-neighbor (1-NN) model.  This model can correctly predict any given $N$ training points, therefore it can shatter any $N$ points and its VC dimension is unbounded.  However, we know the 1-NN model works quite well in practice; in fact it can be proven that, for dense enough training data and a smooth target function, the error of 1-NN is at most about twice that of the best possible model.  If the best possible classification model has an accuracy of 98%, 1-NN can achieve an accuracy of about 96% [5].  Considering all the above arguments, we cannot come up with an analytical expression that captures the generalization error, so we need an alternative, empirical approach to estimate $E_{reg}$.

Regularization


There are two lessons we learn from the VC theory: first, machine learning is feasible, because we do not really deal with a hypothesis set of infinite effective size.  Second, the generalization error increases as model complexity increases; therefore, the more we can constrain the effective hypothesis count in $\mathcal{H}$, the smaller the regularization error will be.  Along this line, consider a second-order polynomial regression model:
$$\begin{equation} y = w_0 + w_1x_1 + w_2x_2+ \ldots + w_nx_n + w_{11}x_1^2 + w_{12} x_1x_2 + \ldots + w_{nn}x_n^2. \end{equation}$$
This is a superset of the linear hypothesis space, therefore the VC dimension of this polynomial model is larger than that of the linear model, and $E_{reg}$ is also larger.  To reduce our model space, we can either fall back to a linear model, or stay within the second-order polynomial space but trim it down.  For polynomial models, we can achieve this by constraining the weight vector norm ${\Vert \mathbf{w} \Vert}^2$.  We therefore can minimize Equation (16) through the following proxy form:
$$\begin{equation} E_{out} \approx E_{in} + \lambda {\Vert \mathbf{w} \Vert}^2.\end{equation}$$
$\lambda$ is a hyper-parameter that we assume allows us to best approximate the real generalization error.  This form of regularization is called ridge regression (or L2).  One can invent many other regularization forms, e.g., $\lambda \sum_{i} \vert w_i \vert$, which is called Lasso regression (or L1).  For a polynomial regression model $y= w_0 + w_1x +w_2x^2 + w_3x^3 + \ldots + w_qx^q$, we may also use $\lambda \sum_{q}\; e^q w_q^2$, so that weights of higher-order terms get penalized more (Figure 2.5).   A mixture of Lasso and ridge is called the elastic net.  In most cases, we do not know the real analytical expression for the regularization error, but it does not matter too much; we just use this term to adjust the size of our effective hypothesis space, so that we reach an optimal balance between minimizing $E_{in}$ and minimizing $E_{reg}$.
Figure 2.5. Adding a regularization penalty on $\mathbf{w}$ effectively reduces the hypothesis space.  The schematic drawing outlines the original $\mathbf{w}$ space (first) and the reduced space under ridge regression (second), Lasso regression (third), and other forms.

In a linear model, $w_0$ often serves as a data-normalization term: e.g., if ${\rm mean}(\mathbf{y}) \neq 0$ (or $\mathbf{x}$ is not zero-centered) in a regression problem, this offset in the input data can be absorbed by $w_0$; likewise, if we shift $y$ by a constant, $w_0$ can absorb that.  If we penalize $w_0$, the final hypothesis will be forced toward the origin.  Therefore, in most cases we do not penalize $w_0$, and the regularization error term should be modified as:
$$\begin{equation} \lambda {\Vert \tilde{\mathbf{w}} \Vert}^2 =  \lambda \sum_{i=1}^d \; w_i^2 \end{equation}.$$
Lasso and other regularization terms should also be modified accordingly.
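A minimal ridge-regression sketch along these lines (made-up data; the bias term $w_0$ is excluded from the penalty as in Equation (19)):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression that penalizes all weights except the bias w_0.
    X should not include the constant column; it is added here."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])
    penalty = lam * np.eye(d + 1)
    penalty[0, 0] = 0.0                      # do not shrink w_0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

# Usage: columns of X are features; lam controls the effective hypothesis space.
rng = np.random.default_rng(7)
X = rng.standard_normal((20, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)
print(ridge_fit(X, y, lam=1.0))    # bias near 5, slope estimates shrunk toward 0
```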

Figure 2.6 is an example showing that if we choose the right $\lambda$, we can significantly lower $E_{out}$ compared to a model that only tries to minimize $E_{in}$.  A regularization term is a necessity for essentially all learning models!  An immediate question follows: how do we choose the hyper-parameter $\lambda$, given that $E_{out}$ is only known in the future?  We will answer this question in the next chapter.  For now, we just need to understand that if $\lambda$ is too small, we do not account for the regularization error enough and tend to produce models that are too complex, i.e., over-fitting; if $\lambda$ is too large, we worry too much about over-fitting and resort to models that are too simple to explain the data (Figure 2.6).  Over-fitting is often more damaging than under-fitting.


Figure 2.6. The target function is a second-order polynomial with Gaussian noise for $x \in [0, 1]$.  Our input data set contains 8 points, with the true function shown as the black curve on the right.  Our hypothesis space is $5^{th}$ order polynomial models.  The left plot shows how $E_{out}$ changes as $\lambda$ varies from $10^{-4}$ to $10^{4}$.  With $\lambda = 10^{-4}$, we obtain the red hypothesis, which tries to fit the noise in the input data points and is clearly an over-fit.  With $\lambda = 10^{4}$, we obtain the blue hypothesis, which is basically a flat line and does not capture the real trend.  The best hypothesis is obtained when $\lambda$ is around 1, where the corresponding green hypothesis is reasonably close to the true target function.  Therefore, introducing the regularization term helps us reach an overall better hypothesis.  Note: I made a mistake in creating this example; my regularization term included $w_0$, therefore the blue hypothesis is too low.  With the correct regularization, the blue line would have gone through the mean of the data points.
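The experiment behind Figure 2.6 can be approximated with a sketch like the following (my own target function, noise level, and $\lambda$ grid, not the exact setup used for the figure):

```python
import numpy as np

rng = np.random.default_rng(8)

def poly_features(x, degree=5):
    return np.vstack([x ** k for k in range(degree + 1)]).T   # includes bias column

def fit_ridge(Phi, y, lam):
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0                     # leave the bias unpenalized
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

def target(x):
    return 1.0 - 2.0 * x + 3.0 * x ** 2     # an assumed 2nd-order target

# 8 noisy training points, a large held-out set to approximate E_out.
x_tr = rng.uniform(0, 1, 8)
y_tr = target(x_tr) + 0.3 * rng.standard_normal(8)
x_te = rng.uniform(0, 1, 2000)
y_te = target(x_te) + 0.3 * rng.standard_normal(2000)

for lam in (1e-4, 1e-2, 1.0, 1e2, 1e4):
    w = fit_ridge(poly_features(x_tr), y_tr, lam)
    e_out = np.mean((poly_features(x_te) @ w - y_te) ** 2)
    print(lam, round(float(e_out), 3))
# E_out is typically lowest at an intermediate lambda: too small over-fits,
# too large under-fits.
```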

Where Does Over-fitting Come From?


It is reasonably straightforward to see that over-fitting occurs when our model is more complicated than the true target.  If the true target function $f(y \vert \mathbf{x})$ is a second-order polynomial, fitting it with a $5^{th}$ order polynomial model without regularization will produce over-fitting.  This is what we saw in Figure 2.6.

But what if our hypothesis set contains the correct model?  We generate data points from a $5^{th}$ order polynomial function with noise (Figure 2.7, left) and fit them with either a $5^{th}$ order or a $2^{nd}$ order polynomial by minimizing $E_{in}$.  We then plot $E_{out}$ as a function of $N$ (Figure 2.7, right).  We see the $2^{nd}$ order model actually performs better for small $N$ (roughly $N \le 10$); only beyond that does the $5^{th}$ order model start to out-perform.
Figure 2.7.  Even though the target function is $5^{th}$ order, due to noise the $2^{nd}$ order model actually performs better for small $N$ (roughly $N \le 10$).
What happened?  The data contain noise.  Given enough data points, we would be able to tell the real trend (lower-frequency structure) from the noise (higher-frequency patterns).  However, when there are not enough data points, the $5^{th}$ order polynomial model cannot tell noise from real trend; it uses its extra power to model the noise and is misled.  Although the true answer is contained in $\mathcal H$, the model is unable to locate it because of the noise.  The $2^{nd}$ order model ignores both the high-frequency noise and the higher-frequency trends; it only tries to capture the lower-frequency trends.  Yes, its best hypothesis is still quite some distance from the truth, but it is overall closer to the truth.  Given a sufficient number of training data points (roughly $N \ge 10$ here), the variance of the complex model is reduced, and it is able to produce a solution closer to the truth and eventually wins.
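A rough re-creation of the Figure 2.7 experiment (my own $5^{th}$-order coefficients and noise level, so the exact crossover point will differ):

```python
import numpy as np

rng = np.random.default_rng(9)

def poly_features(x, degree):
    return np.vstack([x ** k for k in range(degree + 1)]).T

def e_out_after_fit(n, degree, trials=500, noise=0.5):
    """Average squared out-of-sample error (against the noiseless target) after
    least-squares fitting a polynomial of the given degree to n noisy samples
    of an assumed 5th-order target."""
    coef = np.array([0.2, -1.0, 0.5, 2.0, -1.5, 1.0])   # assumed 5th-order target
    x_te = rng.uniform(-1, 1, 2000)
    y_te_clean = poly_features(x_te, 5) @ coef
    errs = []
    for _ in range(trials):
        x_tr = rng.uniform(-1, 1, n)
        y_tr = poly_features(x_tr, 5) @ coef + noise * rng.standard_normal(n)
        w, *_ = np.linalg.lstsq(poly_features(x_tr, degree), y_tr, rcond=None)
        errs.append(np.mean((poly_features(x_te, degree) @ w - y_te_clean) ** 2))
    return float(np.mean(errs))

for n in (8, 15, 30, 100):
    print(n, round(e_out_after_fit(n, 2), 3), round(e_out_after_fit(n, 5), 3))
# With few samples the simpler 2nd-order fit often wins; with more data the
# 5th-order model (which matches the target) eventually takes over.
```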

Over-fitting can also occur even when there is no noise.  Figure 2.8 shows an experiment where the true model is a $10^{th}$ order polynomial without noise (black), but we only have 8 data points.  We know that fitting the data with a $10^{th}$ order model would be under-determined, so we will not do that.  But even a $5^{th}$ order polynomial fit (red) can be worse than a simpler $2^{nd}$ order polynomial (blue).  Both examples show that over-fitting is not caused by our model being inaccurate or by the data containing noise; it is a disease of insufficient data relative to the model complexity.  The rule of thumb is that we should have at least 10 times more data points than model parameters to avoid over-fitting.  If $N$ is insufficient, we should avoid the complex model, even if we know it is the true form of $f$!  Over-fitting is an important topic; please read LFD for more examples [5].
Figure 2.8. The target function is a $10^{th}$ order polynomial, however, we only have 8 data points.  The $5^{th}$ order model (red) over-fits, not because the data are noisy, but because there are too few data points.  The $2^{nd}$ order polynomial (blue) is overall more reliable for this example.


The Learning Curve


The learning curve describes how $E_{in}$ and $E_{out}$ behave as $N$ increases.  First, $E_{out}$ is always greater than $E_{in}$, due to the extra term $E_{reg}$ in Equation (16).  Second, $E_{in}$ generally increases as $N$ increases, but plateaus fairly quickly.  Third, the gap between $E_{out}$ and $E_{in}$ decreases as $N$ grows, i.e., $E_{in}$ and $E_{out}$ converge toward the same value; therefore $E_{out}$ generally decreases as $N$ increases and eventually plateaus as well.  The learning curve is very useful in practice, especially when you have influence over the sample size.  Given $N$ samples, we can train our model on $m$ data points, then $m+1$, then $m+2$, and so on, using the remaining data points to estimate $E_{out}$, and plot the learning curve, i.e., $E_{out}$ as a function of training size.  If $E_{out}$ is still decreasing appreciably when we have used up the majority of the $N$ samples for training, the model will likely benefit significantly from collecting more samples.  If $E_{out}$ has already plateaued, we already have a sufficient number of samples for this model, and there may be room to increase the model complexity to reduce $E_{in}$ further.
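A sketch of this procedure (a hypothetical linear-regression example; the `model_fit`/`model_err` callables are placeholders for whatever model you actually use):

```python
import numpy as np

rng = np.random.default_rng(10)

def learning_curve(X, y, model_fit, model_err, sizes):
    """For each training size m, fit on the first m points and estimate E_out
    on the remaining held-out points."""
    curve = []
    for m in sizes:
        params = model_fit(X[:m], y[:m])
        curve.append((m, model_err(params, X[m:], y[m:])))
    return curve

def fit_linear(X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def err_linear(w, X, y):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return float(np.mean((Xb @ w - y) ** 2))

X = rng.standard_normal((500, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + 0.5 * rng.standard_normal(500)
print(learning_curve(X, y, fit_linear, err_linear, sizes=(10, 20, 50, 100, 400)))
# The held-out error decreases and flattens as the training size grows.
```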

Let us discuss the behavior of $E_{in}$ and $E_{out}$ a bit further.  The larger $N$ is, the larger $E_{in}$ typically becomes.  Think about a $(d+1)$-parameter linear model: it can perfectly classify $N \le d+1$ data points and can fit $d+2$ points quite well; the fitting quality decays slightly as it has to fit more points.  However, $E_{in}$ soon converges and hardly increases further as the model fits more and more data.  On the other hand, the regularization term continues to decrease as $N$ increases.  Therefore, we expect $E_{out}$, the combined result, to decrease as $N$ increases.  The learning-curve behavior is summarized in Figure 2.9, based on LFD [6].
Figure 2.9. The learning curves for a simple and a complex model.  When $N$ is small, the simple model wins due to its smaller $E_{reg}$.  For a complex model, the learning curve can tell us whether the model will benefit significantly from a larger $N$.

References

  1. https://www.youtube.com/watch?v=MFL6xDn1lXM
  2. LFD, page 59
  3. LFD, page 57
  4. LFD, proof of 1-NN performance in e-Chapter: 6-6
  5. LFD, page 120-121, 123-125.
  6. LFD, page 67
