Wednesday, May 24, 2017

Chapter 3. Regularization & Cross Validation

Bayesian Interpretation of Regularization


Although the main conclusion from Chapter 2 is simple, i.e., we need to minimize the sum of $E_{in}$ and $E_{reg}$, the derivation process, also known as statistical learning theory, is somewhat complicated and involves many assumptions.  It relies on the concepts of the Hoeffding Inequality, the Bonferroni correction, the VC dimension, etc.  It also depends on ideal experiments, where virtual data sets $\mathcal D$ of size $N$ are sampled again and again in order to study the connection between $E_{in}$ and $E_{out}$; this is a frequentist's way of thinking.   In addition, we only showed the learning theory using a classification problem and did not prove that the same conclusion works for regression problems.  Let us reconstruct the regularization problem from a Bayesian point of view, where the whole VC argument can be avoided.  The Bayesian camp is proud that they do not have to struggle with the concept of over-fitting; however, we know there is no free lunch, and they have to defend something else instead: the prior.  The Bayesian approach is important, as it is the more popular way of deriving the same conclusion in many ML books; it also offers a much cleaner explanation of the regularization error and gives us confidence about when we need to use a customized $E_{reg}$.

When a Bayesian person looks at our model, he has a prior belief on $\mathbf{w}$ before seeing any data points.  For most real-life problems, we do have some common sense; e.g., for many models, hypotheses with extremely large $\mathbf{w}$ components on high-order polynomial terms are not worth evaluating.  For a classification problem, a large $\mathbf{w}$ can be scaled down without changing the classification results.  For a regression problem, a large $\mathbf{w}$ is unlikely to be the optimal solution, as $y$ tends to be finite.

Now let us consider a general case, where our prior belief is that $\mathbf{w}$ follows a prior distribution $p(\mathbf{w})$. The best $\mathbf{w}^*$ is the one that maximizes the conditional probability according to Bayes' theorem: $$\begin{equation} p(\mathbf{w} \vert \mathcal{D}) = \frac{p(\mathcal{D} \vert \mathbf{w})p(\mathbf{w})}{p(\mathcal{D})}. \end{equation}$$ The probability of observing our data set of $N$ data points is: $$\begin{equation} \begin{aligned} p(\mathcal{D}) &= \prod_{i=1}^N \; p(\mathbf{x}_i, y_i), \\ &= \int \left[ \prod_{i=1}^N \;  p(\mathbf{x}_i, y_i \vert \mathbf{w}) \right] p(\mathbf{w})d\mathbf{w}, \\ &= \int \left[ \prod_{i=1}^N \; p(y_i \vert \mathbf{x}_i, \mathbf{w})p(\mathbf{x}_i) \right] p(\mathbf{w}) d\mathbf{w}, \end{aligned} \end{equation}$$ where $p(\mathcal{D})$ integrates over all possible values of the model parameter $\mathbf{w}$ according to our prior belief. While frequentists think about the distribution of the data set $\mathcal{D}$, Bayesians think about the distribution of $\mathbf{w}$. $p(\mathcal{D})$ is independent of $\mathbf{w}$; therefore, maximizing Equation (1) is equivalent to maximizing the following:
$$\begin{equation} \begin{aligned} p(\mathcal{D} \vert \mathbf{w})p(\mathbf{w}), \end{aligned} \end{equation}$$ and thus minimizing its negative log transformation:
$$\begin{equation} -\log (p(\mathcal{D} \vert \mathbf{w})) - \log(p(\mathbf{w})). \end{equation}$$
The first term describes how well our model explains the training data points.  Since the VC theory was derived based on a classification problem, here we use a regression problem to reproduce the conclusion of the VC theory, where the noise follows a Gaussian distribution:
$$\begin{equation} p(\mathcal{D} \vert \mathbf{w}) = \prod_{i=1}^N \; \frac{1}{\sqrt{2 \pi} \sigma}\exp\left(-\frac{{(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2}{2 \sigma^2}\right)p(\mathbf{x}_i). \end{equation}$$
$$\begin{equation} -\log(p(\mathcal{D} \vert \mathbf{w})) \sim \sum_{i=1}^N \; {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 \end{equation},$$
which is the familiar least square error cost function.

If the data points we expect to see in a regression problem lie within a certain radius of the origin, there is no need to test fitted planes far outside that radius, right?  Let us translate this into a soft constraint, i.e., our belief is that $\mathbf{w}$ should be reasonably close to the origin:
$$\begin{equation} p(\mathbf{w}) = \frac{1}{\sqrt{2 \pi} \sigma} \exp( - \frac{{\Vert \mathbf{w} \Vert}^2}{2 \sigma^2}) \end{equation}.$$

$\sigma$ is a parameter reflecting how strong our prior belief is.  Then the negative log prior term becomes:
$$\begin{equation} -\log p(\mathbf{w}) \sim \lambda {\Vert \mathbf{w} \Vert}^2 \end{equation}.$$
That is, if our prior belief on $\mathbf{w}$ is a Gaussian prior, the regularization term is the familiar ridge regression term; $\lambda$ is a convenient parameter that replaces $\sigma$.

If $p(\mathbf{w})$ is a Laplace prior:
$$\begin{equation} p(\mathbf{w}) = \frac{1}{2 \sigma} \exp\left( - \frac{{\Vert \mathbf{w} \Vert}_1}{\sigma}\right) \end{equation}.$$
The negative log prior then gives the Lasso regression form $\lambda \sum_{i} \; \vert w_i \vert$.

So the Bayesian camp reaches Rome through a different route.  This further strengthens our argument that the regularization term is not a hack; there is a strong reason to have it as part of our cost function, so that the cost function can lead to a better hypothesis with a smaller $E_{out}$.  This holds whether we argue from Bayesian prior belief or from the previous VC theory based on multiple tests.  Please be aware that for linear models with zero-centered features, the $\mathbf{w}$ in the regularization error should be replaced by $\tilde{\mathbf{w}}$, i.e., $w_0$ should be excluded in most cases, as discussed in Chapter 2 Equation (19).
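To make the correspondence concrete, below is a minimal numpy sketch (the toy data, feature map, and $\lambda$ values are made up for illustration) that computes the MAP estimate under a Gaussian prior, which is exactly the ridge closed-form solution, and shows that a larger $\lambda$ (a tighter prior) shrinks $\tilde{\mathbf{w}}$ toward the origin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise, with a 5th-order polynomial feature map.
N = 30
x = rng.uniform(-1, 1, N)
y = 2 * x + 0.3 * rng.standard_normal(N)
X = np.vander(x, 6, increasing=True)      # columns: 1, x, x^2, ..., x^5

def ridge_map(X, y, lam):
    """MAP estimate under a Gaussian prior on w-tilde (w0 is not penalized)."""
    d = X.shape[1]
    P = np.eye(d)
    P[0, 0] = 0.0                         # exclude the bias term from the penalty
    return np.linalg.solve(X.T @ X + lam * P, X.T @ y)

for lam in [0.0, 0.1, 10.0]:
    w = ridge_map(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w~||^2={np.sum(w[1:]**2):8.3f}  w={np.round(w, 2)}")
```

Swapping the quadratic penalty for an $L_1$ penalty (the Laplace prior) has no closed form, but the same shrinking-toward-zero behavior holds, with the extra effect of driving some $w_i$ exactly to zero.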

Bias and Variance


Yet another interesting way to look at the difference between $E_{in}$ and $E_{out}$ is through a regression problem.  Consider the closeness between our picked hypothesis $g^\mathcal{D}$ and the target function $f$, averaged over all possible virtual data sets $\mathcal{D}$:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {\Vert f(\mathbf{x})-g^\mathcal{D}(\mathbf{x})\Vert}^2 & = \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {\Vert (f(\mathbf{x})-\hat{g}(\mathbf{x})) + (\hat{g}(\mathbf{x})-g^\mathcal{D}(\mathbf{x}))\Vert}^2 \end{aligned} \end{equation},$$
where
$$\begin{equation} \hat{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\; g^\mathcal{D}(\mathbf{x}). \end{equation}$$
$\hat{g}$ is a hypothetical hypothesis, obtained by averaging the individual $g^\mathcal{D}$ over all imaginary possible input data sets $\mathcal{D}$.  $\hat{g}$ does not have to be in the original hypothesis space (the average of several fitted hypotheses need not itself be a member of $\mathcal{H}$), but it does represent an expected, typical $g$, with the randomness introduced by the training set $\mathcal{D}$ removed.

Equation (10) can be written as:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathbf{x}} \; {(f(\mathbf{x})-\hat{g}(\mathbf{x}))}^2 + \mathbb{E}_{\mathcal{D}}\;\mathbb{E}_{\mathbf{x}} \; {(\hat{g}(\mathbf{x})-g^{\mathcal{D}}(\mathbf{x}))}^2 \end{aligned} \end{equation}.$$
Notice that a cross-product term has been removed due to Equation (11).  In Equation (12), the first term is how close our data-set-agnostic hypothesis $\hat{g}$ is to the target function $f$, which reflects the capability of our hypothesis set in mimicking $f$.  This is called the bias term, as it is a systematic error, i.e., for a given $\mathbf{x}$, $\hat{g}(\mathbf{x})$ is either consistently lower or consistently higher than the target value.  The bias will not go away, even if we have infinite $N$.  The second term is the statistical fluctuation in using a particular $\mathcal{D}$ to pick $g$.  If we had encountered a different data set $\mathcal{D}^\prime$, $g^{\mathcal{D}^\prime}$ would have been somewhat different.  Hypothetically, given enough $\mathcal{D}$, our picks converge to a stable average model $\hat{g}$.  So the second term is a variance term, reflecting the uncertainty of our model due to the randomness of seeing a particular training set.  If the $N$ of each individual training set is sufficiently large, the variance can be expected to be small, as each $g$ is a fit on $N$ points.

When $\mathcal{H}$ contains only simple hypotheses, the variance is very small and the bias term dominates: our simple hypothesis $\hat{g}$ is far away from $f$, and we underfit.  On the other hand, when $\mathcal{H}$ contains only complex hypotheses, the mean hypothesis $\hat{g}$ is probably closer to $f$, so the bias is small; however, due to the limited $N$, each time we are given a new data set, our learning algorithm picks a different $g$, and these picks are far away from $\hat{g}$ and from each other, i.e., the variance is large.  If $N$ is small, the variance error typically dominates and we are better off using a simpler model, as the error originating from bias is often smaller than the error originating from variance.  So this analysis gives the same conclusion as the VC theory.  Figure 3.1 summarizes what we discuss here [1].
Figure 3.1. For a simple model, bias is large and variance is small.  For a complex model, bias is small but variance is large: for most given $\mathcal{D}$, we are unlikely to land somewhere close to $\hat{g}$.
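The decomposition can be checked numerically.  The sketch below (a toy setup of my own, not taken from LFD) repeatedly draws small data sets from a known target $f(x)=\sin(\pi x)$, fits a simple and a complex polynomial model to each, and estimates the bias and variance terms of Equation (12) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)              # target function
x_grid = np.linspace(-1, 1, 200)             # grid used for the E_x average

def bias_variance(degree, n_datasets=500, N=10, noise=0.2):
    preds = np.empty((n_datasets, x_grid.size))
    for d in range(n_datasets):
        x = rng.uniform(-1, 1, N)            # a virtual data set D of size N
        y = f(x) + noise * rng.standard_normal(N)
        w = np.polyfit(x, y, degree)         # least-squares fit: g^D
        preds[d] = np.polyval(w, x_grid)
    g_bar = preds.mean(axis=0)               # \hat{g}, the average hypothesis
    bias2 = np.mean((f(x_grid) - g_bar) ** 2)
    var = np.mean((preds - g_bar) ** 2)
    return bias2, var

for deg in [1, 6]:                           # simple vs. complex hypothesis set
    b, v = bias_variance(deg)
    print(f"degree={deg}: bias^2={b:.3f}  variance={v:.3f}  total={b + v:.3f}")
```

With $N=10$, the degree-1 fit typically shows a large bias and a small variance, while the degree-6 fit shows the opposite, matching Figure 3.1.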


Learning Theory Review


We have covered a lot of ground in learning theory.  If you did not follow all the math, do not worry, as those equations are not used to solve real machine learning problems.  What will be used are the ML guidelines derived from this math.

First, the VC theory tells us that, due to the limited sample size $N$ and the large number of hypotheses in our search space $\mathcal{H}$, the real performance $E_{out}$ is expected to be $E_{in} + E_{reg}$, where $E_{reg}$ can be reduced by having a larger $N$ and using a smaller hypothesis space.  The guideline is that we cannot simply minimize $E_{in}$, which would lead to over-fitting.  By tuning the size of our hypothesis space, we could reach optimal learning.

Second, the Bayesian framework tells us our prior knowledge on model parameters is important.  The particular training data set we see could be noisy, so this data set should only be used to refine our prior knowledge, not to override it.  For example, we believe the fundamental laws of dynamics should be simple, so when we receive a set of experimental data recording the speed $s$ and time $t$ of a moving object, we should stick to simpler hypotheses such as $s \sim w_0+w_1t$ or $s \sim w_0+w_1t + w_2t^2$, instead of some ${20}^{th}$-order polynomial hypothesis, which would over-interpret the noise in our measurements.  So ML needs to weigh both the information carried by our data set, $E_{in}$, and our prior knowledge, $E_{reg}$.  How heavily should our prior knowledge be weighted?  Too much, and we are too stubborn to learn from data; too little, and we lose basic principles and are easily misled.  The optimal balance is probably something called "experience".  Personally, I like the Bayesian explanation, as it is simple and intuitive and it applies to both classification problems (see logistic regression in the next Chapter) and regression problems.

Third, the effect of model complexity can also be explained by the bias-variance decomposition.  Models that are too complex minimize bias; however, they are misled by the noise in the particular data set we see, so the variance is huge.  Models that are too simple are robust against noise in the data set, but unable to depict the truth in enough detail.  Unlike the other two, this explanation does not offer a mathematically workable scoring function, as we only have one data set.  A technique called bagging can be used to generate multiple virtual data sets from $\mathcal{D}$, and therefore an approximate form of $\hat{g}$, to reduce variance.  We will not get into bagging in this note.

To implement ML in practice, we have to turn the guidelines into an empirical form of cost function.  Let us use the Bayesian approach to derive a popular form:
$$\begin{equation} g = \arg\min_{h \in \mathcal{H}} \; E_{in} + \lambda \cdot L(\mathbf{w}) \end{equation}, $$
where $L(\mathbf{w})$ can be thought of as our prior belief before seeing the data, and $\lambda$ tunes how strong our belief is.  $E_{in}$ is the evidence observed from the particular training data set.  The second term is $E_{reg}$.  The best hypothesis can be found by minimizing the sum of these two terms.  One may wonder why $E_{reg}$ is a function of $N$ according to Equation (16) in Chapter 2, while here $\lambda$, originating from $\sigma$, is independent of $N$.  This is because $E_{in}$ here is the sum over all $N$ data points and is therefore proportional to $N$, so $E_{in}$ grows as $N$ increases.  The regularization term therefore becomes relatively less important as $E_{in}$ increases.  In other words, if we normalize $E_{in}$ so that it represents the average error per data point, the log-prior term would need to be divided by $N$; if $E_{in}$ is written in this normalized form, the optimal $\lambda$ actually decreases as $N$ increases, as the sketch below verifies numerically.
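Here is a quick sanity check (toy data of my own) that minimizing the summed error with a fixed $\lambda$ and minimizing the averaged error with $\lambda/N$ give exactly the same $\mathbf{w}^*$, so the two conventions differ only in how $\lambda$ is reported:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
X = rng.standard_normal((N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)
lam = 5.0

# Sum form:  minimize sum_i (y_i - x_i.w)^2 + lam * ||w||^2
w_sum = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Mean form: minimize (1/N) sum_i (y_i - x_i.w)^2 + (lam/N) * ||w||^2
w_mean = np.linalg.solve(X.T @ X / N + (lam / N) * np.eye(3), X.T @ y / N)

print(np.allclose(w_sum, w_mean))   # True: same minimizer, lambda rescaled by N
```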


Unbiased $E_{out}$ Estimation with Leave-One-Out Cross Validation


We now understand the necessity of including the regularization error term in the learning algorithm.  Our regularization error contains a parameter $\lambda$.  Notice that $\lambda$ is not a model parameter; our model does not contain $\lambda$!  $\lambda$ is called a hyper-parameter: its role is to guide our learning algorithm toward a better $g$, so it is not a parameter of $g$ but a parameter of the learning algorithm.  A question carried over from the previous Chapter is: since we cannot derive $\lambda$ from first-principles model analysis, how do we choose the right $\lambda$?

The ultimate goal for $\lambda$ is to produce a model $g$ that gives the best $E_{out}$.  Therefore, if we can somehow get our hands on a test data set $\mathcal{D}^\prime$, we can directly estimate $E_{out}$, and then pick the $\lambda$ that produces the $g$ corresponding to the smallest $E_{out}$.  Given only a data set of $N$ training records, the only option is to carve out $K$ records for this test purpose, call them $\mathcal{D}^\prime$, and use the remaining $N-K$ records as our new training set $\mathcal{D}$.  Thus, for a given $\lambda$, our $E_{out}$ estimation routine goes as follows (using a regression problem as an example):

$\text{split $N$ records into $(N-K)$ training records $\mathbf{x}_{train}$ and $K$ validation records $\mathbf{x}_{validate}$} \\$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{\mathbf{x}_{train}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda {\Vert \mathbf{w} \Vert}^2 \\$
$E_{out} \vert_{\lambda} =\sum_{\mathbf{x}_{validate}} {(y_i-f(\mathbf{x}_i, \mathbf{w}^*))}^2$

Basically, we fit the $N-K$ training records to find the optimal $\mathbf{w}$, and evaluate its performance $E_{out}$ using the remaining $K$ records withheld for validation.  Although we use ridge regression in the above procedure, it is just for convenience; the idea applies to any model and any regularization form.
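A minimal Python translation of this routine might look like the following sketch (the toy data, $K$, and $\lambda$ are made up; the validation error is reported as a mean rather than a sum):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, lam = 100, 25, 1.0
X = rng.standard_normal((N, 5))
y = X @ rng.standard_normal(5) + 0.2 * rng.standard_normal(N)

# Split N records into (N-K) training records and K validation records.
idx = rng.permutation(N)
train, validate = idx[K:], idx[:K]

def ridge_fit(X, y, lam):
    """Minimize the sum of squared errors plus lam * ||w||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_star = ridge_fit(X[train], y[train], lam)
E_out_est = np.mean((y[validate] - X[validate] @ w_star) ** 2)
print("estimated E_out for this lambda:", E_out_est)
```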
The catch is that previously our final hypothesis for a given $\lambda$ was trained using all $N$ available records: $g^N$.  However, in order to estimate $E_{out}(g^N)$, we reserve $K$ records as our validation set, so we now have $K$ fewer records for training.  The model whose performance we actually estimate is therefore $g^{N-K}$, and we hope:

$$\begin{equation} E_{out}(g^N) \approx E_{out}(g^{N-K}) \approx E_{out}^{validate}(g^{N-K}) \end{equation}.$$
That is, to estimate $E_{out}$ for our final hypothesis $g^N$, we first need the hypothesis $g^{N-K}$ obtained from the training set to be close to it, which means $K$ should be small, i.e., the first part of Equation (14).  Then we need to estimate $E_{out}(g^{N-K})$ using the $K$ validation records, so $K$ has to be large to obtain an accurate estimate (Hoeffding Inequality), i.e., the second part of Equation (14).  These are two contradictory requirements.  If $K$ is small, the hypothesis $g^{N-K}$ is closer to our final hypothesis, but its $E_{out}$ estimate is not reliable.  If $K$ is large, the $E_{out}$ estimate is reliable, but it is for a very different hypothesis based on much less training data ($N-K$ records).

In one extreme case, suppose we keep only one record for training; the resulting hypothesis will be very noisy.  Its performance can be measured pretty accurately by the $N-1$ validation records, but the measurement will state, with high confidence, that our hypothesis is a poor one.  At the other extreme, if we keep $N-1$ records for training, we get a decent hypothesis, but the $E_{out}$ based on one validation record is highly unreliable and does not reflect the true $E_{out}$ of our hypothesis; in this case, we have a hypothesis with unknown performance.  There is actually a clever way out of this dilemma.

We are actually going to adopt one of the extreme scenarios described above: keep $N-1$ records for training and 1 for validation.  How do we deal with the small validation set?  There are $N$ ways of splitting the records into training and validation sets, so we are going to loop through all $N$ splits, training the model on $N-1$ data points each time, resulting in $N$ hypotheses; $g_i^-$ is the final hypothesis with the $i^{th}$ point excluded.  We then measure the performance of $g_i^-$ on the $i^{th}$ record.  Since the $N$ hypotheses share the same algorithm and the same hyper-parameters, and are just obtained on different sets of $N-1$ data points, we can assume their performance is about the same and very close to the performance of the final hypothesis that will be trained on the full $N$ records.  Now we can average over the $N$ $E_{out}$ scores and obtain an estimate of our hypothesis performance.  Each $g_i^-$ has never seen the $i^{th}$ record, so the $E_{out}$ based on the $i^{th}$ record is an unbiased estimate (neither systematically higher nor lower).  Given the $N$ measurements of the $g^-$ performance, their average is a decent estimate.  A more formal proof of Leave-One-Out Cross Validation (LOOCV) is outlined in LFD [2].  Basically, the expected error of the hypothesis $g_i^-$, over multiple virtual data sets $\mathcal{D}^\prime$ of size $N$, is:
$$\begin{equation} \begin{aligned} \mathbb{E}_{\mathcal{D}^\prime} \; \mathbb{E}_{i \in \mathcal{D}^\prime} \;{\rm error} (g_i^-(\mathbf{x}_i), y_i) &= \mathbb{E}_{\mathcal{D}^\prime} \; E_{out}(g_{\mathcal{D}^\prime}^-), \\ &= E_{out} (N-1). \end{aligned} \end{equation}$$

Imagine we repeatedly measure the performance of $g_i^-$ on records that the model has never seen, even though only one at a time; we are effectively testing the hypothesis on a true test data set.  Therefore, our estimate of the hypothesis performance is unbiased.  The LOOCV workflow for $E_{out}$ estimation works as follows:

$\text{for each record $j$ in $N$ records} \\$
$\quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {(y_k-f(\mathbf{x}_k, \mathbf{w}))}^2 + \lambda {\Vert \mathbf{w} \Vert}^2 \\$
$\quad E_{out}(j) \vert_{ \lambda} =  {(y_j-f(\mathbf{x}_j, \mathbf{w}^*))}^2 \\$
$E_{out}\vert_{ \lambda} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{\lambda} \\$
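A direct implementation of this workflow for ridge regression could look like the sketch below (a helper of my own, usable on the toy X, y from the earlier hold-out sketch):

```python
import numpy as np

def loocv_error(X, y, lam):
    """LOOCV estimate of E_out for ridge regression at a given lambda."""
    N, d = X.shape
    errors = np.empty(N)
    for j in range(N):
        mask = np.arange(N) != j                        # leave record j out
        Xt, yt = X[mask], y[mask]
        w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(d), Xt.T @ yt)
        errors[j] = (y[j] - X[j] @ w) ** 2              # test on the held-out record
    return errors.mean()

# Example: loocv_error(X, y, 1.0) on the toy data defined above.
```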

Once we know how to estimate $E_{out}$, we can find the optimal $\lambda$ value by adding an outer loop over candidate $\lambda$ values; the complete $\lambda$ selection routines are outlined below.

For the K-validation approach:

$\text{for each $\lambda_i$ in candidates } \{\lambda_1, \lambda_2, \ldots, \lambda_z\} \\$
$\quad \text{split $N$ records into $(N-K)$ training records $\mathbf{x}_{train}$ and $K$ validation records $\mathbf{x}_{validate}$} \\$
$\quad \mathbf{w}_i^* = \arg\min_{\mathbf{w}} \sum_{\mathbf{x}_{train}} {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda_i {\Vert \mathbf{w} \Vert}^2 \\$
$\quad E_{out} \vert_{\lambda_i} =\sum_{\mathbf{x}_{validate}} {(y_i-f(\mathbf{x}_i, \mathbf{w}_i^*))}^2 \\$
$\lambda^* = \arg\min_{\lambda \in \{\lambda_i\}} E_{out}\vert_{\lambda_i}$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^N {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda^* {\Vert \mathbf{w} \Vert}^2.$
Final model: $y = f(\mathbf{x}, \mathbf{w}^{*})$.
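In code, the K-validation selection loop is a thin wrapper around the hold-out sketch above (reusing its toy data, ridge_fit helper, and train/validate split; the candidate grid is made up):

```python
import numpy as np

lambdas = np.logspace(-3, 2, 20)                 # candidate lambda grid (illustrative)
val_errors = []
for lam in lambdas:
    w = ridge_fit(X[train], y[train], lam)       # fit on the N-K training records
    val_errors.append(np.mean((y[validate] - X[validate] @ w) ** 2))
lam_star = lambdas[int(np.argmin(val_errors))]   # lambda with the smallest validation error
w_star = ridge_fit(X, y, lam_star)               # final retraining on all N records
print("selected lambda:", lam_star)
```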

For the LOOCV approach:

$\text{for each $\lambda_i$ in candidates } \{\lambda_1, \lambda_2, \ldots, \lambda_z\} \\$
$\quad \text{for each record $j$ in $N$ records} \\$
$\qquad \mathbf{w}_i^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {(y_k-f(\mathbf{x}_k, \mathbf{w}))}^2 + \lambda_i {\Vert \mathbf{w} \Vert}^2 \\$
$\qquad E_{out}(j) \vert_{ \lambda_i} =  {(y_j-f(\mathbf{x}_j, \mathbf{w}_i^*))}^2 \\$
$\quad E_{out}\vert_{ \lambda_i} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{\lambda_i} \\$
$\lambda^* = \arg\min_{\lambda \in \{\lambda_i\}} E_{out}\vert_{\lambda_i}$
$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^N {(y_i-f(\mathbf{x}_i, \mathbf{w}))}^2 + \lambda^* {\Vert \mathbf{w} \Vert}^2.$
Final model: $y = f(\mathbf{x}, \mathbf{w}^{*})$.
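The LOOCV selection routine can be written by combining the loocv_error helper sketched earlier with the same kind of loop (again, the grid and toy data are for illustration):

```python
import numpy as np

lambdas = np.logspace(-3, 2, 20)                        # candidate lambda grid (illustrative)
cv_scores = [loocv_error(X, y, lam) for lam in lambdas]
lam_star = lambdas[int(np.argmin(cv_scores))]

# Final training on all N records with the selected lambda.
d = X.shape[1]
w_star = np.linalg.solve(X.T @ X + lam_star * np.eye(d), X.T @ y)
print("selected lambda:", lam_star)
```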

The last two lines of both workflows state that once we obtain $\lambda^*$, we do a final training using all $N$ records and hopefully are guided to our best estimate of $\mathbf{w}^*$.  The reason we retrain with all $N$ records, instead of using the model associated with $\lambda^*$ obtained from the $N-K$ records within the loop, is the learning curve theory: a model/procedure trained on a larger data set conceptually has a better $E_{out}$, so the model trained using all $N$ records is expected to give a $g$ that performs better.
$$\begin{equation} E_{out}^N \lt E_{out}^{N-K} \end{equation}.$$
In the LOOCV procedure, we actually do not evaluate the performance of a particular hypothesis, but rather the average performance of $N$ different hypotheses, where each hypothesis is evaluated on only one data point.  The assumption here is that these $N$ hypotheses are obtained using the same learning process, just with slightly different input data sets; therefore, the average reflects the performance of the learning process, or of the typical hypothesis that this learning process produces.  You can consider that we are, roughly, evaluating $\mathbb{E}_{\mathcal{D}}\; g^\mathcal{D}$, or better, simply our learning process itself.

LOOCV allows us to use as many as $N-1$ records for model training while still obtaining an estimate of the performance of our final hypothesis; the trade-off is that we need to compute the model $N$ times instead of once.  When $N$ is large, instead of doing leave-one-out, we can split the data set into $k$ folds and loop through them, using one fold for validation at a time, in much the same way we do LOO.  This is called $k$-fold cross validation, which estimates the performance of hypothesis $g^{N\frac{k-1}{k}}$, a slight under-estimate of our final model $g^{N}$, but it reduces the computational cost to a fraction $\frac{k}{N}$ of LOOCV's.
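A $k$-fold version of the earlier LOOCV helper might be sketched with scikit-learn's splitter and Ridge model as follows (the fold count and the use of Ridge are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def kfold_error(X, y, lam, k=5):
    """k-fold cross-validation estimate of E_out for ridge regression at a given lambda."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        errors.append(np.mean((y[val_idx] - model.predict(X[val_idx])) ** 2))
    return float(np.mean(errors))
```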

Just as with the Hoeffding Inequality, we have a multiple-test problem here in our $\lambda$ selection workflow.  For each given $\lambda$ candidate, the training-validation split provides an unbiased estimate of its $E_{out}$; however, if we test $z$ values of $\lambda$ and pick the one that gives the least $E_{out}^*$, we are more likely to obtain a low $E_{out}^*$ by chance.  Recall the coin-tossing example in Chapter 2: the best $E_{out}^*$ achieved in this procedure, $\min(E_{out})$, is an optimistically "biased" estimate of the true $E_{out}$, where bias means it is systematically smaller than the true value.  The worst-case scenario, $\max(E_{out})$, is probably pessimistically biased toward the other end.  The true performance $E_{out}$ should satisfy (using the $K$-validation set as an example):
$$\begin{equation} E_{out}^{*N-K} = \min_{\lambda_i}(E_{out}^{\lambda_i,N-K}) \le E_{out}^{N-K} \le \max_{\lambda_i}(E_{out}^{\lambda_i,N-K}). \end{equation}$$

Combining Equations (16) and (17), the exact relationship between our final hypothesis's $E_{out}^N$ and our estimate $E_{out}^{*N-K}$ is uncertain.  In general, we probably expect the multiple-test effect to be more significant, so $E_{out}^{*N-K}$ is probably an over-estimate of our final hypothesis's performance.  The true performance will have to wait for test sets in future real applications, or be estimated using the nested cross-validation procedure discussed in the next section.

The idea of using a validation set to estimate $E_{out}$ and then provide feedback to the selection of the hyper-parameter $\lambda$ is a very powerful one.  We can also generalize this idea to select other model hyper-parameters, or even the models themselves.  The models here refer to a linear model, polynomial models, SVM, random forest, etc.  Each model has its own hyper-parameters (including $\lambda$); e.g., a polynomial model can have its highest order as a hyper-parameter, and an SVM with a Gaussian kernel can have $\gamma$ as a hyper-parameter (Chapter 5).  We can consider the most general case, where each unique model-hyper-parameter combination is treated as a new $\lambda$, and go through the above $\lambda$ selection process.  The outcome is the selection of the specific model-hyper-parameter combination that gives $\min(E_{out}(\lambda))$.  We will see how this works in practical cases next.


Hyper-parameter Selection and Model Selection


The cross-validation analysis above is extremely important for assessing different learning designs.  We can come up with creative architectures and evaluate whether they are legitimate.  Let us look at some learning architectures we have probably debated in our past ML practice and see how they hold up in light of the cross-validation theory we just learned.


Say we participate in an ML competition with a one-million-dollar award, and we have every motivation to make sure our final hypothesis performs well on the secret set of test records locked in a safe by our competition host.  We are given $N$ records and there is no way for us to get any more.

Learning Architecture A is the architecture we use to produce our submission hypothesis:
$\text{for each (Model, hyper-parameter) in  } \{(M_1, p_1^1), (M_1, p_2^1), \ldots, (M_n, p_1^n), (M_n, p_2^n), \ldots, (M_n, p_m^n) \} \\$ $\quad \text{pick the } (M^*, p^*) \text{ that gives $\min(E_{out}(N-1))$ under LOOCV (in gray)}$ $\color{gray}{ \qquad \text{for each record $j$ in $N$ records} \\ \quad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \quad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \qquad E_{out}\vert_{ M,p} = \frac{1}{N} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$ $\text{Final Hypothesis: retrain } (M^*, p^*) \text{ on $N$ records and get } M^*(\mathbf{w}^*) \vert_{p^*}.$
We basically use LOOCV for model and hyper-parameter selection, then train the selected combination on all $N$ records to obtain the parameters (e.g., $\mathbf{w}$) for the best possible $E_{out}$ performance.
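With scikit-learn, Architecture A could be sketched roughly as below; the Ridge/Lasso candidates and their grids are made-up placeholders, and X, y stand for the full set of $N$ records (e.g., the toy data from the earlier sketches).  Note that GridSearchCV with its default refit behavior already retrains the best combination on all the records it is given.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Candidate (model, hyper-parameter grid) pairs -- illustrative choices only.
candidates = [
    (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    (Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1]}),
]

best_score, best_model = -np.inf, None
for model, grid in candidates:
    # Inner LOOCV over the hyper-parameter grid for this model.
    search = GridSearchCV(model, grid, cv=LeaveOneOut(),
                          scoring="neg_mean_squared_error").fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

# best_model is the (M*, p*) combination refit on all N records.
```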

Due to the multiple-test problem, the true performance of our final hypothesis, $E_{out}(N)$, is most likely worse than the $E_{out}(N-1)$ achieved by $(M^*, p^*)$ within the selection loop, because we intentionally pick the model/hyper-parameter combination that gives the best $E_{out}$, which is a biased estimate.  The running time is determined by the total number of model-hyper-parameter combinations: $O(M \times P \times N)$.  The true performance of our final hypothesis can only be obtained by the host opening up the safe and testing our hypothesis on the secret test set.  Assuming we do not want to wait for the notification email from our host, let us see how we can estimate $E_{out}$ ourselves using Architecture B:


$\text{save $K$ records as test set} \\$
$\quad \text{run Architecture A on the remaining $(N-K)$ records in gray } \\$
$\color{gray}{ \qquad \text {loop through all $(M, P)$ candidates} \\ \quad \qquad \text{for each record $j$ in $N-K$ records} \\ \qquad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \qquad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \quad \qquad E_{out}\vert_{ M,p} = \frac{1}{N-K} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$
$\quad \text{obtain best model $M^*(\mathbf{w}^*) \vert_{p^*}$ by training on $N-K$ records}\\$
$\text{calculate $E_{out}(M^*(\mathbf{w}^*) \vert_{p^*})$ using $K$ test records } \\$

Here, we basically carve out $K$ records and pretend we have a test set, so that we can estimate $E_{out}$.  However, the model we are evaluating was trained on $(N-K)$ records, $g^{N-K}$, which is conceptually less accurate than the Architecture A hypothesis $g^N$.  So we should still stick to Architecture A's output as our submission hypothesis.  The $E_{out}$ here is an unbiased estimate for hypothesis $g^{N-K}$, which is a conservative estimate for $g^N$ according to learning curve theory.  Therefore, our submission hypothesis is conceptually better than our $E_{out}$ estimate suggests.  Again, Architecture B is just for $E_{out}$ estimation; it should not be used for the final model submission.  The computation time for Architecture B is similar to A, i.e., $O(M \times P \times (N-K))$.  The disadvantage of Architecture B is what we talked about before: $K$ needs to be large in order to evaluate $E_{out}$, but a large $K$ eats away our training data set.  Okay, we know the solution to this problem, LOOCV, which leads us to Architecture C.
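Assuming the Architecture A loop above has been wrapped in a hypothetical helper run_architecture_a(X, y) that returns the refit best model, Architecture B could be sketched as:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Architecture B: hold out K records as a pretend test set, run the Architecture A
# selection on the remaining N-K records, then estimate E_out on the held-out K.
K = 25                                             # illustrative test-set size
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=K, random_state=0)
g_nk = run_architecture_a(X_tr, y_tr)              # hypothetical wrapper of the loop above
E_out_est = np.mean((y_te - g_nk.predict(X_te)) ** 2)
print("estimated E_out of g^(N-K):", E_out_est)
```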

$\text{for each $i$ in $N$ records, keep $i$ as test set } \\$
$\quad \text{run Architecture A on the remaining $(N-1)$ records in gray } \\$
$\color{gray}{ \qquad \text {loop through all $(M, P)$ candidates} \\ \quad \qquad \text{for each record $j$ in $N-1$ records} \\ \qquad \qquad \mathbf{w}_{M, p}^* = \arg\min_{\mathbf{w}} \sum_{k \ne j} {E_k\vert_{M, p}}  + E_{reg}\vert_{M, p}, \\ \qquad \qquad E_{out}(j) \vert_{M,p} = E_j(\mathbf{w}_{M, p}^*) \\ \quad \qquad E_{out}\vert_{ M,p} = \frac{1}{N-1} \sum_{j} \; E_{out}(j) \vert_{M,p} \\ \qquad (M^*, p^*)= \arg\min_{M,p} E_{out}\vert_{M,p}\\ }$
$\quad \text{obtain best model $M_i^*(\mathbf{w}^*) \vert_{p^*}$ }\\$
$\quad \text{calculate $E_{out}^i(M_i^*(\mathbf{w}^*) \vert_{p^*})$ using record $i$ } \\$
$E_{out}(N-1) = \frac{1}{N}\sum_i \; E_{out}^i(M_i^*(\mathbf{w}^*) \vert_{p^*}).$

Here, our $E_{out}$ estimate is unbiased and probably very close to that of the final model $g^N$.  The computational cost is huge, though: $O(M \times P \times N^2)$.  We have $N$ different picked hypotheses within the loop running Architecture A (lines 2-4); which one should we choose as our final answer?  The one with the best estimated $E_{out}$?  No, because that could well be a hypothesis that happens to achieve good performance due to multiple-test luck (multiple tests on model + hyper-parameter combinations).  Actually, we do not pick any candidate from Architecture C; we only use it to estimate hypothesis performance!  Our final submission hypothesis is still the one obtained from Architecture A.

Architecture C is known as the nested cross-validation scheme, as it runs CV twice: the outer CV gives an unbiased $E_{out}$ estimate of the whole learning process described in Architecture A (i.e., the learning process includes the model and hyper-parameter selection logic), and the inner CV is for model and hyper-parameter selection.  This is the most computationally expensive design, but it gives the best $E_{out}$ estimate.  The multiple model + hyper-parameter combinations selected within each fold of Architecture C are useful for understanding how stable a given combination is as the best choice.  If the combination identified by Architecture A is consistently ranked lower than other choices within Architecture C, there is probably over-fitting in the model selection process (i.e., our pool of candidate model + hyper-parameter combinations is too large).  Notice that since we evaluate $E_{out}$ for only one hypothesis in the outermost CV loop of both Architectures B and C, there is no multiple-test problem to worry about, and the $E_{out}$ obtained is unbiased.  Also bear in mind that we are evaluating the whole learning process described as Architecture A, not any of the hypotheses that appear inside Architecture A.  When $N$ is too large, we can use $k$-fold validation to replace the LOOCV loops to reduce the computational cost.
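Using $k$-fold splits for both levels, the nested scheme can be sketched in a few lines of scikit-learn; the Ridge model, the grid, and the fold counts are illustrative assumptions, and X, y are again the full $N$ records:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Inner CV: hyper-parameter selection (the Architecture A logic for one model family).
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=0),
                     scoring="neg_mean_squared_error")

# Outer CV: unbiased estimate of the whole selection procedure's E_out.
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=1),
                               scoring="neg_mean_squared_error")
print("estimated E_out of the whole learning process:", -outer_scores.mean())
```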

To summarize, we construct our submission hypothesis using Architecture A.  Its $E_{out}$ should be estimated using Architecture C; however, the computational cost there is $O(N^2)$.  Our theoretical knowledge of cross validation and learning curves enables us to analyze these three different architectures and understand their advantages and disadvantages.  This is a key skill that probably distinguishes data scientists from casual tool users.

There are a few links that are useful for further understanding model selection; we collect them here [3][4][5].

Reference

  1. LFD, page 64.
  2. LFD, page 147.
  3. Nested CV (splitting data into train, validate, and test sets) is used to estimate the performance of the model found, while normal CV is used to find the best model (hyper-parameter determination).
  4. We do not choose the final model from nested CV.


