Tuesday, May 23, 2017

Chapter 2. Learning Theory

Evaluating a Single Hypothesis - Hoeffding Inequality


Before we start, let us define two concepts: hypothesis and model.  A hypothesis is a specific function with no unknown parameters.  In the previous credit card approval example, approval based on income using $y = {\rm sign(salary \gt 30000)}$ is a specific hypothesis, where we approve applicants whose annual salary is above 30k.  $y = {\rm sign(salary \gt 50000)}$ is yet another hypothesis.  A hypothesis is a fixed form before even seeing any training data.  These two hypotheses are instances of the model $y = {\rm sign(salary} \gt w)$, where $w$ is a model parameter.  The model can be a nebula concept, e.g., we can say a hypothesis set can be a union of two models including both $y = {\rm sign(salary} \gt w_1)$ and $y = {\rm sign(saving} \gt w_2)$, or we say there is only one complex model $y = \delta \cdot {\rm sign(salary} \gt w_1) + (1-\delta) \cdot {\rm sign(saving} \gt w_2), \delta = \{0, 1\}$, with parameters $\delta$, $w_1$ and $w_2$.   The learning process, as described in Chapter 1, is to select the best hypothesis, i.e., determine the best values of model parameters, after seeing training data set $\mathcal{D}$.

Let us consider a classification problem $y \in \{-1, +1\}$.  If our hypothesis set $\mathcal{H}$ only contains one hypothesis $h$, i.e., we already have the model and its parameters in mind based on our prior knowledge before seeing the training data set $\mathcal{D}$.  In this case, the knowledge $h$ is given by the programmer, so no learning happens.  When given $N$ training data, we can measure the error of the model within the given samples as $E_{in}$ ("in" stands for within-sample error).  From statistics, we know that the true accuracy of the model, $E_{out}$, (the error measured by applying the exact same model on future never-seen out-of-sample data) follows a binomial distribution.  The true $E_{out}$ will be measured by a large number of records to be encountered in the future, in practice, it is always estimated based on a finite set.  For convenience, let us assume we will use another $N$ new records to estimate $E_{out}$, when the need occurs to in later context.

With reasonable size $N$, true $E_{out}$ follows a Gaussian distribution centered on $E_{in}$, and a standard deviation $\sigma$ based on binomial distribution, where
$$\begin{equation} \sigma = \sqrt{E_{in}(1-E_{in})/N}. \end{equation}$$
The probability distribution of $E_{out}$ is:
$$\begin{equation} p(E_{out}) = \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{{(E_{out}-E_{in})}^2}{2 \sigma^2}\right). \end{equation}$$
As $N$ grows, $\sigma$ shrinks, our estimation of $E_{out}$ is closer and closer to $E_{in}$.  We are all familiar with this in election poll studies, where $E_{in}$ is the candidate support rate within $N$ people who participated in our pool, $E_{out}$ is the true support rate in the population we try to predict.   When $N$ is large and thus $\sigma$ is small, $E_{in} \approx E_{out}$.

For the convenience of later discussion, we nevertheless need to go one step further to derive a mathematical form describing how close the two terms are.  Let us calculate the probability of undesirable case ("undesirable" here refers to true $E_{out}$ is far worse than in-sample estimation $E_{in}$), i.e., $E_{out} - E_{in} \ge \epsilon$, $\epsilon$ is a positive tolerance value.  According to Equation (2), we have:
$$\begin{equation} \begin{aligned} p(E_{out} - E_{in} \ge \epsilon) &= \int_{x = \epsilon}^{+\infty} \frac{1}{\sqrt{2 \pi}\sigma} \exp\left(-\frac{x^2}{2 \sigma^2}\right) dx, \\ &= \frac{1}{\sqrt{2 \pi}\sigma} {\left[ \int_{y = \epsilon}^{+\infty} \int_{x = \epsilon}^{+\infty} \exp\left(-\frac{x^2+y^2}{2 \sigma^2}\right) dx\; dy \right]}^{1/2}, \\ &\le \frac{1}{\sqrt{2 \pi}\sigma} {\left[ \frac{\pi}{2}\int_{r = \epsilon}^{+\infty} \exp\left(-\frac{r^2}{2 \sigma^2}\right) r\;dr \right]}^{1/2}, \\ &= \frac{1}{4}\exp\left(-\frac{\epsilon^2}{2 \sigma^2}\right), \\ &\le \frac{1}{4}\exp(-2\epsilon^2N). \end{aligned} \end{equation}$$
The last step came from Equation (1), where the maximum value of $\sigma$ is archived at $E_{in} = \frac{1}{2}$.  This is the Hoeffding Inequality:
$$\begin{equation} p(E_{out} - E_{in} \ge \epsilon) \le \frac{1}{4}\exp(-2\epsilon^2N), \end{equation}$$ which states the performance of a model in $E_{out}$ and $E_{in}$ are exponentially close as $N$ increases.  Therefore, if a given hypothesis works well in training data set $\mathcal{D}$ given sufficient $N$, it is likely to do well in test set as well, and a poor-performing hypothesis in $\mathcal{D}$ is likely to perform poorly in test set.  All make sense.  Just remember this conclusion, if it is challenging to follow the math in this section.  Equation (4) here applies to a single hypothesis, does it apply to a model or a hypothesis set?

Hypothesis Set and Learning Algorithm


In reality, the hypothesis set always contains multiple hypotheses.  If we have only one hypothesis, nothing to learn.  Let us consider a hypothesis set of $M$ independent hypotheses $\{h_1, h_2, \ldots, h_M\}$ and pretend they all have about similar performance $E_{out}$.  Given $N$ training samples, $E_{in}$  and $E_{out}$ for each individual hypothesis follows the Hoeffding Inequality, i.e.:
$$\begin{equation} p(E_{out}(h_i) - E_{in}(h_i) \ge \epsilon) \le \frac{1}{4}\exp(-2\epsilon^2N), \end{equation}$$
Since we test the training set through $M$ hypotheses to pick a hypothesis $g$ that has minimum $E_{in}$ as our final answer:
$$\begin{equation}
E_{in}(g) = \min_{h \in \mathcal{H}}(E_{in}(h)),
\end{equation}$$
the probability for obtaining an impressive $E_{in}(g)$ that is $\epsilon$ away (smaller) from its true $E_{out}$ increases by $M$ times!  In another word, $E_{in}(g)$ tends to be an over-optimistic estimation of $E_{out}(g)$.

This is also known as multiple test problem in statistics and can be explained with a coin-tossing experiment.  Taking $M$ = 1024 fair coins as an example, we throw each of them $N$ = 10 times to estimate how often they give a head.  Although their true probability of head is always 50%, we expect on average one of the $M$ coins will give a ten-head-in-a-roll result.  If fact, the chance that we do observe ten-head-in-a-roll is $1-{\left(\frac{1023}{1024}\right)}^{1024} = 63\%$.  Here among the $M = 1024$ coin tossers, very likely one person found a 10-head coin.  If he thinks  that coin is very likely give more heads in the future, he makes a mistake, as the probability for the coin to be unbiased is nearly one.  Even though a ten-head-in-a-roll happens with very small chance 1/1024, if we only test one coin, the probability roughly grows by a factor of 1024, if we test 1024 coins.

Applies the coin-tossing analogy to our case, although each of our $M$ hypotheses only has a small probability that their $E_{in}$ and $E_{out}$ is too far off (BAD) according to Hoeffding Inequality Equation (5), as we check on large enough $M$, the likelihood of such BAD case is multiplied by $M$.
$$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \le \frac{M}{4}\cdot\exp(-2\epsilon^2N). \end{equation}$$
If $M$ is large enough, it offsets the exponential term in Equation (7), then $E_{in}(g)$ and $E_{out}(g)$ are way off, learning becomes infeasible.  Just like we cannot find a most head-occurring coin after 1024 trial series and claim it is a head-biased coin with high confidence, we cannot ignore the bias introduced by $M$.  Hoeffding Inequality only works for one pre-determined hypothesis, it does not work for the best performing hypothesis out of a big hypothesis set.  The practice of multiplying $M$ to correct the multiple test probability in Equation (8) is called Bonferroni correction, which we will see later, is a very conservative correction.

The necessity of Bonferroni correction in Equation (7) can be explained in another very clear manner [1].  Imagine we repeatedly sample many $N$-record data sets from the source, $\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \}$, and we apply a hypothesis to each of these data sets individually.  In figure 2.1, each row in the table represents a particular hypothesis and each column represents a data set sampled. We use the $BAD$ to flag the case, where a training set $\mathcal{D}_1$ gives hypothesis $h_1$ an over-optimistic $E_{in}$ (i.e., its $E_{in}$ no longer is a good estimate of its $E_{out}$), we flag the cell corresponding to $h_1$ and $\mathcal{D}_1$ as $BAD$ (Figure 2.1).  Hoeffding Inequality tells us for each given hypothesis in $\{h_1, h_2, \ldots, h_M\}$, i.e, for all cells in a row in Figure 2.1 except the last tally row, the portion of $BAD$ data set should be exponentially small as described by Equation (5).  $GOOD/BAD$ here does not refer to whether a hypothesis is accurate or not, $GOOD/BAD$ only refers whether $E_{out} \approx E_{in}$.  A random guessing model is always $GOOD$, because its performance in the within-sample data set and out-of-sample data sets are basically the same.
 
hypothesis$\mathcal{D}_1$$\mathcal{D}_2$$\mathcal{D}_3$$\ldots$$\mathcal{D}_n$$\ldots$
$h_1$$BAD$$\ldots$$\ldots$
$h_2$$BAD$$\ldots$$\ldots$
$\ldots$$\ldots$$\ldots$$\ldots$$\ldots$$\ldots$$\ldots$
$h_M$$BAD$$\ldots$$BAD$$\ldots$
$\text{any } h$$BAD$$BAD$$\ldots$$BAD$$\ldots$
Figure 2.1 Each hypothesis, one per row, contains exponentially few BAD cells according to Equation (5). The last row is the union of BAD for all previous $M$ rows, so it could have as many BAD as descried by Equation (8).  If we pick any one of them (the bottom row) as $g$, it is guaranteed to hit less than $M$ times the expected number of BAD cells in one row.

The reality is we are only given one training data set and we will pick one of the $M$ hypotheses as the final hypothesis $g$.  If the data set we see happens to be a column in Figure 2.1 containing at least one cell flagged as $BAD$, we have at least one hypothesis whose $E_{in}$ does not represents $E_{out}$ well, and there is always a non-zero chance this is the hypothesis picked by our learning algorithm as our final hypothesis $g$.  When this happens, the in-sample performance of our $g$ is too far from its real performance.  Only when the given $\mathcal{D}$ happens to be a column that contains zero $BAD$ cell, it is safe for us to pick any hypothesis as our answer $g$, as it contains no $BAD$ cell in that column, i.e., $E_{out}$ is always predictable by $E_{in}$.  How often the given data set $\mathcal{D}$ falls into a column containing zero $BAD$ cell?  In the worst case scenario, if each $BAD$ cell occupies a different column, so that the last tally row contains the maximum possible number of $BAD$ columns described by Equation (7), that is the maximum probability of our encountering a unlucky $\mathcal{D}$.  This is where Bonferroni-correction for Hoeffding Inequality comes from.

With Equation (7) in hand, we are heading for an immediate disaster.  Think about a popular model, a linear classifier, where $2{\rm sign}(\mathbf{w}^\intercal \mathbf{x})-1$ determines whether we predict the $y$ label of a data point to be $+1$ or $-1$.  Weight vector $\mathbf{w} \in \mathbb{R}^{d+1}$, therefore, there are infinite number of $\mathbf{w}$, i.e., our hypothesis set $\mathcal{H}$ is a continuous hypothesis space, $M \to \infty$.  Therefore $E_{in}$ and $E_{out}$ are no longer close according to Equation (7).  In another word, the performance of the best linear hypothesis within the training data set no longer reflects its performance in future new sample sets.  Machine cannot learn from data using linear models!  However, our gut feeling is linear model works well in many cases, what went wrong in our reasoning?

There are a number of arguments against $M$ being infinity.  Many hypotheses are extremely close or nearly equivalent in the continuous hypothesis space (Figure 2.2).  Therefore, infinite number of linear decision planes that are so close that they almost always give identical classification results for most $N$-point data sets, then they should not be counted as two independent hypothesis, as their $BAD$ cell patterns in Figure 2.1 are basically identical.  Thus we can expect the “effective” number of independent hypotheses for continuous models are not infinite and there is hope that $\frac{M}{4} \cdot \exp(-2\epsilon^2N)$ actually converges to zero as $N$ grows.  Such effective hypothesis count is solved by the VC dimension theory.
Figure 2.2. Two linear models $h_A$ and $h_B$ are very close in practice, as they give the same classification results for the majority of data sets.  The $BAD$ pattern for these two models in Figure 2.1 would be largely identical.  If there are tons of $BAD$ cells occupying the same column, the last tally row would not have $M$ times more $BAD$ cells than a single row.


VC Dimension


Let us consider a two-dimensional linear model for a two-class classification problem.  Given $N = 3$, we can always find a plane that correctly separates $+$ and $-$ data points, no matter how we label them; this is called we can shatter $N = 3$ (Figure 2.3).  Given $N$ = 4 data points, it is no longer the case, Figure 2.3 shows an example, where no linear hypothesis could have generated labels for all four points correctly.  So for $N \ge 4$, even if we exhaustively test all possible linear hypotheses in $\mathcal{H},$ we still could not generate all $2^N$ possible labels.  This implies if we look through the holes of $N$ data points, the total effective number of hypotheses in $\mathcal{H}$ is less than $2^N$.  For example, in Figure 2.2, we really do not care about the difference between two close models $h_A$ and $h_B$, as they produce nearly the same output for all data points in the problem.

Figure 2.3. Linear hypothesis set can generate all $2^3$ possible labels for 3 given data points.  But it cannot generate $2^4$ possible labels for 4 given data points.  Here is an example of 4 data points where no linear hypothesis can separate.  This means some linear hypotheses have to produce identical labels across $N$ data points as N grows.  This redundancy will trim down the effective number of hypotheses heavily.

When we count effective hypotheses through the holes of $N$ data points, we can see how a linear model can only produce approximately $N^2$ independent hypotheses without rigorous proof.  We can roughly pick any two data points and draw a line (a plane), there are about $\frac{N(N-1)}{2}$ such planes that can serve as the bases of independent hypotheses (label patterns).  We do not need to be exact here, the key is we have a sense that all decision planes can only produce $\sim O(N^2)$ distinct labels for an $N$-record data set.  Even if we consider $N$ samples we use for model training and another independent $N$ test samples for $E_{out}$ estimation, there are at most $(2N)^2$ possible unique label assignments to the $2N$ total data points.  Any two hypotheses share the same label output are effectively counted as only one hypothesis in the view of our training and test data sets, as they produce the same $BAD$ patterns, therefore should only contribute once in the Bonferroni correction.  As long as the effective number of hypotheses in $\mathcal{H}$ is polynomial on $N$ and not exponential on $N$, the Bonferroni-corrected Hoeffding Inequality in Equation (7) will eventually converge at large enough $N$.

When $N$ is small and models can shatter $N$ records, our hypothesis space could in principle contain $2^N$ different models, which could offset the exponential term in Equation (7) and the gap between $E_{in}$ and $E_{out}$ are not bound. But as $N$ grows further, $M$ becomes a polynomial term on $N$, the exponential term will eventually dominate and the problem becomes machine learnable. The maximum $N$ our model can shatter is called the VC dimension (Vapnik–Chervonenkis) of our model $d_{VC}$. Our new bound on $E_{out}$ is: $$\begin{equation} p(E_{out}(g) - E_{in}(g) \ge \epsilon) \le \frac{N^{d_{vc}}}{4} \exp(-2\epsilon^2N) \end{equation}$$
Let us work on an exercise to determine the VC dimension of the linear model space, where $\mathbf{w} = ( w_0, w_1, \ldots,  w_d)$, consisting of a total of $d+1$ model parameters.  Given $N=d+1$ training data points of arbitrary $\{ +1, -1\}$ labels, we are guaranteed (ignore singularity case) to find a $\mathbf{w}$ vector that can shatter them.  A $\mathbf{w}$ that can correctly predict all possible label assignments can be found by solving the following linear equation:
$$\begin{equation} X\mathbf{w} = \mathbf{y}. \end{equation}$$
$$\begin{equation} \mathbf{w} = (X^\intercal X)^{-1}X^\intercal \mathbf{y}. \end{equation}$$
So a linear model with $d+1$ parameters can shatter any $d+1$ points.

If we are given $d+2$ points, in general there are $d+2$ equations, when we expand $X\mathbf{w} = \mathbf{y}$ for $d+1$ unknown weight components.  We may first solve $\mathbf{w}$ using the first $d+1$ equations.  Then say our solution $\mathbf{w}$ predicts the $d+2$ data point as $+1$, the model cannot shatter the a data set when the label for the $d+2$ data point is actually $-1$, vice versa.  This proves the VC dimension for a linear classification model with $d+1$ parameters is $d+1$.  The bounds for the linear model therefore is:
$$\begin{equation} p(E_{out}(g) – E_{in}(g) \ge \epsilon) = \frac{N^{d+1}}{4} \exp(-\epsilon^2N). \end{equation}$$
Unlike the above linear model, for most machine learning models, their exact VC dimensions are very hard to find.  However, all we need to know is as long as the VC dimension for the model is not $N$ but a finite value $d_{vc}$, the effective model counts is some polynomial term on $N^{d_{vc}}$, the learning problem is feasible given sufficiently large number of $N$.  The conclusion from the linear model can nevertheless serve as a rule-of-thumb that VC dimension is roughly the degree of freedom, i.e., the number of independent model parameters.  This is just an empirical observation, not a theory.

Are there learning models with exponential number of effective hypotheses $2^N$, i.e., they can shatter any given $N$ data points? Certainly.  My favorite example is a cheating polynomial classification/regression model.  Given $N$ data points $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)$, we can locate a classification/regression hypothesis that exactly explains all $y$ value as the following:
$$\begin{equation} \begin{aligned} y(\mathbf{x}) = &y_1\frac{|\mathbf{x}-\mathbf{x}_2|\cdot|\mathbf{x}-\mathbf{x}_3| \cdots |\mathbf{x}-\mathbf{x}_N |}{|\mathbf{x}_1-\mathbf{x}_2 | \cdot |\mathbf{x}_1-\mathbf{x}_3 |\cdots |\mathbf{x}_1-\mathbf{x}_N|}+ \\ & y_2\frac{|\mathbf{x}-\mathbf{x}_1|\cdot|\mathbf{x}-\mathbf{x}_3|\cdots |\mathbf{x}-\mathbf{x}_N|}{|\mathbf{x}_2-\mathbf{x}_1|\cdot|\mathbf{x}_2-\mathbf{x}_3|\cdots |\mathbf{x}_2-\mathbf{x}_N|}+ \\ & \cdots \\ & y_n\frac{|\mathbf{x}-\mathbf{x}_1|\cdot|\mathbf{x}-\mathbf{x}_2|\cdots |\mathbf{x}-\mathbf{x}_{N-1}|}{|\mathbf{x}_N-\mathbf{x}_1|\cdot|\mathbf{x}_N-\mathbf{x}_2|\cdots |\mathbf{x}_N-\mathbf{x}_{N-1}|}. \end{aligned} \end{equation}$$
This model will fit any $N$ input data points exactly, no matter how we assign their labels/values, because it basically memorizes the answers for the training data.  Therefore the VC dimension for our model is $N$.  Then there is no guarantee the adjusted Hoeffding inequality Equation (7) will converge to zero at large $N$.  In fact we will expect this model cannot learn from data and it should have no predicting power.  Some of us may realize this model actually should be the exact solution of $\mathbf{y} = X\mathbf{w}$ using $(N-1)^{th}$ order polynomial model (same solution as Equation (10)), except we keep on increasing the model complexity $N$ as input data points grows.

For most learnable models where VC dimension is bound, with good confidence, we know:
$$\begin{equation} p(E_{out}(g) - E_{in}(g) \le \epsilon) \approx \frac{N^{d_{vc}}}{4} \exp(-2\epsilon^2N) \end{equation}.$$
Assign
$$\begin{equation} \delta = p(E_{out}(g) - E_{in}(g) \le \epsilon), \end{equation}$$
we get:
$$\begin{equation} \epsilon = \sqrt{\frac{1}{N}\ln \left(\frac{N^{d_{vc}}}{4\sigma}\right)}. \end{equation}$$
This can be simplified into:
$$\begin{equation}\begin{aligned} E_{out} &\approx E_{in} + \sqrt{\frac{1}{N}\ln \left(\frac{M}{4\sigma}\right)}\\
&\approx E_{in} + \sqrt{\frac{1}{N}\ln \left(\frac{N^{d_{vc}}}{4\sigma}\right)}\\
&\approx E_{in} + O\left(\sqrt{d_{vc}\frac{\ln N}{N}}\right). \end{aligned}\end{equation}$$
Our ultimate goal is to have a model that works well in $E_{out}$, therefore, instead of finding a $g$ in $\mathcal{H}$ that givens the minimum $E_{in}$, we should look for the $g$ that gives the overall minimum score for $E_{out}$ in Equation (16).  The first term in the equation is the in-sample error $E_{in}$, the second term is a generalization term related to VC dimension, called $E_{reg}$.  If we ignore $E_{reg}$ and only focus on minimizing $E_{in}$, typically the model containing more parameters, i.e., larger $M$, has the advantage of exploring a larger hypothesis space, therefore, can find a smaller $E_{in}$.  The models such as the previous cheating polynomial model would be a winner (its $E_{in} = 0$), but then its marvelous success in $E_{in}$ cannot be extrapolated into $E_{out}$, the model sounds too misleadingly good in the sample data set, it only fail miserably in future test sets.

Figure 2.4 is probably the most important figure in ML [2].  It says if our model is too simple, it does not have enough power to describe training data well, therefore $E_{in}$ (orange arrow in Figure 2.4a) would be high and $\epsilon$ would be low (Equation (16)), then $E_{out}$ would be high and not useful.  However, if we adopt a rather complex model with large degrees of freedom in fitting, we can approach training data really well and $E_{in}$ is very small (blue arrow in Figure 2.4a), however, we lost the generalization capability ($\epsilon$ is huge), the second term in Equation (16) blows up, the $E_{in}$ would be far from $E_{out}$ due to “over-fitting” (as we test too many hypothesis and $g$ is extremely likely to be the one happens to have good $E_{in}$ by chance, but will perform poorly in $E_{out}$).  $E_{out}$ is depicted as upper bound of the confidence interval of the model error rate in Figure 2.4a.  This conclusion is also summarized in Figure 2.4b, where we expect there is an optimal model complexity for a given learning problem, neither too simple nor too complex.  The ideal model is the one minimizing Equation (16), not the one minimizing $E_{in}$ alone.

Figure 2.4.  The out-of-sample error $E_{out}$ consists of in-sample error $E_{in}$ and a regularization error $E_{reg}$.  For over-simplified models, $E_{in}$ is large $E_{reg}$ is small, this is called under-fitting.  For over-complicated models, $E_{in}$ is small but $E_{reg}$ is large, this is called over-fitting.  Neither is good, there is an optimal model complexity that balance the two terms and gives the optimal $E_{out}$.


There are book examples indeed minimize Equation (16), when $d_{vc}$ is known, although I cannot recall the reference to that example at this point. In practice, we do not use Equation (16) as our cost function, because $d_{vc}$ is often unknown.  More importantly, VC theory only estimates a lose bound, i.e., the generalization term in Equation (16) is far from being accurate.  The VC bound is extremely lose, as we have made huge approximations in order to get to this simple math form.  The real effective model counts are typically way less than $N^{d_{vc}}$.  E.g., according to VC theory, given a 2-$d$ linear model with $d_{vc}$=3, we would like the generalization error $\epsilon$ be at at most 0.1 with confidence 90% (i.e., $p \le 0.1)$, $N$ would be at about 30,000  according to Equation (16) [3].  In fact, experience shows “requiring $N$ be at least $\mathbb{10} \times d_{vc}$ to get decent generalization is a popular rule of thumb” [4].  Another example to convince us VC is an upper bound is the 1-nearest neighbor model.  This model can also correctly predict any given $N$ data points, therefore, it can shatter any $N$ points, with a VC dimension of $N$.   However, we know in fact 1-NN model works quite well in practice.  In fact it can be proven that for dense-enough training data, 1-NN has a performance at about half of what is achievable by the best possible model.  If the best possible classification model has an accuracy of 98%, 1-NN can archive an accuracy of 96% for a smooth target function![5].  So considering all the above arguments, we cannot come up with an analytical expression to capture generalization error, we need an alternative empirical approach to estimate $E_{reg}$.

Regularization


There are two lessons we learn from the VC theory: first, machine learning is feasible, as we do not really deal with a hypothesis set of infinity size.  Second, generalization error increases as model complexity increases, therefore, the more we can constrain the effective hypothesis counts in $\mathcal{H}$, the smaller the regularization error will be.  Along this line, if we consider a second-order polynomial regression model:
$$\begin{equation} y = w_0 + w_1x_1 + w_2x_2+ \ldots + w_nx_n + w_{11}x_1^2 + w_{12} x_1x_2 + \ldots + w_{nn}x_n^2. \end{equation}$$
This is a super hypothesis set of the linear model space, therefore, the VC dimension of this polynomial model is larger than linear model and $E_{reg}$ is also larger.  To reduce our model space, we either resort to a linear model, or we can stay within the second order polynomial model space but instead trim it down.  For polynomial models, we can archive this by constraining weight vector ${\Vert \mathbf{w} \Vert}^2$.  We therefore can minimize Equation (16) by the following proxy form:
$$\begin{equation} E_{out} \approx E_{in} + \lambda {\Vert \mathbf{w} \Vert}^2.\end{equation}$$
$\lambda$ is a parameter that we assume will allow us to best approximate the real generalization error.  This form of regularization is called ridge regression (or L2).  You can invent many other regularization forms, e.g., $\lambda \sum_{i} \vert \mathbf{w}_i \vert$, which is called Lasso regression (or L1).  For polynomial regression model $y= w_0 + w_1x +w_2x^2 + w_3x^3 + \ldots + w_qx^q$, we may also use $\lambda \sum_{i}\; e^q{(w_q)^2}$, so that weights for higher-order terms get penalized more, etc (Figure 2.5).   A mixture of Lasso and ridge is call elastic net.  In most cases, we do not know what the real analytical expression is for the regularization error function, but it does not matter too much, as we just use this term to adjust the size of our effective hypothesis space, so that we reach an optimal balance between $E_{in}$ minimization and $E_{reg}$ minimization.
Figure 2.5. Adding a regularization penalty on $\mathbf{w}$ effectively reduce the hypothesis space.  The schematic drawing outlines the original $\mathbf{w}$ space (first) and the reduced space under ridge regression (second), Lasso regression (third), and other forms.

In a linear model, $w_0$ is often used to serve as the data normalization term, e.g., if ${\rm mean}(\mathbf{y}) \neq 0$ (or $\mathbf{x}$ is not zero-centered) in a regression problem, this bias in input data can be absorbed by $w_0$.  Or if we shift $y$ by a constant, $w_0$ can offset that.   If we penalize $w_0$, the final hypothesis will be forced to be close to the origin.  Therefore, in most cases, we do not penalize $w_0$ and regularization error term should be modified as:
$$\begin{equation} \lambda {\Vert \tilde{\mathbf{w}} \Vert}^2 =  \lambda \sum_{i=1}^d \; w_i^2 \end{equation}.$$
Lasso and other regularization terms should also be modified accordingly.

Figure 2.6 is an example showing if we choose the right $\lambda$, we can significantly lower $E_{out}$, compared to the model where we only try to minimize $E_{in}$.  Regularization term is a necessity for all learning models!  An immediate question follows, how do we choose the hyper-parameter $\lambda$, because $E_{out}$ is only known in the future? We will answer this question in the next Chapter.  For now, we just need to understand if $\lambda$ is too small, we are not considering regularization error enough, we tend to produce models that are too complex and therefore over-fitting.  If $\lambda$ is too large, we worry too much about over-fitting, therefore resort to models that are too simple to explain the data (Figure 2.6).  But over-fitting is often more damaging than under-fitting.


Figure 2.6. The target function is a second order polynomial with Gaussian noise for $x \in [0, 1]$. Our input data set contains 8 points, with true function shown as the black curve on the right.  Our hypothesis space is $5^{th}$ order polynomial models.  The left plot shows how $E_{out}$ changes as $\lambda$ changes from $10^{-4}$ to $10^{4}$.  With $\lambda = 10^{-4}$, we obtain the red hypothesis, which tries to fit the noise in the input data points, but is certainly an over-fit.  With $\lambda = 10^{4}$, we obtain the blue hypothesis, which basically a flat line, does not capture the real trend.  The best hypothesis is obtained when $\lambda$ is around 1, where the corresponding green hypothesis is reasonably close to the true target function.  Therefore, introducing the regularization term could help us reach an overall better hypothesis.  Note: I made a mistake in creating this example, where my regularization term included $\mathbf{w}_0$, therefore the blue hypothesis is too low.  If we had used the correct regularization, the blue line should have gone through the mean of data points.

Where Does Over-fitting Come From?


It is generally reasonably straightforward to think over-fitting occurs when our model is more complicated than the true target.  If the true target function $f(y \vert \mathbf{x})$ is a second order polynomial, fitting it with a model of 5th order polynomial without regularization will produce over-fitting.  This is what we saw in Figure 2.6.

But what if our model captures the correct model?  We generate data points from a 5th order polynomial function with noise (Figure 2.7, left).  We either fit it with a 5th order polynomial or a 2nd order polynomial by minimizing $E_{in}$.  We then plot $E_{out}$ as a function of $N$ (Figure 2.7, right).  We see the 2nd order model actually performs better for $N  \le 10$.  Only for $N \ge 10$, the $5^{th}$ order model starts to out-perform.
Figure 2.7.  Even though the target function is $5^{th}$ order.  Due to noise, $2^{nd}$ order model actually performs better for small $N \le 10$.
What happened? The data contains noise.  If we are given enough data points, we would be able to tell what is the real trend (lower frequency trends) and what is noise (higher frequency patterns).  However, if there are not enough data points for the 5th order polynomial model to tell noise from real trend, it uses the extra power to model the noise and was misled.  Although the true answer is contained in $\mathcal H$, it is unable to locate the true model, due to the noise.  The 2nd order model ignores both high-frequency noise and higher-frequency trends, it only tries to capture the lower-frequency trends.  Yes, its best hypothesis is still quite some distance away from the truth, but it is overall closer to the truth.  Given sufficient number of training data ($N \ge 10$), the variance from a complex model is reduced and it is able to produce a solution closer to the truth and eventually win.

Over-fitting can also occur even when there is no noise.   Figure 2.8 is an experiment where the true model is 10th order polynomial without noise (black), but we only have 8 data points.  We know if we fit the data with $10^{th}$ order model, it will be under-determined, so we will not do that.  But even if we fit it with a 5th order polynomial (red), it can still be worse compared to a simpler 2nd order polynomial (blue).  Both examples show that over-fitting is not because our model is inaccurate or because the data contains noise, it is a disease of insufficient data considering the model complexity.  The rule of thumb is we should have at least 10 times more data points than the number of model parameters to avoid over-fitting.  If $N$ is insufficient, we should avoid the complex model, even if we know it is the true form of $f$!  Over-fitting is an important topic, please read LFD for more examples [5].
Figure 2.8. The target function is a $10^{th}$ order polynomial, however, we only have 8 data points.  The $5^{th}$ order model (red) over-fits, not because the data is noisy, but because there are too few data points.  The $2^{nd}$ order polynomial overall is more reliable for this example.


The Learning Curve


The learning curve describes how $E_{in}$ and $E_{out}$ behaves when $N$ increases.  First, $E_{out}$ is always greater than $E_{in}$ due to an extra term $E_{reg}$ in Equation (16). Second, $E_{in}$ generally increases as $N$ increases, however, plateau pretty rapidly.  Third, the gap between $E_{out}$ and $E_{in}$ decreases as $N$ grows, i.e., $E_{in}$ and $E_{out}$ converges to the same value.  Therefore, $E_{out}$ generally decreases as $N$ increases, it will eventually plateau as well.  The learning curve is very useful in practice, especially when you have influence on the sample size.  Given $N$ samples, we can apply our model to $m$ data points, then $m+1$ data points, then $m+2$ data points, and use the remaining data points to estimate $E_{out}$ and plot out the learning curve, i.e., $E_{out}$ as a function of $N$.  If we see $E_{out}$ is still on the significant slope of decreasing, when we use up the majority of $N$ samples for training, it implies the model will benefit significantly, if we are able to collect more samples.  If we already see $E_{out}$ plateaus, it means we already have sufficient number of samples for our model, there might be rooms to increase the model complexity to reduce $E_{in}$ further.

Let us discuss the behavior of $E_{in}$ and $E_{out}$ a bit further.  The larger the $N$, typically the larger $E_{in}$ becomes.  Think about a $(d+1)$-parameter linear model, it can perfectly classify $N \le d+1$ data points, can fit $d+2$ data points quite well, the fitting quality slightly decay, when we need to fit more points.  However, $E_{in}$ will soon converge and would hardly increase further as it tries to fit more and more data points.  On the other hand, the regularization term continuous to decrease as $N$ increases.  Therefore, we expect $E_{out}$, a combined result, to decrease as $N$ increases.  The learning curve behavior is summarized in Figure 2.9 based on LFD [6].
Figure 2.9 The learning curve for simple and complex models.  When $N$ is small, simple model wins due to smaller $E_{reg}$.  For a complex model, learning curve can tell us whether the model will significantly benefit from larger $N$.

Reference

  1. https://www.youtube.com/watch?v=MFL6xDn1lXM
  2. LFD, page 59
  3. LFD, page 57
  4. LFD, proof of 1-NN performance in e-Chapter: 6-6
  5. LFD, page 120-121, 123-125.
  6. LFD, page 67

Monday, May 22, 2017

Chapter 1. What is Machine Learning?

Introduction


There is no question that we are living in the middle of a data revolution and data science is reshaping our way of living and our way of working.   According to the Gartner hype cycle analysis (Figure 1.1) [1], machine learning is at about the apex of its life cycle.  This is a particularly important time for data scientists to go beyond the face value of various machine learning tools and understand the core components that are likely to last through whatever bumps ahead till the eventual maturity of this field.  

Figure 1.1.  Gartner Hype Cycle for Emerging Technologies, where Machine Learning is considered to be at its apex of hype cycle.

Remember the early days of the Internet hype cycle, nearly any web company was able to attract loads of venture capital funds.  If a programming worker allowed himself to think job requirements in the Internet section was all about learning HTML and web site design, or was about PHP scripting, or about email spamming and anti-spamming, his job for sure did not last long.  Which technologies developed during the hype cycle will survive the test of time are extremely hard to predict, e.g., most of us did not expect JavaScript to come out as a winner of web programming.  Fortunately, despite which Internet programming technology won and which lost, those who received solid computer science training had no problem adapting to new Internet-based programming landscape, because the science behind web programming remains largely unchanged in the past decade.  Our heavy reliance on expertise in data structure, computer algorithms, and parallel computing theories by no means lessen a bit.  These core foundation theories will remain strong for the coming future.  Along the same logic, the best way to prepare for the data science era is to learn the “science” within data science, as science is definitely the part that will stay and it will eventually determine how fast one can navigate the turbulent seas of data science.  As Michael Nielsen elegantly put in the beginning of his open book on neural network: "although learning through using some hot software libraries has an immediate problem-solving payoff, we want durable and lasting insights that will still be relevant years from now. Technologies come and technologies go, but insight is forever." (rephrased here) [2]

This note is not about the usage of specific technologies or tools, such as deep learning, support-vector machines, random forest, etc.; it is about why machine learning is feasible and what principles data scientists should follow in order to enable our machines successfully “learn from data”. That is we focus on theory not on technology.

It is also important to understand that this is a note about “machine learning”, not about “artificial intelligence” or “deep learning”.  All three are buzz words these days, together with other high-frequency confusing terms, such as “big data” and “cloud computing”.  Nobody can clearly define these terms and citing politically-correct Wikipedia definitions does not really help, so we do not attempt to define them here. However, we should be able to illustrate our own interpretation that can reduce, if not eliminate, ambiguity when we exchange machine learning ideas among ourselves.  NVIDIA has a nice blog article explaining the difference among the three concepts well [3] (Figure 1.2), which is in line with the explanation given by some well-known professionals [4].

Figure 1.2. NVIDIA’s explanation of AI, ML and DL.  The example for AI is chess playing.  IBM’s Deep Blue defeated human champion Kasparov in 1997.  The example for ML is email SPAM detection, where Naïve Bayesian classifier is often used to classify emails into SPAM (junk emails) and HAM (valuable emails).  The example for DL is Google’s accomplishment in detecting cats from video with 75% accuracy back in 2012, a project Andrew Ng participated.  Andrew Ng is well known for his extremely popular online course "Machine Learning" on Coursera.

Artificial intelligence (AI) emphasizes on machines acquiring human-like intelligence, machines that can pass Turing test (such as the movie Ex Machina and R2D2 in Star Wars) (Figure 1.3).  Although this is not totally science fiction, it currently remains largely a future tense.  Many impressive artificial intelligence examples may not even be driven by the machine learning (ML) algorithms that we typically practice, a possibly valid example is achieving AI through knowledge base, based on rules discovered by human not by machine itself.  One example is Mathematica software can do extremely impressive symbolic computations, including difficult calculus (Figure 1.4), however, it archives this by following mathematical rules coded by us, and those rules were not discovered by machine through learning from many calculus examples [5].  So AI is bigger than machine learning (ML).  AI is often reserved to specifically emphasize "intelligence", thus we normally do not call ML techniques such as one-nearest neighbor an AI, as it simply performs similarity lookup.  While AI is largely the future, ML is a reality, a sciency framework has emerged after decades of research work and that is why we can now use the word “Data Science”, which mostly refers to the science of ML.

Figure 1.3 Human-like AI robot in movie Ex Machina.  R2D2 in Star Wars.
Figure 1.4. Although Mathematica can do very impressive symbolic operations, it is doing so based on existing mathematical rules, not based on strategies discovered by itself through learning lots of examples, therefore, it achieves AI not through ML.

Over the past decades, many ML techniques have contributed to the development of the learning theory. Such techniques include neural network, random forest, support vector-machines, and deep learning, etc.   This is why we say Deep Learning (DL) is one of the many ML techniques out there, it is a subset or a special case of ML.  DL $\subset$ ML $\subset$ AI.  However, DL is a very powerful one, because it has out-performed other ML techniques in many applications, especially those where traditional ML techniques failed miserably.  DL has rekindled new hopes that ML can match or even surpass human intelligence in some areas in the near future.  DL is one of the several key technologies behind AlphaGo.  The defeat of Lee Se-dol to AlphaGo caught most of us as a big surprise, but the defeat of “Master” (AlphaGo 2.0) over Ke Jie in late May was a highly anticipated.  As DL is taking the lead in some most impactful AI applications, such as Tesla’s self-driving cars, no wonder for journalists, DL is everything, it is AI, it is ML.  All three terms are being used interchangeably as most news in AI are about DL, which is the most shining piece of ML techniques.

Is Machine Learning Really a Hype?


DL sends us recent AI shock waves and gives people an over-optimistic feeling that AI is becoming a reality.  This explains why Gartner placed ML at the top of the hype cycle.  Gartner here most likely mistakenly equates ML to DL and to AI.  The recent success of DL in AlphaGo and Tesla's self-driving system sets a high expectation on AI.  On one hand, thanks to DL, AI has never been so close to our life.  General population hopes the success of AlphaGo can be easily replicated to other AI domains; and all the heated debates on the sociological impact of AI makes AI appear to be something ready for prime time at any moment.  This over-optimistic view of AI is basically the root of the hype. If ML is a hype, it is actually the AI hype.  On the other hand, even among professionals, the success of DL creates a misconception that DL is the best ML technology and good for all ML and AI application.  We will discuss in Chapter 6 why this is not true.  If ML is a hype, it is actually the DL hype.

For the ML itself, as we will learn in this note, is a matured science.  ML has solid theories, some will be discussed in this note, it is the science of data science.  In the past two decades, ML has numerous proven applications in many research fields, including computer vision, bioinformatics, cheminformatics.  It is still on a solid path of further expansion.  So for the ML we understand, it is many things but a hype.  What we will learn about ML here are going to last!

Figure 1.5.  For data scientists, ML is a matured field (left).  For others, ML is often mistakenly equated to DL or AI (right).  DL is still under heavy development and AI largely remains a dream.  When the expectation on DL and AI is high, hype arises.

What is Learning?


When a bank-employed programmer writes a credit card review application to only approve applicants with salary higher than $30k, that computer program only “automates” the salary-checking rule without learning.  The approval criterion, which is the intelligent piece, is provided by human.  Computer automation has improved our efficiency and eliminated many jobs, such as cashiers, where not much intelligence is required.  Many jobs were unaffected as computers became wide spread, as they require human intelligence and automation does not warrant intelligence.  If the bank instead feeds a computer program 1000 past approved applications made by its best money-making credit reviewer, together with the outcome whether bank made or lost money from those approvals, the program then somehow updates its internal parameters and be able to review future applications and make decisions in a way very similar to that star reviewer, learning definitely happened in this example.  Based on this example, ML is machines develop code without rules given by human.  Code development is driven by some given examples.  The criterion for a successful ML is its prediction for future input data (Figure 1.6).
Figure 1.6.  Machine Learning is a machine capable of learning from training data by itself.  The resultant machine is able to predict unseen future test data.

When computer (machine) can learn and gain some intelligence, it will reshape the job marker in a much more impactful manner.  In this example, not only the bank can review more applications faster and more accurately, the machine will take over the job of mediocre-performing credit reviewers in the bank and generally in the banking industry, I believe this is probably already true.  Many jobs we are performing nowadays requires only marginal intelligence, therefore, data science/ML will significantly reshape the job market.  According to McKinsey’s analysis, “currently demonstrated technologies could automate 45 percent of the activities people are paid to perform and that about 60 percent of all occupations could see 30 percent or more of their constituent activities automated, again with technologies available today” [6].

The credit card approval example is a classic textbook machine learning problem called classification (classify an application into approve or deny actions) and this is the type of problems we will use to explain the ML theories.  In fact we will heavily study such two-label classification problem most of the time.  Such classification problem might appear special, it is nevertheless the most-encountered one in real applications.  Even self-driving cars have to carry out many classification decisions, such as whether the object ahead is a pedestrian, whether the object on the side is a road sign, then is it a stop sign or a speed-limit sign, etc.  Often times a classification task trivial to a human may become ridiculously difficult for machines, e.g., whether the object on the road ahead is a rock or a brown paper bag![7]  So do not underestimate the challenge ahead.

Learning Resource


This note is heavily based on a few resource I found extremely useful during my “human” learning of the “machine learning” subject.  In particular, this note should be credited to the book “Learning from Data”, including the online Caltech lecture taught by one of the book author Yaser Abu-Mostafa.  The selection of topics and the clarity of explanations are just superb in every way.  The online lecture “Machine Learning Foundations” taught by another co-author Hsuan-Tien Lin is more elaborate and enlightening (in Chinese though).  Since we may cite various book/video source, let us list a few of them and label them with acronyms:

The books are:

I recommend DSB as an introductory book; it helps orient yourself with some ML concepts and popular ML techniques.  KDS is an intermediate book, with a bit more math but not too heavy, it uses SVM to touch the surface of ML theories, but not as systematic as LFD.  LFD is an excellent book, where this note is based on.  LFD may be a bit heavy on math in some places, however, the online videos help a lot.  DLB is destined to be a classic book in DL, I am still in the process of reading and might quote some small parts.

The two online videos are:

The Learning Problem


Let us formulate the machine learning problem to be studied in this note [8].

Figure 1.7 Formulation of the learning problem.
The data source generates data records as $(\mathbf{x}, y)$ according a target function $f(y \vert \mathbf{x})$, let us consider the simple case where $\mathbf{x}$ is a $d$-dimensional vector $(x_1, x_2, \ldots, x_d)$, i.e., $\mathbf{x} \in \mathbb{R}^d$, and $y$ in this note is either a binary class variable $\{-1, +1\}$ (for classification problem), or a real number $\mathbb{R}$ (for regression problem).  $\mathbf{x}$ follows an unknown but fixed distribution $p(\mathbf{x})$, from which both the training data set and the test data set are sampled.  The exact form of $p(\mathbf{x})$ does not matter, as long as training and test samples share the same $p(\mathbf{x})$.  The hidden process samples $N$ training data points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ according to $p(\mathbf{x})$, then it generates output $y_1, y_2, \ldots, y_N$ according to the target function $f(y \vert \mathbf{x})$.  That is the training data set $\mathcal{D}$ contains $N$ examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$.  The machine is supposed to use these $N$ input points to make its best guess on $f$, how?

Before seeing any training data, computer decides to model the unknown $f$ with a collection of hypotheses $\mathcal{H}$, e.g., $\mathcal{H}$ can contains all possible linear class-separation hyper planes as hypothesis candidates: $y = w_0 + w_1x_1 + w_2x_2 + \ldots + w_dx_d$.  Each hypothesis here contains $d+1$ parameters: $(w_0, w_1, \ldots, w_d)$, and each $\mathbf{w}$ vector corresponds to one hypothesis $h_{\mathbf{w}}$ in $\mathcal{H}$.  $\mathcal{H}$ here contains infinite number of hypotheses.  Given the $N$ input points, we applied a learning algorithm $\mathcal{A}$, which will pick one hypothesis $g$ within $\mathcal{H}$ to be the best guess of $f$ (picking $g$ basically is picking a specific value of $\mathbf{w}$).  $g$ typically is identified by minimizing a function including an error measurement $e(\mathcal{D}, h)$ using some optimization algorithm.  How well our $g$ truly models $f$?  It should not be determined based on the error measure within $\mathcal{D}$, but has to be determined by its future application.  Otherwise, a $g$ memorizing the $N$ answers would be the best one.  To evaluate $g$, we sample an independent test data set $\mathcal{D}^\prime$ that the computer has never seen, in which the error of $g$ mimicking $f$ is measured.  Notice human knowledge is incorporated in the choice of both $\mathcal{H}$ and $\mathcal{A}$, although we do not define $g$.  The exact $g$ is learned based on $\mathcal{D}$ without human intervention, i.e., the best model is learn from data, not provided by human, that is where learning happens.

A few notations used in this note. $\mathbf{x}$ refers a $d$-dimensional input data point.  We typically add $1$ as its first element for the convenient of linear modeling.  $X$ is the matrix for all $N$ input data points.  $\mathbf{y}$ is the corresponding response variable for the $N$ data points.
$$\begin{equation}
\mathbf{x} = \begin{bmatrix} 1 \; x_1 \; x_2 \; \ldots \; x_d \end{bmatrix}^\intercal \end{equation}$$
$$\begin{equation}
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
1 & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \\
\end{bmatrix}
\end{equation}$$
$$\begin{equation}
X = [ \mathbf{x}_1 \; \mathbf{x}_2 \; \ldots \; \mathbf{x}_N \; ]^\intercal
\end{equation}$$
$$\begin{equation}
\mathbf{y} = \begin{bmatrix} y_1 \; y_2 \; \ldots \; y_N \end{bmatrix}^\intercal
\end{equation}$$

Let us further assume each feature dimension, except the first dummy dimension $x_0 = 1$, has already been standardized, that is they are all zero-centered and scale.  This assumpt will help us understand popular regularization forms to be introduced in Chapter 2:
$$\begin{equation} \sum_{i}^N \; {x}_{ij}=0, \; {\rm for } \; j = 1,2,\ldots ,d, \end{equation}$$
$$\begin{equation} \frac{1}{N} \sum_{i=1}^{N} \; {({x}_{ij}-\bar{x}_{j})}^2=1. \end{equation}$$

In this note, we will study different hypothesis sets $\mathcal{H}$, containing linear and non-linear hypotheses alone or even in combine.  We will study different learning algorithms $\mathcal{A}$, including one-step analytical solution and multi-step gradient descent optimization.  The error measure could be accuracy (for classification problem), cross-entropy error (for classification with probability), or least square error (for regression problem).  Despite the fact that we only study the specific machine learning problem depicted in Figure 1.7, the learning workflow contains enough variations, that it can lead to a very rich manifestation of different machine learning applications.

Reference

  1. http://www.gartner.com/newsroom/id/3412017
  2. http://neuralnetworksanddeeplearning.com/about.html
  3. https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
  4. Deep learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville. http://www.deeplearningbook.org/. p9 Figure 1.4. 
  5. https://www.quora.com/How-does-Mathematica-work
  6. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/where-machines-could-replace-humans-and-where-they-cant-yet
  7. https://www.youtube.com/watch?v=40riCqvRoMs
  8. LFD, page 30.




Friday, May 12, 2017

Data Scientists' Notes on Machine Learning

Note: I am currently work on some notes on biological image analysis (April 19, 2019).

The goal of this note is to introduce the “science” of data science.  It explains the core machine learning theories for data scientists who previously learned some machine learning techniques.  Understand the origins of machines learning concepts such as over-fitting, regularization and cross validation prepares data scientists to quickly master newly emerging machine learning techniques, as their theoretical foundations likely remain unchanged.  A grasp of the learning theory also enables data scientists to evaluate, modify and invent appropriate new machine learning techniques as needed, with confidence.

Part I.  Machine Learning



The note contains six chapters.  The first chapter explains the basic concepts of machine learning, followed by the core learning theory in chapter 2 & 3.  Chapter 4 & 5 focus on key machine learning techniques, i.e., application of the learning theories.  Chapter 6 clarifies some additional concepts and provides a few areas for future learning.

The goals of each chapter is outlined as the following:


Chapter 1. What is Machine Learning?

  • Introduction
  • Is Machine Learning Really a Hype?
  • What is Learning?
  • Learning Resource
  • The Learning Problem
This chapter explains what machine learning is and formulate the learning problem we are going to discuss in this note.  We try to clarify a few confusing concepts: artificial intelligence, machine learning and deep learning.  The chapter provides pointers to valuable learning resource.

Chapter 2. Learning Theory


  • Evaluating a Single Hypothesis - Hoeffding Inequality
  • Hypothesis Set and Learning Algorithm
  • VC Dimension
  • Where Does Over-fitting Come From?
  • Regularization
  • The Learning Curve

This chapter explains the core of the machine learning theory from frequentists' point of view - why machines can learn.  It introduces the intuition behind the VC theory to explain why our typical models only contain finite number of effective hypotheses.  The multiple-test problem is originated from the multiple hypotheses we use in the learning, which directly leads to a regularization error.  It is important that we do not just minimize in-sample-error, which will lead to over-fitting, instead the best model should be identified by minimizing the sum of in-sample error and regularization error.

Chapter 3. Regularization and Cross Validation


  • Bayesian Interpretation of Regularization
  • Bias and Variance
  • Learning Theory Review
  • Unbiased $E_{out}$ Estimation with Leave-One-Out Cross Validation
  • Hyper-parameter Selection and Model Selection

This chapter first completes the learning theory, it continues to emphasize the necessity of the regularization error term.  Model hyper-parameters, such as the $\lambda$ within the regularization error term, should be determined by using a validation data set.  To make the most effective usage of our limited training data set, Leave-One-Out Cross Validation scheme is used to provide the closest unbiased performance estimation for our final hypothesis.  In practice, we may need to use two separate workflows, one to identify the best hypothesis, another to estimate its performance.

Chapter 4. Linear Models


  • Linear Classification - Accuracy
  • Logistic Regression - Cross Entropy Lost Function
  • Linear Regression - Maximum Likelihood
  • Feature Selection Capability of Linear Models
  • Non-linear Classifier
  • Regularization Again
This chapter discusses three popular applications of linear models, for classification, for probabilistic classification, and for regression.  We discuss why linear model has built-in feature selection capability and how it provides a general framework that can be extended into non-linear models through a mapping function.

Extra: Multiclass Classifier


Chapter 5. Support Vector Machines


  • Maximum Margin Linear Classifier
  • SVM Engine
  • Cross Validation of SVM
  • Non-linear SVM and Kernel
  • RBF Kernel - One Kernel for All
  • RBF Kernel for Higher-dimensional Input
  • Classification with Probability
This chapter discusses how SVM works.  We will analyze the popular KBF kernel in great detail. Through the construction of the infinite linear space for the KBF kernel, we explain why KBF kernel can model both linear models and polynomial models and why this is one kernel for all.

Chapter 6. The Path Forward


  • Review
  • Bayesian and Graphical Models
  • Deep Learning and Feature Learning
  • Data Engineering: Big Data and Cloud
  • Summary
This chapter outlines the ML topics that are not covered by this Note, in particular, it explains three different paths for further development: Bayesian graphical model, deep learning and data engineering.  We also try to explain why three confusing concepts, data engineering, big data and Cloud, tend to go hand-in-hand.



Part II.  Deep Learning



To a large extent, the recent data science revolution is the deep learning revolution.  Deep Learning offers a few frameworks that can be applied to very complicated problems such as image classification, language translation, gaming, etc.  Most traditional machine learning techniques are either inapplicable to such problems or perform poorly.

We will gradually cover the basics in this new field.  Instead of discussing the implementation techniques (e.g., TensorFlow), we focus on explaining the concepts related to why and how.  Due to the immaturity of its theoretical foundation, the so-called explanations are often people's or my own "arguments".  Be warned that I am just a new learner, my opinions could be quite wrong.  Nevertheless, I hope my gut feelings can help it look a bit less mysterious.


Chapter 7.  Introduction to Deep Learning


  • Resource
  • Feature Learning
  • GO Game as a Function
  • Deep Learning Implementation
  • A Blackbox DNN Solution

This chapter explains why deep learning is also known as feature learning.  Deep learning is basically a function approximation technique, in fact, games as complicated as GO can be modeled as a function.  We outline some implementation details and propose one can write a generic blackbox deep learning solution that is capable of solving traditional machine learning problems.

Bioinformatics Application: Gene Expression Inference with Deep Learning.

Chapter 8.  Convolutional Neural Network (CNN)


  • Introduction
  • CNN for Face Recognition
  • VGGNet for ImageNet
  • Miscellaneous Topics

This chapter explains how convolutional neural network (CNN) constructs a hierarchy of features recognizing spatial patterns starting from simple one (such as line or corners) to more complex ones (such as eyes, wheels), which addresses the translation-invariant requirement in computer vision.  The concept of different CNN neurons responding to individually specific patterns is very interesting, as it seems to shed lights on how our biological vision recognition system might function.

Bioinformatics Application: Motif Discovery for DNA- and RNA-binding Proteins by Deep Learning.


Chapter 9. Recurrent Neural Network (RNN)



  • Introduction
  • Architecture
  • Memory
  • Running Average and Derivative
  • RNN Applications
  • LSTM


This chapter explains how to handle variable-length input sequences, where the response of a system does not only just depends on the current input, but is also influenced by the historical inputs.  The new RNN system can maintain a state vector encoding the trajectory of all past inputs.  The memory capability enables RNN to handle many of the most exciting deep learning applications such as translation, video classification, sentiment analysis, etc.

Bioinformatics Application: LSTM Networks for Predicting Subcellular Localization of Proteins

Wednesday, May 10, 2017

Test formula

https://www.latex-tutorial.com/tutorials/beginners/latex-amsmath/

https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

$\frac{5ab-4x^2-\frac{1}{2}\sqrt{x}}{-\frac{2}{3}y+\frac{5}{6}} = \frac{x^2 - \frac{1}{3}x - 3\frac{ab}{s}}{3x - \frac{4}{5}y+2}$

$K_D = \frac{[A_{total}]-[AB]}{[AB]}\cdot [B]$
$\frac{[AB]}{[A_{total}]} = \frac{1}{1+\frac{K_D}{[B]}}$


$1 - (1-p)^N \sim Np$
$E_{out} = E_{in} + E_{reg}$ $\lambda$ $g$  $g(\mathbf{w^*} \vert_{\lambda^*})$
${\rm sign}(\mathbf{w_n}^\intercal \mathbf{x}_2) = -y_2$
$f(\mathbf{x}_1) = f(\mathbf{x}_0) + \nabla f(\mathbf{x}_0)(\mathbf{x}_1-\mathbf{x}_0)$
$f(\mathbf{x}_1) - f(\mathbf{x}_0) = - \eta \nabla f(\mathbf{x}_0)\nabla f(\mathbf{x}_0)$
$\nabla f(\mathbf{x}) = \sum_{i=1}^N \; \nabla f(\mathbf{x}_i)$
$N \to \infty$
$y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d + w_{11}x_1^2 + w_{12}x_1x_2 + \cdots + w_{(d-1)d}x_{d-1}x_d + w_{dd}x_d^2$
$\sum_{i=1,d} \; \vert w_i \vert \le 1.0$
$E_{out}=E_{in} \pm \sigma$
$\max \sigma = \frac{1}{2\sqrt{N}}$
$\mathbf{w} = [w_0, w_1, w_2, \ldots, w_d]^\intercal$
$\mathbf{w} = [ w_1, w_2, \ldots, w_d]^\intercal$
$y = \{+1, -1\}$
$s = \frac{1}{2}at^2$f(x) -
$s = c_0 + c_1t+ c_2t^2+ c_3t^3$
$\mathbb{E}_\mathbf{x} \Vert f(\mathbf{x}) - g^D(\mathbf{x}) \Vert^2 = \int {\Vert f(\mathbf{x}) - g^D(\mathbf{x}) \Vert^2 p(\mathbf{x})d\mathbf{x}}$
$E_{out} = E_{in} + \lambda {\Vert \tilde{\mathbf{w}}\Vert}^2$
$Z^\intercal Z^\prime=$
$p(\mathbf{w}, \mathcal{D}) = p(\mathbf{w}| \mathcal{D})p(\mathcal{D}) = p(\mathcal{D}| \mathbf{w})p(\mathbf{w})$
$ w_1x_1 + w_2x_2 + \cdots + w_dx_d \ge b$
${\rm sign}(w_0 \cdot 1 + w_1x_1 + w_2x_2 + \cdots + w_dx_d)$
${\rm sign}(\mathbf{w}^\intercal \mathbf{x})$
$\mathbf{w}^\intercal\mathbf{x}_1=y_1$
$\mathbf{w}^\intercal\mathbf{x}_2=y_2$
$\cdots$
$\mathbf{w}^\intercal\mathbf{x}_N=y_N$
$E = \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i + b - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w}$
$\frac{\partial{E}}{\partial{b}} = 0 \to b = \frac{1}{N} \sum(y_i - \mathbf{w}^\intercal \mathbf{x}_i) = \frac{1}{N} \sum y_i$
$\frac{\partial{E}}{\partial{\mathbf{w}}} = 0 \to (X^\intercal X + \lambda I)\mathbf{w} = y - b$
$$\begin{equation}
\phi(\mathbf{x}) = \exp(-\gamma (x_1^2+x_2^2)) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2, \frac{{(\sqrt{2 \gamma}x_1)}^2}{\sqrt{2!}}, 2 \gamma x_1x_2, \frac{{(\sqrt{2 \gamma}x_2)}^2}{\sqrt{2!}}, o(x_1^2 + x_1x_2 + x_2^2) \right] \end{equation}$$ $$\begin{equation} \phi(\mathbf{x}^\prime) = \exp(-\gamma ({x_1^\prime}^2+{x_2^\prime}^2)) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1^\prime, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2^\prime, \frac{{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}}, 2 \gamma x_1^\prime x_2^\prime, \frac{{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}}, o({x_1^\prime}^2 + x_1^\prime x_2^\prime + {x_2^\prime}^2) \right] \end{equation}$$ $$\begin{equation} \phi(x) = \exp(-\gamma x^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x)}^3}{\sqrt{3!}}, \ldots \right] \end{equation}$$
$$\begin{equation} \phi(x^\prime) = \exp(-\gamma {x^\prime}^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}}, \ldots \right] \end{equation}$$

$$ \frac{[AB]}{[AB]+[A]} = \frac{1}{1+\frac{[A]}{[AB]}} = \frac{1}{1+\frac{K_D}{[B]}} \approx \frac{1}{1+\frac{K_D}{[B_{total}]}}$$

$y={\rm sign}(\mathbf{w}\cdot \mathbf{z}  + b)$
$s_1 = \mathbf{w}_1\mathbf{x}$
$s_i = \mathbf{w}_i\mathbf{x}$
$s_K = \mathbf{w}_K\mathbf{x}$
$f(x) \; x \; x_1 \; x_2 \; x_j \; x_m \; h \; z \; w_1 \; w_2 \; z_1 \; z_2 \; z_i \; z_{m^\prime}  w_{i1} \; w_{i2} \; w_{ij} \; w_{im} $
$s_1 = \mathbf{w}_1\mathbf{z}$
$s_i = \mathbf{w}_i\mathbf{z}$
$s_K = \mathbf{w}_K\mathbf{z}$
$f_1 = \frac{e^{s_1}}{1+e^{s_1}}$
$f_i = \frac{e^{s_i}}{1+e^{s_i}}$
$f_K = \frac{e^{s_K}}{1+e^{s_K}}$
$f_1 = \frac{e^{s_1}}{\sum_j e^{s_j}}$
$f_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$
$f_K = \frac{e^{s_K}}{\sum_j e^{s_j}}$
$s_{ji} = \mathbf{w}_j\mathbf{x}_i$
$f_{ji} = \frac{e^{s_{ji}}}{1+ e^{s_{ji}}}$
$J(\theta_1, \theta_2, \theta_3, \ldots, \theta_i, \ldots, \theta_n)$
$J^\prime(\theta_1, \theta_2, \theta_3, \ldots, \theta_i+\Delta\theta_i, \ldots, \theta_n)$
$\frac{\partial{J}}{\partial{\theta_i}} = \frac{J^\prime - J}{\Delta\theta_i}$
$\cos(x)\sin(y)+\frac{y}{x}$
$\mathbf{h}_{t-1}=\begin{bmatrix} x_{t-3} \\ x_{t-2} \\ x_{t-1} \end{bmatrix}$
$\begin{bmatrix} h_{\Sigma,t-1} \\ h_{\nabla,t-1} \\ h_{mem,t-1} \end{bmatrix} $