Thursday, May 25, 2017

Chapter 4. Linear Models

We have already seen linear models in previous chapters.  Here we are going to study linear models more systematically, as they are the foundation for understanding more complex models.  In many cases, non-linear models can be reduced to linear models, and we will see how that works at the end of the chapter.

Linear Classification – Accuracy


Consider a classification problem with inputs $X$ and labels $\mathbf{y}$, where $X$ is an $N \times (d+1)$ matrix and $\mathbf{y}$ is an $N \times 1$ vector with elements $\in \left\{-1, +1\right\}$.  Notice we prepend an element 1 as the first dimension of each input vector, so that the math later can be written more concisely.  Please see the end of Chapter 1 for $X$ and $\mathbf{y}$ spelled out in matrix form (Chapter 1, Equations (1)-(4)).

A linear classifier uses model:
$$\begin{equation} \mathbf{y} = {\rm sign}(X \mathbf{w}) \end{equation},$$
where $\mathbf{w}$ is the $(d + 1)$-element weight vector $[ w_0, w_1, w_2, \ldots, w_d ]^\intercal$.  We incorporated the extra first dimension into $X$ so that $w_0$ can be part of $\mathbf{w}$.

A natural choice of the cost function to maximize accuracy is:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert \end{equation}. $$
This is not quite complete; now that we are ML-educated, we should also take the regularization error into account:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert + \lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}} \end{equation}, $$
where $\tilde{\mathbf{w}} = [ w_1, w_2, \ldots, w_d ]^\intercal$.  We use ridge regularization here for simplicity, but it can certainly be replaced by whatever regularization form you prefer.  Because the ${\rm sign}(\cdot)$ function is not differentiable (its derivative is zero wherever it is defined), we can only solve this equation numerically and iteratively.

At a certain step $n$, our guess of the answer is $\mathbf{w}_n$.  Let us assume point $\mathbf{x}_1$ is classified correctly and $\mathbf{x}_2$ is misclassified.  We consider making the next guess $\mathbf{w}_{n+1}$ within a small neighborhood of $\mathbf{w}_n$.  Let us assume the change in the prediction for $\mathbf{x}_1$ is small and ignore that part for now, as $\mathbf{w}_n$ is already good from its perspective.  We need to make $\mathbf{w}_{n+1}$ a better predictor for $\mathbf{x}_2$.  If $y_2$ is $+1$, then $\mathbf{w}_n^\intercal \mathbf{x}_2$ is currently negative (as $\mathbf{x}_2$ is misclassified), so it makes sense for $\mathbf{w}_{n+1}$ to incorporate a positive component, i.e., $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 > 0$.  Otherwise, if $y_2$ is $-1$, we would like $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 < 0$, so that we have a better chance of correctly predicting $\mathbf{x}_2$ in the next iteration.

Considering both cases, we can design:
$$\begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_n + \eta y_2 \mathbf{x}_2, \end{equation}$$
so that $\mathbf{w}_{n+1}^\intercal \mathbf{x}_2$ has an additional term $\eta y_2 \mathbf{x}_2^\intercal \mathbf{x}_2$, which always moves $\mathbf{w}$ in a direction that makes the classification outcome for $\mathbf{x}_2$ more favorable.  $\eta$ is a positive parameter called the learning rate.  $\eta$ should be large enough that we can explore the $\mathbf{w}$ space quickly, while also being small enough that we do not jump over and miss the minimum.  Empirically, let us assume $\eta$ is 0.1 without going off topic [1].

So the learning algorithm goes as follows: it starts with $\mathbf{w}_0$; if there are points that are misclassified, it randomly picks one of them and uses it to alter $\mathbf{w}$ according to Equation (4), and then it loops this process.  If the data points are indeed linearly separable and there is no regularization error term, it can be proven that our algorithm will eventually converge and all data points will be $100\%$ correctly classified, so the minimization algorithm works.  In reality, the algorithm may never stop, as not all data points can be correctly classified at the same time; we can memorize the solution associated with the minimum cost identified so far (the "pocket" trick) and use that solution as our final answer $\mathbf{w}^*$.

What would be a good starting guess for $\mathbf{w}_0$?  If the data are reasonably separable, we expect the data points to form two clusters: points with the same label are expected to be closer together and further away from points with the opposite label (Figure 4.1).  Therefore, a reasonable decision plane has its normal vector $\mathbf{w}$ pointing from the center of the $-1$ cluster to the center of the $+1$ cluster.  A good initial guess is:
$$\begin{equation} \mathbf{w}_0 = \frac{1}{N^{+1}} \sum \mathbf{x} \vert_{y = +1} - \frac{1}{N^{-1}} \sum \mathbf{x} \vert_{ y = -1}. \end{equation}$$
$\mathbf{w}_0$ can certainly be scaled to adjust the initial regularization cost.  The optimal hyper-parameter $\lambda$ will be determined through the cross-validation routine described in the previous chapter.  A short code sketch after Figure 4.1 puts these pieces together.
Figure 4.1.  A reasonable choice of $\mathbf{w}_0$ is the one that points from $\mu_{-}$ to $\mu_{+}$.
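Below is a minimal NumPy sketch of this pocket-style learning loop, using the mean-difference initialization of Equation (5) and the update of Equation (4).  For simplicity it drops the regularization term of Equation (3) and just tracks the misclassification count; the function name `pocket_perceptron` and the toy data are illustrative, not from any library.

```python
import numpy as np

def pocket_perceptron(X, y, eta=0.1, n_iter=1000, seed=None):
    """X is N x (d+1) with a leading column of 1s, y has elements in {-1, +1}.
    Returns the best weight vector found so far (kept 'in the pocket')."""
    rng = np.random.default_rng(seed)
    # Equation (5): vector from the -1 cluster mean to the +1 cluster mean
    w = X[y == +1].mean(axis=0) - X[y == -1].mean(axis=0)
    best_w, best_errors = w.copy(), np.inf
    for _ in range(n_iter):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if misclassified.size < best_errors:      # remember the best w seen so far
            best_w, best_errors = w.copy(), misclassified.size
        if misclassified.size == 0:               # linearly separable: we are done
            break
        i = rng.choice(misclassified)             # Equation (4) update on a random mistake
        w = w + eta * y[i] * X[i]
    return best_w

# toy usage: two Gaussian blobs, with the leading 1 prepended to each input
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
X_aug = np.hstack([np.ones((100, 1)), X])
y = np.array([+1] * 50 + [-1] * 50)
w_star = pocket_perceptron(X_aug, y, seed=1)
print("training accuracy:", np.mean(np.sign(X_aug @ w_star) == y))
```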

Logistic Regression – Cross-Entropy Loss Function


The truth is not always black and white in classification problems.  Applicants with the same salary, age, and other feature values could behave differently: some turn out to be money makers for the bank and some money losers.  Among those who are creditworthy, there are some we are extremely confident are money makers, and the bank could offer them more attractive terms and start with a generous credit limit; others are marginal and should either be deprioritized or approved under more stringent terms, including a lower credit limit.  Therefore, it would be nice to model the probability of a data point being $+1$ or $-1$.  The probability prediction could be handy, e.g., for a cell phone company trying to retain customers: those predicted to churn with high probability should be prioritized for the retention department to work on.  Even if we do not care about probability, minimizing a probability cost function could lead to better classification results, as it encourages the classes to be pushed further apart and results in a more robust classifier [2].

This probabilistic classification problem can be formulated as follows: the truth is a conditional distribution $p(y \vert \mathbf{x})$.  We are given $N$ records sampled from the joint distribution $p(\mathbf{x}, y)$, and we would like to model $p(y = +1 \vert \mathbf{x})$ with a family of hypotheses $f(\mathbf{x})$.   The probability of the $-1$ class is then $p(y = -1 \vert \mathbf{x}) = 1-f(\mathbf{x})$.  The optimal hypothesis $g$ is the one that gives the maximum likelihood based on the given data, subject to the constraint of the regularization error.  We can use a linear model, called Logistic Regression (LR), to solve this problem.

Our LR hypothesis space for modeling $p(+1 |\mathbf{x})$ is:
$$\begin{equation}
f(s) = {e^s}/(1+e^s),
\end{equation}$$
where
$$\begin{equation}
s = \mathbf{w}^\intercal \mathbf{x}.
\end{equation}$$
The logistic function is convenient, because the probability of obtaining $-1$ label is:
$$\begin{equation} \begin{aligned}
1-f(s) &= 1/(1+e^s) \\ & = f(-s)
\end{aligned} \end{equation}.$$
In other words, the probability of a data point with label $y$ is:
$$\begin{equation}
p(y_i \vert \mathbf{x}_i) = f(y_i \mathbf{w}^\intercal \mathbf{x}_i).
\end{equation}$$

That is, we first calculate a linear score $s$ and hope that the more positive the score is, the more likely the label is $+1$, and the more negative the score, the more likely the label is $-1$.  We then somewhat arbitrarily convert this linear score from the range $(-\infty, +\infty)$ into the range $(0, 1)$ using Equation (6), so that it has a chance to be interpreted as a probability score.
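A few lines of NumPy are enough to check the symmetry in Equation (8) numerically; `logistic` below is a local helper, not a library call.

```python
import numpy as np

def logistic(s):
    # Equation (6), written as 1 / (1 + e^{-s}), which equals e^s / (1 + e^s)
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-5.0, 5.0, 11)
# Equation (8): 1 - f(s) = f(-s), which is why p(y | x) collapses to f(y * w^T x)
print(np.allclose(1.0 - logistic(s), logistic(-s)))   # prints True
```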

Let us use the Bayesian approach to find $\mathbf{w}^*$:
$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\max_{\mathbf{w}} \; p(\mathbf{w} \vert X, \mathbf{y}) \\
&= \arg\max_{\mathbf{w}} \; \frac{p(\mathbf{w}) \, \prod_{i=1}^N \; p(y_i \vert \mathbf{x}_i, \mathbf{w})}{p(\mathbf{y} \vert X)}. \end{aligned} \end{equation}$$
The denominator is independent of $\mathbf{w}$ and can be ignored.  We can then minimize the negative log likelihood of the numerator:
$$\begin{equation}
\mathbf{w}^* =  \arg\min_{\mathbf{w}} \; - \sum_{y_i=+1}\log(f(\mathbf{w}^\intercal \mathbf{x}_i)) - \sum_{y_i=-1}\log(1-f(\mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
Using Equation (9), it can be simplified into:
$$\begin{equation}
\mathbf{w}^* = \arg\min_{\mathbf{w}} \; - \sum_{i=1}^N \; \log(f(y_i \mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
The first term is the so-called cross-entropy loss function, an important error measure when we would like to model a probability distribution.  This term is simply the sum of the negative log of the correct-class probabilities of all data points, i.e., we want the probability of each data point belonging to its true class to be maximized.  The second term, coming from the prior $p(\mathbf{w})$, plays the role of the regularization error; for example, a Gaussian prior yields the ridge penalty with a parameter $\lambda$, which we have already discussed in Chapter 3.  In addition, $\mathbf{w}$ should be replaced by $\tilde{\mathbf{w}}$ in the second term.

Equation (12) has no analytical solution; we have to solve it numerically.  Let us introduce the idea of gradient descent.  To minimize a general function $f(\mathbf{x})$, we start with an initial guess $\mathbf{x}_0$.  The gradient at $\mathbf{x}_0$ is $\nabla f(\mathbf{x}_0)$, pointing in the direction where $f(\mathbf{x})$ increases fastest.  Therefore, we should search in the opposite direction.  Our next guess would be:
$$\begin{equation} \mathbf{x}_1 = \mathbf{x}_0 - \eta \nabla f(\mathbf{x}_0) \end{equation}.$$
A first-order Taylor expansion shows $f(\mathbf{x}_1) - f(\mathbf{x}_0) \approx -\eta \Vert \nabla f(\mathbf{x}_0) \Vert^2$, which is negative for small enough $\eta$.  We can iterate this process to find $\mathbf{x}_2$, $\mathbf{x}_3$, …, until the process converges or we use up our computational quota.  Searching for smaller $f$ values along the negative gradient direction explains the name gradient descent (GD).  Among the many optimization algorithms available, GD is popular in machine learning because a variant called Stochastic Gradient Descent (SGD) enables us to scale the algorithm to large training data sets [3].  It has been proven that Equation (12) has only one minimum (the loss is convex), which makes it very straightforward to solve, and we will not get into the proof here.
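Here is a minimal gradient-descent sketch for Equation (12), assuming a Gaussian prior so that $-\log p(\mathbf{w})$ contributes a ridge penalty $\lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}}$ that leaves $w_0$ unregularized; the function names, the fixed learning rate, and the iteration count are illustrative choices, not from any library.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))             # same as e^s / (1 + e^s)

def logistic_regression_gd(X, y, lam=0.1, eta=0.1, n_iter=2000):
    """Minimize Equation (12) by gradient descent.
    X is N x (d+1) with a leading column of 1s, y has elements in {-1, +1}."""
    N, d1 = X.shape
    w = np.zeros(d1)
    mask = np.ones(d1)
    mask[0] = 0.0                               # leave w_0 out of the penalty
    for _ in range(n_iter):
        s = y * (X @ w)                         # y_i * w^T x_i
        # gradient of the cross-entropy term: -sum_i f(-y_i w^T x_i) y_i x_i
        grad = -(X * (sigmoid(-s) * y)[:, None]).sum(axis=0)
        grad += 2.0 * lam * mask * w            # ridge penalty on w_tilde only
        w -= (eta / N) * grad                   # Equation (13); step scaled by 1/N
    return w

# after training, sigmoid(x_new @ w) estimates p(+1 | x) for a new (1-prepended) input
```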

A question we may have is that we simply coined a function $e^s/(1+e^s)$, fit the parameter $\mathbf{w}$, and then argued that the value of this function has the meaning of the probability $p(+1|\mathbf{x})$.  To us it is a huge leap of faith to associate a function with range $(0, 1)$ with a probability interpretation.  The range $(0, 1)$ is only a necessary condition for a function to carry a probability meaning, not a sufficient one.  At least this is something that bothered me for quite a while; indeed some authors argue the $p$ values given by logistic regression should not be treated as probabilities, but I now believe they should be.  We can agree $f(\mathbf{w}^\intercal \mathbf{x})$ can be used for ranking data points, but why is its value an estimate of a probability?  The answer lies in the fact that the first two terms in Equation (11) are cross-entropy terms.

Imagine there are plenty of data points within a small neighborhood of $\mathbf{x}_i$, of which $n_{+}$ and $n_{-}$ are generated from the two underlying classes, respectively, i.e., the probability of being positive is $p_{+} = \frac{n_{+}}{n_{+}+n_{-}}$.  For this subset of data points, the first two terms in Equation (11) become:

$$\begin{equation} - n_{+} \log f(\mathbf{x}_i) - n_{-}  \log(1-f(\mathbf{x}_i)) \end{equation}. $$

It is straightforward to prove that this cross-entropy term reaches its minimum when $f(\mathbf{x}_i) = p_{+}$; therefore, our optimal solution $f$ is molded to be as close to $p_{+}$ as possible.  The logistic model is one naive guess to mimic $p_{+}$, but if someone comes up with a better guess $f^\prime$, that model would be scored more favorably.  So by minimizing our error function in Equation (11), $f$, the winning model, approaches $p_{+}$ and can therefore be interpreted as a proxy for $p_{+}$.  Certainly $f$ might not be an accurate proxy if the true shape of $p_{+}$ is far from a logistic shape.
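Setting the derivative of Equation (14) with respect to $f$ (writing $f$ for $f(\mathbf{x}_i)$) to zero makes the claim explicit:
$$ -\frac{n_{+}}{f} + \frac{n_{-}}{1-f} = 0 \quad\Longrightarrow\quad f = \frac{n_{+}}{n_{+}+n_{-}} = p_{+}. $$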

Linear Regression – Maximum Likelihood


The linear regression problem is to model an unknown true target function $y = f(\mathbf{x})$ using a linear model.  That is, we aim to search within the hypothesis set containing the hyperplanes described by $y = \mathbf{w}^\intercal \mathbf{x}$ for a hypothesis $g$, so that $g(\mathbf{x}) \approx f(\mathbf{x})$.

The quality of $g$ can be measured by least square error (discussed in Chapter 2):
$$\begin{equation} \begin{aligned}
e(\mathbf{w}) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} - \mathbf{y})^\intercal (X\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w}.
\end{aligned} \end{equation}$$
Again, the second term is a regularization term; although we use the ridge regression form, feel free to use Lasso or other forms.  The ridge form, however, gives a beautiful analytical solution obtained by setting the derivative of Equation (15) with respect to $\mathbf{w}$ to zero:
$$\begin{equation} (X^\intercal X + \lambda I)\mathbf{w} = X^\intercal \mathbf{y} \end{equation},$$
$$\begin{equation} \mathbf{w}^* = {(X^\intercal X + \lambda I)}^{-1}X^\intercal \mathbf{y} \end{equation}.$$
Notice:
$$\begin{equation} X^\intercal \mathbf{y} = \sum_i \; y_i \mathbf{x}_i \end{equation},$$
the final solution is a linear combination of all input vectors, i.e., all data points contribute.

In practice, we do not use Equation (17).  This is because the regularization term is actually $\lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}}$, so we either have to use the GD method, or we need to center the input vectors in order for an analytical solution to work [4].  Let us resort to the form where $w_0$ is taken out of the vector $\mathbf{w}$ and written as $b$:
$$\begin{equation} \begin{aligned}
e(\mathbf{w}, b) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i + b - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} + b\mathbf{1} - \mathbf{y})^\intercal (X\mathbf{w} + b\mathbf{1} - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w},
\end{aligned} \end{equation}$$
where $\mathbf{1}$ is the $N \times 1$ vector of ones and $X$ no longer carries the leading column of ones.
To minimize $e$, we set the partial derivatives with respect to $\mathbf{w}$ and $b$ to zero and obtain: $$\begin{equation} \begin{aligned} b &= \frac{1}{N} \sum(y_i - \mathbf{w}^\intercal \mathbf{x}_i) \\ &= \frac{1}{N} \sum y_i \end{aligned} \end{equation}$$ $$\begin{equation} \begin{aligned} (X^\intercal X + \lambda I)\mathbf{w} &= X^\intercal(\mathbf{y} - b\mathbf{1}). \end{aligned} \end{equation}$$ The second equality in Equation (20) takes advantage of the assumption that the inputs $\mathbf{x}_i$ have been zero-centered. Equation (21) is basically what we saw before, with $\mathbf{y}$ replaced by $\mathbf{y} - b\mathbf{1}$. Therefore, linear regression can still be solved analytically with the regularization term.
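A minimal NumPy sketch of this centered solution (Equations (19)-(21)); here `X` has no leading column of ones since the bias $b$ is handled separately, and `ridge_fit` and the toy data are illustrative, not from any library.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Centered ridge regression: X is N x d (no leading 1s), y is length N.
    Returns (w, b) so that predictions are X @ w + b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # zero-center inputs and targets
    # Equation (21): (X^T X + lambda I) w = X^T (y - b), on the centered data
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - w @ x_mean                  # undo the centering for raw inputs
    return w, b

# toy usage: y = 3 x_1 - 2 x_2 + 1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + 0.1 * rng.normal(size=200)
w, b = ridge_fit(X, y, lam=1.0)
print(w, b)                                  # close to [3, -2] and 1
```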
One interesting question: since we can find the optimal hypothesis $g$ through the above analytical form without evaluating all the other, infinitely many, linear hypotheses, does it mean our hypothesis set $\mathcal H$ contains only one hypothesis $g$, so that we do not have to worry about over-fitting?  Obviously too good to be true.

Imagine there are three algorithms: the first one painstakingly goes through a rather tedious grid search in the parameter space, explores a vast set of hypotheses in $\mathcal H$, and eventually finds $\mathbf{w}^*$.  The second one uses the GD method in Equation (13) and identifies $\mathbf{w}^*$ after multiple iterations.  The third one takes advantage of Equation (17) and lands on $\mathbf{w}^*$ in a single step.  They basically explore the same hypothesis space, except the third one is the smartest in avoiding the searches it knows will not come out favorably (or consider that the third one has explored all hypotheses virtually).  Our hypothesis set still contains infinitely many linear hypotheses; the reason we are not testing them individually is that we happen to be able to foresee that their performance is not going to be better than the analytical solution.  This "super vision" only comes after we see $\mathcal{D}$.  In other words, all linear hypotheses in $\mathcal H$ have a chance to be our analytical solution; the best hypothesis $g$ is not known beforehand, but only after $\mathcal D$ is given.  Therefore, what determines the VC dimension here is not how we reach the final answer, but the effective total hypothesis candidate space.

For more complex models, where we do not have an analytical solution, we rely on the GD algorithm to explore the hypothesis space like the second algorithm.  We are not just evaluating one hypothesis per iteration; we effectively explore many hypotheses in the vicinity of the trial hypothesis under evaluation.  It is just that we have the knowledge to believe that hypotheses not along the negative gradient direction are unlikely to perform better than the next hypothesis along the negative gradient direction.  So even if our solution is found after $n$ iteration steps, our hypothesis space is much larger than $n$.  In those cases, the effective hypothesis space explored by GD is indeed smaller than the total hypothesis space, as there are probably other independent local-minimum hypotheses missed by GD.  In the linear regression example, the space that GD does not explore happens to contain no local minimum.  The bottom line is that having an analytical solution does not eliminate the need for regularization.
Figure 4.2.  The effective hypothesis space explored by the Gradient Descent algorithm is larger than the individual red hypothesis points.

Feature Selection Capability of Linear Models


Often only a subset of the features is informative; the rest are irrelevant to the learning problem.  Many algorithms cannot distinguish informative features from noisy ones and are confused by them.  For example, if only a small portion of the features is informative and we use the $k$-nearest neighbor algorithm with Euclidean distance as the similarity metric, where all features are weighted equally, then the distance will be dominated by noise and the performance will be poor.  The goal of feature selection is to identify and eliminate a subset of features without sacrificing too much $E_{in}$.  Then, by using fewer features, our model requires fewer parameters, therefore has a smaller VC complexity, therefore a smaller $E_{reg}$, which eventually leads to an overall smaller $E_{out}$.

A linear model is interesting in that it automatically does feature prioritization.  Consider the two-dimensional two-class example: the weight vector $\mathbf{w}^*$ approximately points from the center of the $-1$ cluster to the center of the $+1$ cluster (Figure 4.1).  In this example, $x_1$ is more informative than $x_2$, and we have $w_1 \gg w_2$, because:
$$\begin{equation} w_i = \frac{\partial{y}}{\partial{x_i}} \end{equation}.$$
So the weight vector, the normal of the hyperplane, points in the direction where the gradient of $y$ is largest; it basically describes how important each feature dimension is and weights the features accordingly.
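A quick self-contained illustration of this effect (synthetic data, bias term ignored for simplicity): fitting a ridge model on one informative feature and one pure-noise feature drives the noise weight toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
x1 = rng.normal(size=N)                     # informative feature
x2 = rng.normal(size=N)                     # pure noise, unrelated to y
y = 2.0 * x1 + 0.1 * rng.normal(size=N)     # y depends only on x1
X = np.column_stack([x1, x2])
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(w)   # w[0] is close to 2, w[1] is close to 0: the noise feature is down-weighted
```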

Non-linear Classification


In Figure 4.3, the boundary between the two classes is obviously non-linear.  One way to deal with this is to use a hypothesis set that includes non-linear boundary models.  Another way, which we prefer, is to transform this problem into a linear problem!

The general idea is that we introduce a mapping function from the original $X$ space into a new space $Z$:
$$\begin{equation} X \to Z: Z = \phi(X).\end{equation}$$
The mapping is non-linear, e.g., $Z$ is a second-order polynomial space, $\mathbf{z} = \left[ 1, x_1, x_2, x_1^2, x_1x_2, x_2^2 \right]$.  Here we project a three-dimensional data point $[1, x_1, x_2]$ into six dimensions.  Hopefully, the non-linear transformation has absorbed the non-linearity in the data, so that when we examine the data points $[z_0, z_1, z_2, \ldots , z_m]$ in the $Z$ space (with $m+1$ dimensions), a linear model works well there:
$$\begin{equation} \begin{aligned} y &= {\rm sign} (\mathbf{w}^\intercal \mathbf{z}) \\ &= {\rm sign} (\mathbf{w}^\intercal \phi(\mathbf{x})). \end{aligned} \end{equation}$$
In the $Z$ space, we use the linear solvers discussed above to find the optimal weights $\mathbf{w}^* = \left[ w_0, w_1, \ldots,  w_m \right]$.  Let us assume that in our particular example the optimal solution is $\mathbf{w}^* = (1, 0, 0, -1, 0, -1)$.  The decision function in the $X$ space will then be non-linear:
$$\begin{equation} \begin{aligned} y &= {\rm sign}(\mathbf{w}^{*\intercal}\mathbf{z}) \\ &={\rm sign}(1-z_3-z_5) \\ &= {\rm sign}(1-x_1^2 - x_2^2). \end{aligned} \end{equation}$$
We have shown how to obtain a good non-linear model while the core computation of our learning algorithm takes place in a linear space $Z$.  We will discuss more about how to automatically include non-linearity using a Support Vector Machine classifier in the next chapter.  Please read the corresponding LFD chapters for more examples [5].
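A minimal sketch of this $X \to Z$ recipe for a circular boundary like Equation (25); `poly2_features` is a hypothetical helper, the data are synthetic, and a plain least-squares fit on the $\pm 1$ labels stands in for whichever linear solver from earlier in the chapter you prefer.

```python
import numpy as np

def poly2_features(X):
    """Map [x1, x2] to the second-order Z space [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

# toy data: +1 inside the unit circle, -1 outside (the boundary of Equation (25))
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, +1, -1)

Z = poly2_features(X)                        # learning now happens in the linear Z space
w = np.linalg.lstsq(Z, y.astype(float), rcond=None)[0]
pred = np.sign(Z @ w)                        # linear in Z, non-linear (circular) in X
print("training accuracy:", np.mean(pred == y))
```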

Figure 4.3. Relying on a non-linear mapping function, a linear hyperplane works well in the $Z$ space.  The decision plane corresponds to a non-linear decision boundary in the original $X$ space.



Regularization Again


A puzzling question regarding the regularization term is why models with larger $\mathbf{w}$ are considered more complex.  If our model is non-linear, or our linear model lives in the transformed $Z$ space, some weight parameters represent higher-order terms of $\mathbf{x}$, i.e., more complex hypotheses, and reducing those weights does reduce the model complexity, which is not hard to understand.  The challenge is: if we are in the original linear $X$ space, why is a hyperplane with larger $\mathbf{w}$ considered more complex?

For a linear regression problem, if both $\mathbf{x}$ and $y$ are normalized so that they are centered around zero, we could appreciate that lines with $\mathbf{w}$ closer to zero might be more likely, as they are rather plain hypotheses.  A better way to think about this is: given two hypotheses with similar $E_{in}$, i.e., their predictions are about the same, $y \approx \mathbf{w_1}^\intercal \mathbf{x} \approx \mathbf{w_2}^\intercal \mathbf{x}$, since the gradient $\nabla_{\mathbf{x}} y = \mathbf{w}$, the hypothesis with smaller $\mathbf{w}$ has a smaller gradient with respect to $\mathbf{x}$.  This means it can tolerate larger input noise $\sigma_{\mathbf{x}}$ while its prediction remains essentially unchanged; such hypotheses should be more generalizable.  Also notice this argument naturally excludes $w_0$ from regularization, as it is not present in the gradient expression.

For a linear classification problem, one can scale $\mathbf{w}$ by any positive value and retain the same accuracy, as $y = \rm{sign}(\mathbf{w}^\intercal \mathbf{x})$ remains unchanged for all $c\cdot \mathbf{w}$ with $c > 0$.  On one hand, this means our hypothesis space is redundant; therefore, reducing the redundancy by introducing regularization makes the hypothesis search easier.  On the other hand, similar to the regression case, a smaller $\mathbf{w}$ means the transition from $-1$ to $+1$ as we move from one class of $\mathbf{x}$ to the opposite class is smoother.  When the data points are linearly separable in the logistic regression case, the optimal solution without regularization is $\mathbf{w} \to \infty$, where the probability function becomes a sharp step function; this is basically an overfit [6].  Constraining $\mathbf{w}$ to be small essentially excludes such step functions.  A small $\mathbf{w}$ means $y$ changes slowly, which corresponds to maximizing the margin of the decision boundary, as we will see in the next chapter.

Extra – Multiclass Classifier

What we described here are binary classifiers; in reality many problems involve multiple target labels, such as identifying the object in a given picture, where the object can be one of 1000 candidates.  Popular approaches for multiclass classifiers are described in another blog post.

Reference


  1. LFD, pages 94-95.
  2. DLB, page 270.
  3. LFD, pages 97-98.
  4. Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning, Second Edition, pages 63-64.
  5. LFD, pages 99-104.
  6. Christopher M. Bishop. Pattern Recognition and Machine Learning, page 206.
    "It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $\mathbf{w}^\intercal \mathbf{x} = 0$, separates the two classes and the magnitude of $\mathbf{w}$ goes to infinity. $\cdots$ Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable.  The singularity can be avoided by inclusion of a prior and finding a MAP solution for $\mathbf{w}$, or equivalently by adding a regularization term to the error function."
