$$\begin{equation} y_{ji} = \delta_{kj}, \quad \text{where}\; \delta_{kj} = 1 \;\text{iff}\; k = j. \end{equation}$$
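For concreteness, here is a minimal numpy sketch of building such one-hot target vectors; the array names and the example labels are ours, purely illustrative:

```python
import numpy as np

# Hypothetical integer class labels y_i for N = 5 data points and K = 3 classes.
labels = np.array([0, 2, 1, 0, 2])
K, N = 3, labels.size

# Y[j, i] = 1 iff data point i belongs to class j, i.e., the Kronecker delta above.
Y = np.zeros((K, N))
Y[labels, np.arange(N)] = 1
print(Y)
```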
Logistic Regression
LR is a linear classifier, i.e., we first seek linear functions, one per class (we omit the bias term for simplicity):
$$\begin{equation}s_{ki} = \mathbf{w}_{k}\mathbf{x}_i\end{equation},$$
where $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, N$.
Since we aim to predict the probability of $\mathbf{x}_i$ belonging to class $k$, we need to convert the vector $\left[ s_{1i}, s_{2i}, \cdots, s_{Ki} \right]$ into a probability vector. Borrowing the concept behind Equation (3.6), we can use the softmax function to obtain the probability vector as:
$$\begin{equation}f_{ki} = \frac{e^{s_{ki}}}{\sum_{j=1}^{K} \; e^{s_{ji}}}\end{equation}$$
The above transformation guarantees that the resultant vector satisfies the definition of a probability distribution, i.e., all $K$ elements are in $[0, 1]$ and they sum to one (Figure 1).
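A minimal numpy sketch of Equations 2 and 3 follows; the matrix layout and the max-subtraction trick for numerical stability are our own choices, not part of the text:

```python
import numpy as np

def softmax_scores(W, X):
    """Class scores s_ki = w_k x_i (Equation 2) and softmax probabilities f_ki (Equation 3).

    W: (K, D) weight matrix, one row w_k per class.
    X: (D, N) data matrix, one column x_i per data point.
    Returns F of shape (K, N); every column is a probability vector.
    """
    S = W @ X                                # scores, shape (K, N)
    S = S - S.max(axis=0, keepdims=True)     # subtract per-column max for numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=0, keepdims=True)

# Tiny example: K = 3 classes, D = 4 features, N = 5 data points.
rng = np.random.default_rng(0)
F = softmax_scores(rng.normal(size=(3, 4)), rng.normal(size=(4, 5)))
print(F.min(), F.max(), F.sum(axis=0))       # entries in [0, 1], columns sum to one
```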
Following the same Bayesian approach as described in Equation (4.10), we then seek the optimal $\mathbf{w}^*$ by minimizing a loss function $L$:
$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \;-y_{ji} \log f_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; -y_{ki} \log f_{ki} \end{aligned}\end{equation}, $$
which is basically a cross-entropy loss term, equivalent here to the K-L divergence (Equation 3.14) since the one-hot targets carry zero entropy. In the last step, only the true-class term $k = y_i$ survives, because $y_{ji} = 0$ for every other class $j$. In Equation 4, we try to make our probability model $f_{ji}$ match the observations $y_{ji}$ as closely as possible, across all data points and all possible classes.
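With one-hot targets, Equation 4 therefore amounts to summing $-\log f_{ki}$ with $k = y_i$ over the data. A hedged numpy sketch, reusing the hypothetical `softmax_scores` above (we average instead of summing, which only rescales the loss and its gradient):

```python
import numpy as np

def cross_entropy_loss(W, X, labels):
    """Mean of -log f_{ki} over data points, with k = y_i (Equation 4, rescaled by 1/N)."""
    F = softmax_scores(W, X)                    # (K, N) predicted probabilities f_ji
    N = labels.size
    return -np.log(F[labels, np.arange(N)]).mean()
```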
The gradient descent (GD) method can be applied to minimize Equation 4. The gradient with respect to the parameters $\mathbf{w}$ is well behaved:
$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}} \left( \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \right) &= \sum_{i=1}^N \; -\frac{y_{ki}}{f_{ki}} \nabla_{\mathbf{w}}f_{ki} \end{aligned} \end{equation} $$.
Differentiating the softmax of Equation 3 gives:
$$\begin{equation} \nabla_{\mathbf{w}_k} f_{ki} = f_{ki}(1-f_{ki}) \mathbf{x}_i \end{equation}$$
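For completeness, Equation 6 is simply the quotient rule applied to Equation 3, combined with $\nabla_{\mathbf{w}_k} s_{ki} = \mathbf{x}_i$:

$$\frac{\partial f_{ki}}{\partial s_{ki}} = \frac{e^{s_{ki}} \sum_{j=1}^{K} e^{s_{ji}} - e^{s_{ki}} e^{s_{ki}}}{\left( \sum_{j=1}^{K} e^{s_{ji}} \right)^2} = f_{ki} - f_{ki}^2 = f_{ki}(1-f_{ki}).$$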
Combining Equations 5 and 6, the gradient contributed by data point $i$, whose true class is $k = y_i$, becomes:
$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_k} L_i = \nabla_{\mathbf{w}_k} \left( -\log f_{ki} \right) &= -(1-f_{ki})\mathbf{x}_i \end{aligned} \end{equation}. $$
When our prediction is very wrong, i.e., $f_{ki}$ approaches zero, the gradient magnitude stays large; $\mathbf{w}_k$ then moves in the negative gradient direction in the next iteration, which leads to an increase in $f_{ki}$, exactly what we want.
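Putting Equations 3 through 7 together, here is a hedged numpy sketch of batch gradient descent for multiclass LR, reusing the `softmax_scores` sketch above. The compact gradient $(\mathbf{F} - \mathbf{Y})\mathbf{X}^\top$ also includes the terms with respect to the other classes' weights, which the text does not derive explicitly; the learning rate, iteration count, and function names are our own choices:

```python
import numpy as np

def fit_multiclass_lr(X, labels, K, lr=0.1, n_iter=500):
    """Batch gradient descent on the cross-entropy loss of Equation 4.

    X: (D, N) data matrix; labels: (N,) integer classes in {0, ..., K-1}.
    Returns the learned (K, D) weight matrix W (bias omitted, as in the text).
    """
    D, N = X.shape
    W = np.zeros((K, D))
    Y = np.zeros((K, N))
    Y[labels, np.arange(N)] = 1          # one-hot targets y_ji
    for _ in range(n_iter):
        F = softmax_scores(W, X)         # predictions f_ji, shape (K, N)
        grad = (F - Y) @ X.T / N         # full gradient of the mean cross-entropy
        W -= lr * grad                   # step against the gradient
    return W
```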
Readers unfamiliar with the Bayesian framework might intuitively propose minimizing the least-squares error as an alternative loss function of the form:
$$\begin{equation} \begin{aligned} \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} &= \sum_{i=1}^N \; \sum_{j=1}^K \; {( f_{ji} - y_{ji})}^2. \end{aligned} \end{equation} $$
However, this form is inferior and should be avoided, because the gradient of Equation 8 can be derived as:
$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_j} L_{ji} &= 2(f_{ji}-y_{ji})f_{ji}(1-f_{ji}) \mathbf{x}_i. \end{aligned} \end{equation}$$
This poses a serious issue. Imagine the true class label $y_{ji}$ is one, but our prediction $f_{ji}$ is very small. The feedback $(f_{ji}-y_{ji})$ is attenuated by the very small factor $f_{ji}$, so the signal to tune $\mathbf{w}$ in the next round is too weak. The same is true when $y_{ji}$ is zero and our prediction is very wrong with $f_{ji}$ close to one, where the factor $(1-f_{ji})$ does the attenuating. Basically, exactly when we make a very wrong prediction, i.e., when $f_{ji}$ saturates near zero or one, the gradient becomes too weak and $\mathbf{w}$ can get stuck at the wrong value. Therefore, for classification problems, we always use cross-entropy as the loss function and never the least-squares form.
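A tiny numerical illustration of this vanishing-gradient argument (the specific numbers are our own, chosen only to show the scale difference):

```python
f, y = 1e-3, 1.0            # very wrong prediction for the true class
# The cross-entropy gradient factor from Equation 7 stays of order one:
ce_factor = -(1.0 - f)                     # ~ -0.999
# The squared-error gradient factor from Equation 9 is crushed by f*(1-f):
sq_factor = 2.0 * (f - y) * f * (1.0 - f)  # ~ -2e-3
print(ce_factor, sq_factor)
```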
Multitask Classifier
There is a related problem called multitask classification. For instance, for a given compound, we would like to predict whether it is active in a panel of $K$ assays [1]. Here the statuses are non-exclusive, i.e., a compound can be active in multiple assays. We basically predict a target vector of $K$ elements, where each target element $y_{ji}$ can be either 0 or 1, independently. This can be solved by treating the task as constructing $K$ independent binary LR classifiers (Figure 2), where for the $j$-th classifier:
$$\begin{equation} \begin{aligned} s_{ji} &= \mathbf{w}_{j}\mathbf{x}_i, \\
f_{ji} &= \frac{1}{1+ e^{-s_{ji}}}. \end{aligned} \end{equation}$$
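A hedged numpy sketch of Equation 10, scoring all $K$ binary tasks at once (the matrix layout is our own choice):

```python
import numpy as np

def multitask_predict(W, X):
    """K independent binary LR outputs per data point (Equation 10).

    W: (K, D), one weight vector w_j per task; X: (D, N).
    Returns F of shape (K, N); entries are independent probabilities,
    so a column does not need to sum to one.
    """
    S = W @ X                            # s_ji = w_j x_i
    return 1.0 / (1.0 + np.exp(-S))      # element-wise sigmoid
```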
The reason we mention the multitask classifier is that in many cases we do not apply LR on the initial input feature vectors $\mathbf{x}$, but instead apply a non-linear transformation $\mathbf{x} \to \mathbf{z}$ first, and then do the linear classification (see Equation 4.23). The multitask classification described here becomes a very effective learning solution when the $K$ tasks share the same underlying transformation function $\mathbf{z} = \phi(\mathbf{x})$, and $\phi$ is independent of the task. In future deep learning chapters, we will learn that $\mathbf{z}$ is a new set of features learned from the raw input $\mathbf{x}$, and the $K$ tasks use the same feature-learning approach for different classification purposes (Figure 3). By sharing the $\phi$ function, the learning of the optimal feature transformation can take advantage of the training data sets for all $K$ tasks, instead of training a separate $\phi$ function per task. The aggregated training data provide a better learning curve and can lead to a more robust $\phi$, and therefore more robust new features $\mathbf{z}$.
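A minimal sketch of this shared-transformation idea, where a single tanh layer stands in for whatever feature learner $\phi$ the deep learning chapters will introduce (the layer choice and all names are hypothetical):

```python
import numpy as np

def shared_phi(X, W1):
    """Shared non-linear transformation z = phi(x), reused by every task.

    W1: (H, D) stands in for the learnable parameters of phi; X: (D, N).
    """
    return np.tanh(W1 @ X)               # learned features Z, shape (H, N)

def multitask_heads(Z, W2):
    """K task-specific linear + sigmoid heads on top of the shared features.

    W2: (K, H). Because all K tasks consume the same Z, training signals
    from every task's labels flow back into the shared phi.
    """
    return 1.0 / (1.0 + np.exp(-(W2 @ Z)))
```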
Support Vector Machines
SVM is designed as a binary classifier; therefore, there are typically two ways to apply SVM to multiclass classification [2]. The first choice is to build $K$ SVM classifiers, each trained with data from class $k$ against the data from all remaining classes, the so-called one-vs-all approach. The class corresponding to the largest margin (i.e., the class from whose SVM boundary the data point is furthest away on the positive side) is then chosen as the predicted label. The second choice is to build $K(K-1)/2$ pairwise SVMs, one for every pair of classes; the class receiving the most winning votes is then chosen, and this is called the one-vs-one approach. If all $K$ training data sets are roughly of the same size, one-vs-all pits one class against the other $K-1$ combined and is therefore unbalanced, so one-vs-one might be preferred. In addition, one-vs-one is typically faster to train due to the much smaller training size involved per classifier. The popular library libsvm uses one-vs-one [3].
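If scikit-learn is available, the two decompositions can be compared directly; this example and the choice of LinearSVC as the base binary classifier are ours, not from the text:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                        # K = 3 classes
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)         # K binary SVMs
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)          # K(K-1)/2 pairwise SVMs
print(len(ovr.estimators_), len(ovo.estimators_))        # 3 and 3 here
```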
Here instead we apply SVM directly to the multiclass classification problem using a soft-margin SVM loss function. The basic idea of soft-margin SVM is to penalize only those data points that are not far enough from a predefined margin $\Delta$. Data points that are already correctly predicted and lie further than $\Delta$ from the boundary receive no penalty, and data points that are even further away do not receive extra reward either. With this in mind, we can write a loss function for a multiclass SVM classifier and train all classes in one shot, a.k.a. a "single machine" approach. Assume $s_{ji}$ represents the score (signed distance) of data point $i$ with respect to the linear boundary of class $j$. A data point $i$ with class label $y_i = k$ is a clean prediction if its score for class $k$, $s_{ki}$, is at least $\Delta$ greater than its scores for all the remaining $K-1$ classes; otherwise, a penalty occurs [4]:
$$\begin{equation} \begin{aligned}\mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; L_{i} \\
&= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1, j \neq y_i}^K \; \max(s_{ji} - s_{y_i i} + \Delta, 0) \end{aligned} \end{equation}. $$
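A hedged numpy sketch of the inner sum of Equation 11, given precomputed scores (the vectorized masking trick is our own):

```python
import numpy as np

def multiclass_hinge_loss(S, labels, delta=1.0):
    """Sum over data points of Equation 11's per-point penalty L_i.

    S: (K, N) scores s_ji; labels: (N,) true classes y_i; delta: the margin.
    """
    N = labels.size
    correct = S[labels, np.arange(N)]                  # s_{y_i i}, shape (N,)
    margins = np.maximum(S - correct + delta, 0.0)     # max(s_ji - s_{y_i i} + delta, 0)
    margins[labels, np.arange(N)] = 0.0                # drop the j = y_i terms
    return margins.sum()
```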
It should be noted that the above is not the only loss form for SVM; there are other forms, such as the one proposed in the original multiclass SVM paper [5]. However, in deep learning applications we rarely see multiclass SVM used, probably because it does not provide a probabilistic prediction and it is not as well documented. Multiclass LR has become the method of choice.
References
- [1] https://arxiv.org/pdf/1406.1231.pdf
- [2] http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf
- [3] https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f419
- [4] http://cs231n.github.io/linear-classify/
- [5] http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf