Tuesday, December 19, 2017

Multiclass Classifier

In Chapter 4 we discussed how to solve a binary classification problem using either logistic regression (LR) or support vector machines (SVM).  More complex tasks, such as classifying an image from ImageNet into one of the 1000 pre-defined object classes, are multiclass classification problems.  If the true class for a given data record $\mathbf{x}_{i}$ is $k$, we encode its true label $\mathbf{y}_i$ as a one-hot vector (all vector elements are zero, except the $k$-th element, which is one):

$$\begin{equation} y_{ji} = \delta_{kj}, \\
\text{where}\; \delta_{kj} = 1 \;\text{if}\; k = j \;\text{and}\; 0 \;\text{otherwise}\end{equation}.$$
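As a quick illustration, here is a minimal NumPy sketch of this one-hot encoding; the function name and the toy labels are placeholders for illustration only:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels (0-based) as one-hot row vectors."""
    y = np.zeros((len(labels), num_classes))
    y[np.arange(len(labels)), labels] = 1.0
    return y

# e.g. three records with true classes 2, 0, 1 out of K = 3
print(one_hot([2, 0, 1], 3))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```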

Logistic Regression


LR is a linear classifier, i.e., we first seek linear functions, one per class (we omit the bias term for simplicity):

$$\begin{equation}s_{ki} = \mathbf{w}_{k}\mathbf{x}_i\end{equation},$$
where $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, N$.

Since we aim to predict the probability of $\mathbf{x}_i$ belonging to class $k$, we need to convert the vector $\left[ s_{1i}, s_{2i}, \cdots, s_{Ki} \right]$ into a probability vector.  Borrowing the concept behind Equation (3.6), we can use the $softmax$ function to obtain the probability vector as:

$$\begin{equation}f_{ki} =  \frac{e^{s_{ki}}}{\sum_{j=1}^{K} \; e^{s_{ji}}}\end{equation}$$

The above transformation guarantees that the resultant vector satisfies the definition of probability, i.e., all $K$ elements are in $[0, 1]$ and they sum to one (Figure 1).

Figure 1. A neural-network-style diagram for the multiclass LR.  The intermediate features $\mathbf{s}$ are linear combinations of the raw input features $\mathbf{x}$; the $\mathbf{s}$ vector is then normalized by the $softmax$ function into a probability vector $\mathbf{f}$ for output.
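A small NumPy sketch of Equations 2 and 3 follows; the random weights and inputs are placeholders, and subtracting the row-wise maximum before exponentiating is only a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax_scores(W, X):
    """W: (K, D) weight matrix, X: (N, D) data.  Returns (N, K) probabilities."""
    S = X @ W.T                           # s_ki = w_k . x_i  (Equation 2)
    S = S - S.max(axis=1, keepdims=True)  # stabilize the exponentials
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)  # Equation 3

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # K = 3 classes, D = 5 features
X = rng.normal(size=(4, 5))   # N = 4 data points
F = softmax_scores(W, X)
print(F.sum(axis=1))          # each row sums to one
```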

Following the same Bayesian approach as described in Equation (4.10), we then seek the optimal $\mathbf{w}^*$ by minimizing a loss function $L$:

$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \;-y_{ji} \log f_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; -y_{ki} \log f_{ki} \end{aligned}\end{equation}, $$
which is basically a cross-entropy loss term, equivalent to the K-L divergence for one-hot labels (Equation 3.14); in the last step, only the true-class term $j = k$ survives because $\mathbf{y}_i$ is one-hot.  In Equation 4, we try to make our probability model $f_{ji}$ match the observed labels $y_{ji}$ as closely as possible, considering all data points across all possible classes.
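A minimal NumPy sketch of Equation 4, assuming a predicted probability matrix such as the one from the previous sketch; the small epsilon guard against $\log 0$ is an implementation detail, not part of the text:

```python
import numpy as np

def cross_entropy_loss(F, labels):
    """F: (N, K) predicted probabilities; labels: (N,) integer true classes."""
    N = F.shape[0]
    # only the true-class term -y_ki * log f_ki survives the inner sum over j
    return -np.log(F[np.arange(N), labels] + 1e-12).sum()
```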

The gradient descent (GD) method can be applied to minimize Equation 4.  The gradient with respect to the parameters $\mathbf{w}$ is well behaved:

$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}} \left( \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \right) &= \sum_{i=1}^N \; -\frac{y_{ki}}{f_{ki}} \nabla_{\mathbf{w}}f_{ki} \end{aligned} \end{equation} $$.

Considering Equation 3 gives:
$$\begin{equation} \nabla_{\mathbf{w}_k} f_{ki} = f_{ki}(1-f_{ki}) \mathbf{x}_i \end{equation}$$

Combining Equations 5 and 6, the contribution of a data point $i$ with true class $k$ to the gradient with respect to $\mathbf{w}_k$ simplifies to:

$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_k} \left( -\log f_{ki} \right) &= -(1-f_{ki})\mathbf{x}_i \end{aligned} \end{equation}. $$

When our prediction is very wrong, i.e., $f_{ki}$ approaches zero, the gradient has a large magnitude; moving $\mathbf{w}_k$ in the negative gradient direction in the next iteration will increase $f_{ki}$, which is exactly what we want.
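Below is a minimal batch gradient-descent sketch tying Equations 1 through 7 together.  It uses the standard full softmax cross-entropy gradient $\sum_i (f_{ji} - y_{ji})\mathbf{x}_i$ with respect to $\mathbf{w}_j$, whose true-class component is exactly Equation 7; the learning rate and iteration count are arbitrary placeholders:

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)

def fit_multiclass_lr(X, labels, K, lr=0.1, n_iter=500):
    """Batch gradient descent for multiclass LR; X: (N, D), labels: (N,) ints."""
    N, D = X.shape
    Y = np.zeros((N, K))
    Y[np.arange(N), labels] = 1.0          # one-hot targets (Equation 1)
    W = np.zeros((K, D))
    for _ in range(n_iter):
        F = softmax(X @ W.T)               # Equations 2-3
        grad = (F - Y).T @ X               # full gradient; the true-class rows
        W -= lr * grad / N                 # reduce to Equation 7
    return W
```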

Readers unfamiliar with the Bayesian framework might intuitively propose minimizing the least-square error as an alternative loss function of the form:
$$\begin{equation} \begin{aligned} \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} &= \sum_{i=1}^N \; \sum_{j=1}^K \; {( f_{ji} - y_{ji})}^2. \end{aligned} \end{equation} $$

However, this form is inferior and should be avoided.  The reason is that the gradient of each term in Equation 8 with respect to $\mathbf{w}_j$ can be derived as:
$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_j} L_{ji} &= 2(f_{ji}-y_{ji})f_{ji}(1-f_{ji}) \mathbf{x}_i. \end{aligned} \end{equation}$$

This poses a serious issue.  Imagine the true class label $y_{ji}$ is one, but our prediction $f_{ji}$ is very small.  The feedback $(f_{ji}-y_{ji})$ is attenuated by the very small factor $f_{ji}$, so the signal for tuning $\mathbf{w}$ in the next round is too weak.  The same is true when $y_{ji}$ is zero and our prediction is very wrong with $f_{ji}$ close to one.  Basically, when we make a very wrong prediction, i.e., $f_{ji}$ is saturated near zero or one, the gradient becomes too weak and $\mathbf{w}$ can get stuck at the wrong value.  Therefore, for classification problems, we always use cross-entropy as the loss function and never the least-square form.
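A short numerical check of this argument: for a confidently wrong prediction of the true class, the least-square gradient factor from Equation 9 nearly vanishes while the cross-entropy factor from Equation 7 stays large (the specific numbers are only illustrative):

```python
f, y = 0.01, 1.0                        # confidently wrong prediction of the true class
ce_factor = -(1.0 - f)                  # cross-entropy gradient factor (Equation 7)
ls_factor = 2 * (f - y) * f * (1 - f)   # least-square gradient factor (Equation 9)
print(ce_factor)   # -0.99        -> strong corrective signal
print(ls_factor)   # about -0.0196 -> almost no signal
```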

Multitask Classifier


There is a related problem called multitask classification.  For instance, for a given compound, we would like to predict whether it is active in a panel of $K$ assays [1].  Here the status is non-exclusive, i.e., a compound can be active in multiple assays.  We basically predict a target vector of $K$ elements, where each element $y_{ji}$ can be either 0 or 1, independently.  This can be solved by treating the task as constructing $K$ independent binary LR classifiers (Figure 2), where for the $j$-th classifier:

$$\begin{equation} \begin{aligned} s_{ji} &= \mathbf{w}_{j}\mathbf{x}_i, \\
f_{ji} &= \frac{1}{1+ e^{-s_{ji}}}. \end{aligned} \end{equation}$$

Figure 2. A neural-network-style diagram for the multitask LR.  The intermediate features $\mathbf{s}$ are linear combinations of the raw input features $\mathbf{x}$; each element of $\mathbf{s}$ is then individually mapped by the logistic transformation into a probability value for the corresponding class membership in the output.
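A minimal NumPy sketch of Equation 10, producing $K$ independent sigmoid outputs from the same input (the random weights and sizes are placeholders):

```python
import numpy as np

def multitask_lr_predict(W, X):
    """W: (K, D), one weight vector per task; X: (N, D).  Returns (N, K) probabilities."""
    S = X @ W.T                      # s_ji = w_j . x_i
    return 1.0 / (1.0 + np.exp(-S))  # element-wise logistic; no normalization across tasks

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))     # K = 4 independent tasks, D = 6 features
X = rng.normal(size=(2, 6))
print(multitask_lr_predict(W, X))  # rows need not sum to one
```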

The reason we mention the multitask classifier is that in many cases we do not apply LR to the initial input feature vectors $\mathbf{x}$, but instead apply a non-linear transformation $\mathbf{x} \to \mathbf{z}$ first and then do the linear classification (see Equation 4.23).  The multitask classification described here becomes a very effective learning solution when the $K$ tasks share the same underlying transformation function $\mathbf{z} = \phi(\mathbf{x})$, where $\phi$ is independent of the task.  In future deep learning chapters, we will learn that $\mathbf{z}$ is a new set of features learned from the raw input $\mathbf{x}$, and the $K$ tasks use the same learned features for different classification purposes (Figure 3).  By sharing the $\phi$ function, the learning of the optimal feature transformation can take advantage of the training data sets for all $K$ tasks, instead of training a separate $\phi$ function per task.  The aggregated training data provide a better learning curve and can lead to a more robust $\phi$ and, therefore, more robust new features $\mathbf{z}$.

Figure 3. A neural-network-style diagram for the multitask LR with feature sharing.  The raw input features $\mathbf{x}$ are first transformed, typically nonlinearly, into a new feature space $\mathbf{z}$, which is then used for the multitask classification problem in exactly the same manner as shown in Figure 2.  The main difference is that the transformed features $\mathbf{z}$ are shared by all tasks; therefore, the $\mathbf{x} \to \mathbf{z}$ transformation can be trained using all training data points regardless of their task labels, instead of being trained independently, one per task.
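A sketch of the shared-feature architecture in Figure 3, with one nonlinear transformation $\phi$ feeding $K$ task-specific logistic heads; the hidden size and the choice of tanh for $\phi$ are assumptions made only for illustration:

```python
import numpy as np

def shared_feature_multitask(X, W_phi, W_tasks):
    """X: (N, D); W_phi: (H, D) shared transform; W_tasks: (K, H) per-task heads."""
    Z = np.tanh(X @ W_phi.T)         # shared z = phi(x), trainable on all tasks' data
    S = Z @ W_tasks.T                # task-specific linear scores on z
    return 1.0 / (1.0 + np.exp(-S))  # independent logistic output per task

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))          # N = 3, D = 8
W_phi = rng.normal(size=(5, 8))      # H = 5 shared features
W_tasks = rng.normal(size=(4, 5))    # K = 4 tasks
print(shared_feature_multitask(X, W_phi, W_tasks).shape)  # (3, 4)
```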


Support Vector Machines


SVM is designed as a binary classifier; therefore, there are typically two ways to apply SVM to multiclass classification [2].  The first choice is to build $K$ SVM classifiers, each trained with data from class $k$ against data from all the remaining classes, i.e., the so-called one-vs-all approach.  Then the class corresponding to the largest margin (i.e., the boundary the data point is furthest away from, on the correct side) is chosen as the predicted label.  The second choice is to build $K(K-1)/2$ pairwise SVMs, one between the data sets of every two classes, and the class collecting the most winning votes is predicted; this is called the one-vs-one approach.  Even if all $K$ training data sets are roughly of the same size, one-vs-all training sets can be unbalanced (one class against the other $K-1$ combined), so one-vs-one might be preferred.  In addition, one-vs-one is typically faster to train due to the much smaller training size involved per classifier.  The popular library libsvm uses one-vs-one [3].
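For reference, scikit-learn exposes both strategies through wrapper classes; here is a minimal usage sketch with a toy data set (the data-generation parameters are arbitrary, and LinearSVC serves only as the underlying binary classifier):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# toy 3-class problem; parameters are arbitrary
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # K binary classifiers
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # K(K-1)/2 pairwise classifiers
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```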

Here, instead, we apply SVM directly to the multiclass classification problem using a soft-margin SVM loss function.  The basic idea of the soft-margin SVM is to only penalize those data points that are not far enough from a predefined margin $\Delta$.  Data points that are already correctly predicted and have a distance greater than $\Delta$ receive no penalty.  Data points that are even further away do not receive extra reward either.  With this in mind, we can write the loss function for a multiclass SVM classifier and train all classes in one shot, a.k.a. a "single machine" approach.  Assume $s_{ji}$ represents the distance of data point $i$ to the linear boundary of class $j$.  For data point $i$ with class label $y_i = k$, it is a clean prediction if its distance to class $k$, $s_{ki}$, is at least $\Delta$ greater than its distances to all the remaining $K-1$ classes; otherwise, a penalty occurs [4]:

$$\begin{equation} \begin{aligned}\mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \;  L_{i} \\
&= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1, j \neq y_i}^K \;  \max(s_{ji} - s_{y_i i} + \Delta, 0) \end{aligned} \end{equation}. $$
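A NumPy sketch of the single-machine loss in Equation 11 (the score matrix and labels are placeholders, and $\Delta$ defaults to 1 for illustration):

```python
import numpy as np

def multiclass_hinge_loss(S, labels, delta=1.0):
    """S: (N, K) scores s_ji; labels: (N,) integer true classes y_i."""
    N = S.shape[0]
    true_scores = S[np.arange(N), labels][:, None]      # s_{y_i i}
    margins = np.maximum(S - true_scores + delta, 0.0)  # max(s_ji - s_{y_i i} + Delta, 0)
    margins[np.arange(N), labels] = 0.0                 # skip the j = y_i term
    return margins.sum()
```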

It should be noted that the above is not the only loss form for a multiclass SVM; there are other formulations, such as the one proposed in the original multiclass SVM paper [5].  However, in deep learning applications we rarely see the multiclass SVM being used, probably because it does not provide probabilistic predictions and is less well documented.  Multiclass LR has become the method of choice.

References



  1. https://arxiv.org/pdf/1406.1231.pdf
  2. http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf
  3. https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f419
  4. http://cs231n.github.io/linear-classify/
  5. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf