Tuesday, December 19, 2017

Multiclass Classifier

In Chapter 4 we discussed how to solve a binary classification problem using either logistic regression (LR) or support vector machines (SVM).  More complex tasks, such as classifying an image from ImageNet into one of 1000 pre-defined object classes, are multiclass classification problems.  If the true class for a given data record $\mathbf{x}_{i}$ is $k$, we encode its true label $\mathbf{y}_i$ as a one-hot vector (vector elements are all zero, except that its $k$-th element is one):

$$\begin{equation} y_{ji} = \delta_{kj}, \quad
\text{where}\; \delta_{kj} = 1 \;\text{iff}\; k = j. \end{equation}$$

Logistic Regression


LR is a linear classifier, i.e., we first seek linear functions, one per class (we omit the bias term for simplicity):

$$\begin{equation}s_{ki} = \mathbf{w}_{k}\mathbf{x}_i\end{equation},$$
where $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, N$.

Since we aim to predict the probability of $\mathbf{x}_i$ belonging to class $k$, we need to convert the vector $\left[ s_{1i}, s_{2i}, \cdots, s_{Ki} \right]$ into a probability vector.  Borrowing the concept behind Equation (3.6), we can use the $softmax$ function to obtain the probability vector as:

$$\begin{equation}f_{ki} =  \frac{e^{s_{ki}}}{\sum_{j=1}^{K} \; e^{s_{ji}}}\end{equation}$$

The above transformation guarantees that the resultant vector satisfies the definition of a probability distribution, i.e., all $K$ elements are in $[0, 1]$ and they sum to one (Figure 1).

Figure 1. A neural-network-style diagram for the multiclass LR.  The intermediate features $\mathbf{s}$ are a linear combination of the raw input features $\mathbf{x}$; the $\mathbf{s}$ vector is then normalized by the $softmax$ function into a probability vector $\mathbf{f}$ for output.
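As a concrete illustration (not from the original text), here is a minimal NumPy sketch of Equations 2 and 3; the weight matrix `W` and data matrix `X` are hypothetical, and subtracting the row-wise maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_probabilities(W, X):
    """Compute class probabilities f (Equations 2-3) for all records.

    W: (K, d) weight matrix, one row w_k per class (bias omitted, as in the text).
    X: (N, d) data matrix, one row x_i per record.
    Returns an (N, K) matrix whose rows each sum to one.
    """
    S = X @ W.T                                # s_ki = w_k . x_i, shape (N, K)
    S = S - S.max(axis=1, keepdims=True)       # subtract row max for numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)

# Tiny usage example with random data (hypothetical shapes)
rng = np.random.default_rng(0)
W, X = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))
F = softmax_probabilities(W, X)
assert np.allclose(F.sum(axis=1), 1.0)         # each row is a valid probability vector
```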

Following the same Bayesian approach as described in Equation (4.10), we then seek to obtain the optimal $\mathbf{w}^*$ through minimizing a loss function $L$:

$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \;-y_{ji} \log f_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; -y_{ki} \log f_{ki} \end{aligned}\end{equation}, $$
which is basically a cross-entropy loss term, closely related to the K-L divergence (Equation 3.14); for one-hot labels the two coincide.  In Equation 4, we try to make the model probabilities $f_{ji}$ as close as possible to the observed labels $y_{ji}$, considering all data points across all possible classes.

The gradient descent (GD) method can be applied to minimize Equation 4.  The gradient with respect to the parameters $\mathbf{w}$ is well behaved:

$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}} \left( \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \right) &= \sum_{i=1}^N \; -\frac{y_{ki}}{f_{ki}} \nabla_{\mathbf{w}}f_{ki} \end{aligned} \end{equation}$$

Differentiating Equation 3 with respect to $\mathbf{w}_k$ gives:
$$\begin{equation} \nabla_{\mathbf{w}_k} f_{ki} = f_{ki}(1-f_{ki})\, \mathbf{x}_i \end{equation}$$

For a single data point $i$ with true class $k$, Equation 5 then simplifies to:

$$\begin{equation} \nabla_{\mathbf{w}_k} \left( -\log f_{ki} \right) = -(1-f_{ki})\,\mathbf{x}_i. \end{equation}$$

When our prediction is very wrong, i.e., $f_{ki}$ approaches zero, the gradient has a large magnitude, so $\mathbf{w}_k$ takes a large step along the negative gradient direction in the next iteration, which increases $f_{ki}$, exactly what we want.
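To make the update rule concrete, below is a hedged NumPy sketch of batch gradient descent on the loss of Equation 4.  Once the cross-class terms of the softmax are included, the gradient with respect to $\mathbf{w}_j$ takes the compact form $\sum_i (f_{ji} - y_{ji})\,\mathbf{x}_i$, which reduces to Equation 7 for the true class; the learning rate and iteration count below are arbitrary illustrative choices.

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)      # numerical stability
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)

def fit_multiclass_lr(X, Y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the cross-entropy loss of Equation 4.

    X: (N, d) data matrix; Y: (N, K) one-hot labels (Equation 1).
    For class j, the full gradient is sum_i (f_ji - y_ji) x_i, which
    reduces to Equation 7 for the true-class weight vector.
    """
    N, d = X.shape
    K = Y.shape[1]
    W = np.zeros((K, d))
    for _ in range(n_iter):
        F = softmax(X @ W.T)                  # (N, K) predicted probabilities
        grad = (F - Y).T @ X / N              # (K, d) gradient of the average loss
        W -= lr * grad                        # move against the gradient
    return W
```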

Readers unfamiliar with the Bayesian framework might intuitively propose minimizing the least-square error as an alternative loss function of the form:
$$\begin{equation} \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} = \sum_{i=1}^N \; \sum_{j=1}^K \; {( f_{ji} - y_{ji})}^2. \end{equation}$$

However, this form is inferior and should be avoided.  The reason is that the gradient of Equation 8 with respect to $\mathbf{w}_j$ can be derived, term by term, as:
$$\begin{equation} \nabla_{\mathbf{w}_j} L_{ji} = 2(f_{ji}-y_{ji})f_{ji}(1-f_{ji})\, \mathbf{x}_i. \end{equation}$$

This poses a serious issue.  Imagine the true label $y_{ji}$ is one, but our prediction $f_{ji}$ is very small.  The feedback $(f_{ji}-y_{ji})$ is attenuated by the very small factor $f_{ji}$, so the signal available to tune $\mathbf{w}$ in the next round is too weak.  The same is true when $y_{ji}$ is zero and our prediction is very wrong with $f_{ji}$ close to one.  Basically, when we make a very wrong prediction, i.e., $f_{ji}$ saturates near zero or one on the wrong side, the gradient becomes too weak and $\mathbf{w}$ can get stuck at the wrong value.  Therefore, for classification problems, we always use cross-entropy as the loss function and never use the least-square form.
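A quick numeric check (illustrative values only) makes the attenuation visible: for a confidently wrong prediction $f = 0.001$ with true label $y = 1$, the cross-entropy feedback of Equation 7 stays close to $-1$, while the squared-error feedback of Equation 9 nearly vanishes.

```python
f, y = 0.001, 1.0                  # a confidently wrong prediction (illustrative values)
cross_entropy_grad = -(1.0 - f)    # per-point gradient factor from Equation 7
squared_error_grad = 2 * (f - y) * f * (1 - f)   # per-point factor from Equation 9
print(cross_entropy_grad)          # ~ -0.999: strong corrective signal
print(squared_error_grad)          # ~ -0.002: signal attenuated by the small f(1-f) factor
```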

Multitask Classifier


There is a related problem called multitask classification.  For instance, for a given compound, we would like to predict whether it is active in a panel of $K$ assays [1].  Here the status is non-exclusive, i.e., a compound can be active in multiple assays.  We basically predict a target vector of $K$ elements, where each element $y_{ji}$ can be either 0 or 1, independently.  This can be solved by treating the task as constructing $K$ independent binary LR classifiers (Figure 2), where for the $j$-th classifier:

$$\begin{equation} \begin{aligned} s_{ji} &= \mathbf{w}_{j}\mathbf{x}_i, \\
f_{ji} &= \frac{1}{1+ e^{-s_{ji}}}. \end{aligned} \end{equation}$$

Figure 2. A neural-network-style diagram for the multitask LR.  The intermediate features $\mathbf{s}$ are a linear combination of the raw input features $\mathbf{x}$; each element of the $\mathbf{s}$ vector is then individually normalized by the logistic transformation into a probability of the corresponding class membership for the output.

The reason we mention the multitask classifier is that in many cases we do not apply LR to the initial input feature vectors $\mathbf{x}$, but instead apply a non-linear transformation $\mathbf{x} \to \mathbf{z}$ first and then do the linear classification (see Equation 4.23).  The multitask classification described here becomes a very effective learning solution when the $K$ tasks share the same underlying transformation function $\mathbf{z} = \phi(\mathbf{x})$, and $\phi$ is independent of the task.  In future deep learning chapters, we will learn that $\mathbf{z}$ is a new set of features learned from the raw input $\mathbf{x}$, and the $K$ tasks use the same feature-learning approach for different classification purposes (Figure 3).  By sharing the $\phi$ function, the learning of the optimal feature transformation can take advantage of the training data for all $K$ classes, instead of training a separate $\phi$ function per task.  The aggregated training data provide a better learning curve and can lead to a more robust $\phi$ and, therefore, more robust new features $\mathbf{z}$.

Figure 3. A neural-network-style diagram for the multitask LR with feature sharing.  The raw input features $\mathbf{x}$ are first transformed, typically nonlinearly, into a new feature space $\mathbf{z}$, which is then used for the multitask classification problem in exactly the same manner as in Figure 2.  The main difference is that the transformed feature $\mathbf{z}$ is shared by all tasks; therefore the $\mathbf{x} \to \mathbf{z}$ transformation can be trained using all training data points regardless of their task labels, instead of being trained independently, one per task.
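Below is a toy NumPy sketch (not the author's code) of the architecture in Figure 3: one shared nonlinear transform $\phi$ feeding $K$ independent sigmoid heads; the hidden width, the tanh nonlinearity, and the random weights are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))            # logistic transform of Equation 10

def multitask_forward(X, W_shared, W_tasks):
    """Forward pass of the shared-feature multitask LR in Figure 3.

    X: (N, d) raw features; W_shared: (h, d) weights of the shared transform phi;
    W_tasks: (K, h) one weight vector per task head.
    Returns an (N, K) matrix of independent per-task probabilities.
    """
    Z = np.tanh(X @ W_shared.T)                # shared nonlinear features z = phi(x)
    S = Z @ W_tasks.T                          # s_ji = w_j . z_i, one score per task
    return sigmoid(S)                          # each task normalized independently

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                    # 5 compounds, 8 raw descriptors (toy sizes)
P = multitask_forward(X, rng.normal(size=(16, 8)), rng.normal(size=(3, 16)))
print(P.shape)                                 # (5, 3): activity probability in 3 assays
```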


Support Vector Machines


SVM is designed to be a binary classifier, so there are typically two ways to apply SVM to multiclass classification [2].  The first choice is to build $K$ SVM classifiers, each trained on data from class $k$ against data from all the remaining classes, the so-called one-vs-all approach.  The class whose classifier places the data point furthest on the correct side of its boundary (i.e., the largest margin) is chosen as the predicted label.  The second choice is to build $K(K-1)/2$ pairwise SVMs between the data sets of every pair of classes; the class collecting the most winning votes is predicted, and this is called the one-vs-one approach.  If all $K$ training data sets are roughly the same size, the one-vs-all setup is still unbalanced (each classifier sees one class against $K-1$ classes), so one-vs-one might be preferred.  In addition, one-vs-one is typically faster to train due to the much smaller training size involved in each pairwise problem.  The popular library libsvm uses one-vs-one [3].

Here we instead apply SVM directly to the multiclass classification problem using a soft-margin SVM loss function.  The basic idea of soft-margin SVM is to penalize only those data points that are not far enough from a predefined margin $\Delta$.  Data points that are already correctly predicted and have a distance greater than $\Delta$ receive no penalty.  Data points that are even further away do not receive extra reward either.  With this in mind, we can write the loss function for a multiclass SVM classifier and train all classes in one shot, a.k.a. a "single machine" approach.  Assume $s_{ji}$ represents the score (signed distance) of data point $i$ with respect to the linear boundary of class $j$.  For data point $i$ with class label $y_i = k$, the prediction is clean if its score for class $k$, $s_{ki}$, is at least $\Delta$ greater than its scores for all of the remaining $K-1$ classes; otherwise, a penalty occurs [4]:

$$\begin{equation} \begin{aligned}\mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \;  L_{i} \\
&= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1, j \neq y_i}^K \;  \max(s_{ji} - s_{y_i i} + \Delta,\, 0) \end{aligned} \end{equation}. $$

It should be noted that the above is not the only loss form for multiclass SVM.  There are other forms, such as the one proposed in the original multiclass SVM paper [5].  However, in deep learning applications we rarely see multiclass SVM used, probably because it does not provide a probabilistic prediction and it is not as well documented; multiclass LR has become the method of choice.
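As a concrete illustration of Equation (11), here is a minimal NumPy sketch of the multiclass hinge loss; the margin value $\Delta = 1$ and the toy scores are illustrative assumptions.

```python
import numpy as np

def multiclass_hinge_loss(S, y, delta=1.0):
    """Average multiclass SVM loss of Equation (11).

    S: (N, K) score matrix with S[i, j] = s_ji; y: (N,) integer true labels;
    delta: the margin (1.0 is a common illustrative convention).
    """
    N = S.shape[0]
    correct = S[np.arange(N), y][:, None]          # s_{y_i i} for every record
    margins = np.maximum(S - correct + delta, 0.0) # max(s_ji - s_{y_i i} + delta, 0)
    margins[np.arange(N), y] = 0.0                 # the j = y_i term is excluded
    return margins.sum() / N

S = np.array([[3.2, 5.1, -1.7],                    # toy scores for two records
              [1.3, 4.9, 2.0]])
print(multiclass_hinge_loss(S, np.array([0, 1])))  # penalizes record 0, not record 1
```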

Reference



  1. https://arxiv.org/pdf/1406.1231.pdf
  2. http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf
  3. https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f419
  4. http://cs231n.github.io/linear-classify/
  5. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf

Thursday, November 2, 2017

What is Cloud Computing?

The Electricity Analogy

More than a century ago, if the acronym CEO had existed at the time, it would have stood for Chief Electricity Officer.  Electricity had become an essential ingredient of the manufacturing industry; however, because it was delivered as direct current, it could not be transmitted far before its energy dissipated in the metal wires.  Back in the 1800s, individual businesses built their own power generators.  Sitting next to a company, whether it was a steel mill or a factory, was its own power plant (Figure 1).  Only after Nikola Tesla championed alternating current at the end of the 1800s did long-distance electricity transmission become feasible, gradually leading to the modern electrical grid.  Today nobody worries about electricity generation; given a new electronic appliance, we simply plug and play.

Figure 1. Thomas Edison’s Pearl Street Station (New York City) electrified a one-square-mile area (yellow highlighted) of Lower Manhattan beginning September 4, 1882, providing power to illuminate his new electric light bulbs. The world’s first commercial electric power plant operated until it was destroyed by fire in 1890.  Upper-right is the Dynamo direct current generator used. (credit: https://www.embedded.com/print/4404090)
The analogy of Cloud Computing as the new Electrical Grid makes it far easier to understand.  The power generator used to be local, part of the power-consuming unit, and it gradually migrated to remote sites once electricity transmission technology enabled it.  Similarly, the CPU (computing resource) used to be mostly local, and it is now migrating to remote sites enabled by the Internet.  Resources in the Cloud are forming a gigantic computing grid, in a way that traces the historical trajectory of the formation of the electricity grid.  Here we will explain why Cloud Computing is different from your company's own data center and why it is not merely a fancier hype word for remote computing.

As far as it relates to computers, the term "cloud" dates back to early network design, when engineers would map out all the various components of their networks, but then loosely sketch the unknown networks (like the Internet) that theirs was hooked into.  What does a rough blob of undefined nodes look like? A cloud (Figure 2).  Imagine I am giving this Cloud Computing presentation in a company's conference room, with PowerPoint slides opened from a file hosted on a server sitting within the company's data center.  For the sake of understanding how the file was transferred to the presenting computer and how the video signal is sent to the overhead projector, the exact server and network architectures within the data center are irrelevant; therefore, we can simply represent our data center with a Cloud symbol.

Figure 2. The Cloud symbol is something that takes little skill to draw: a squiggly line formed into a rough ellipse.  Over time, clouds were adopted as the stand-in image for the part of a computer or telephone network outside one's own.

High Availability & Fault Tolerance

By using a network resource instead of a local device, data is available on multiple devices and remains available when some device or cloud component goes down, thanks to built-in redundancy (Figure 3).  However, these are just necessary conditions for the Cloud definition, not sufficient ones.  After all, we already gain these two benefits from our company's data center, yet our data center is not a Cloud.  Then what are the unique characteristics of the Cloud?

Figure 3. By storing data remotely in the Cloud, we gain high availability and fault tolerance. (credit: https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1)
Scalability

In this big data era, we are all facing an explosion of data; the only difference is the extent.  IT departments in all companies are busy managing constantly growing demands for larger storage capacity, more powerful databases, and consequently more computing power.

Imagine we are a small startup company with 3 servers supporting 1,000 users in 2016.  In 2017, our user base grew to 5,000 and we doubled our servers to 6.  Our projection for 2018 is 20,000 users, so we propose a budget to buy 9 more servers.  We need to write a convincing proposal, get capital approval, obtain quotes from multiple vendors and go through the procurement process, find or build additional rack space in our data center, wire the hardware, and install and configure the software ecosystem on these servers.  Even for the most efficient companies, this process can easily take months, on top of the large lump-sum capital investment.  Such scalability is what a company's data center cannot provide.  Using the electricity analogy, this headache exists because we are running our own power generator.

Figure 4. On-premise scale-up is costly both financially and time-wise.
(credit:https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1) 
What if we could go to a Cloud provider instead?  With a few clicks, we specify the CPU, memory, and storage requirements for our nine servers, as well as the operating systems we desire.  In a few minutes, the servers are up and running; better yet, we pay less than one dollar per hour for each of them, and no capital budget is required.  That is what Cloud Computing promises, and it fundamentally changes the way we think about and use computing resources, moving us toward the plug-and-play utility model.

Elasticity

Growth is always hard to predict accurately.  In the previous example, what if our user base grows not to 20,000 but only to 7,000 next year?  Eight of the nine servers would go to waste, which is especially undesirable because seed funding is particularly limited for our little startup.  Even if our prediction were accurate, the nine servers would only gradually come into full use toward the end of 2018.  So on-premise computing infrastructure cannot be sized down, and because the scalability limitation mentioned previously is often addressed by overshooting the resource needs, waste is hard to avoid.
Figure 5. On-premise scale-down is costly due to the waste of resources.
(credit:https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1) 
With our Cloud provider, since the overhead of getting a new server is so low, our server farm only scales up when we really need it to, thus avoiding the need to size down.

Let us give another example to illustrate the importance of elasticity.  Retailers such as Walmart earn most of their revenue in the last two months of the year.  If the in-house IT department purchases a battery of servers to accommodate the computing needs of busy November and December, those resources will sit idle from January until the next shopping season.  With a Cloud provider, the company can shut down and return any extra computing resources in January and immediately stop paying for them.

The Cloud provider can absorb the risk of scalability and elasticity through economy of scale.  Just as power plants form a giant electrical grid to balance power consumption, a Cloud provider leverages the fluctuations in the needs of its tenants and is in a much better position to provide or take back resources on demand.

What is Cloud?

Therefore, the Cloud can be viewed as a virtually infinite storage and computing resource, where we can size up and down as we wish and only pay for what is actually used.  It is just like our plug-and-play model of electricity usage.  The Cloud leads to efficiency by eliminating the operational overhead of scaling up, as well as significantly reducing cost by saving up-front investment and avoiding size-down waste.

However, Cloud Computing is not entirely plug-and-play yet.  The Electrical Grid has been polished over a century to provide a set of standard parameters: voltage and frequency.  Computing resources are much more complicated to define.  They rely on a software ecosystem, where we do not yet have a standardized platform: Linux, Windows, OSX, iOS, and Android all have their own advantages for specific applications, and there is no one-size-fits-all for the near future.  In addition, Cloud Computing is provided remotely via the Internet; network latency is significant not because of the speed of light, but because our data packets are relayed over dozens of network hops sitting between us and the Cloud provider.  When Cloud computing needs to be fueled by our own data (which is not the case for electricity), it probably does not make sense to transfer terabytes of data into the Cloud just for a short computation.

Last but not least is the pace of innovation.  Innovations in electricity generation and distribution happen on the scale of decades or centuries.  In contrast, Moore's Law is measured in months.  As the Cloud is largely a technology, not a science, what survives is not necessarily the most correct approach; usually the most popular one eventually becomes the standard.  The Cloud is at a nascent stage, which means many new technologies come and go, and constant learning is the key.  So which Cloud platform is winning at this point?  Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform are the three largest Cloud providers.  AWS pretty much dominates the market, especially in providing raw computing resources (IaaS, Infrastructure as a Service); therefore we will briefly explain the main concepts of AWS.

Amazon Web Services (AWS)

AWS groups its data centers into many geo-locations (Figure 6).  Each such location is a "region", e.g., there is a region in Northern California, one in Oregon, another in Virginia, etc.; each region is a circle on the world map.  For a company to use AWS, it should first choose a region in which to establish a virtual data center, called a Virtual Private Cloud (VPC).  Within a region, say Oregon, there are multiple physical data centers called Availability Zones (AZ); Oregon has three AZs, called a, b, and c.  Each AZ (data center) consists of racks and racks of computers, and AWS allocates these computers to its tenants and takes them back based on demand.  Our VPC is a logical concept: when first created, it contains no computers at all.  Only when we request one does AWS take a free instance and virtually place it into our VPC (it mostly just labels it in its resource management database).  To use a VPC, we first need to create a few data structures called subnets, in order to pre-allocate our IP address space and avoid future conflicts.  Each subnet has to live in one AZ (it cannot span AZs), and although we draw an orange box in Figure 6 to represent a subnet, a subnet actually consists of computers allocated by AWS on demand and scattered around within that AZ.  This is the Cloud, so we do not care where our computers physically are or how they are networked; it is sufficient to imagine we have one virtual data center consisting of a few sub-networks, each with its own routing and network security settings.  E.g., one subnet for our database servers, which do not need to talk to anything except the web servers, and another subnet hosting web servers that can talk to the database subnet and also to the Internet to process HTTP requests, etc.

In Figure 6, we create a VPC in Oregon and pre-allocate four subnets, each with a unique IP range holding up to 256 addresses (251 usable, to be precise, due to some reserved IPs), say 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24, and 10.0.3.0/24.  Logically, we can imagine that any computers we launch within our VPC are on one giant local area network (LAN) and can talk to each other as routing permits.
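For readers who like to see this concretely, a minimal, hedged sketch using the boto3 Python SDK might look like the following; the region, CIDR blocks, AMI ID, and instance type are placeholder assumptions, real usage requires AWS credentials, and the created resources should be cleaned up afterwards.

```python
import boto3

# Assumed region and CIDR layout matching the example above; adjust to your own account.
ec2 = boto3.client("ec2", region_name="us-west-2")                 # the Oregon region

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]   # our virtual data center

subnet_ids = []
for i, az in enumerate(["us-west-2a", "us-west-2b", "us-west-2c", "us-west-2a"]):
    resp = ec2.create_subnet(VpcId=vpc_id,
                             CidrBlock=f"10.0.{i}.0/24",           # 251 usable IPs each
                             AvailabilityZone=az)                  # a subnet lives in one AZ
    subnet_ids.append(resp["Subnet"]["SubnetId"])

# Launch one EC2 instance into the first subnet ("ami-xxxxxxxx" is a placeholder AMI ID).
ec2.run_instances(ImageId="ami-xxxxxxxx", InstanceType="t2.micro",
                  MinCount=1, MaxCount=1, SubnetId=subnet_ids[0])
```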

For data scientists, using AWS does put a burden on us to learn some basic skills that we used to delegate to our UNIX administrators.  Cloud Computing is most effective when we use a self-service model; otherwise, relying on a ticketing system handled by a Cloud administrator, we will again wait hours or days for a resource we could have obtained within minutes.  Cloud Computing is here to stay and it is definitely the future, so it is worthwhile for data scientists to learn the basics of AWS.

Some basic tasks include how to spin up a computer called an EC2 instance, how to "sudo" install the packages needed (say Python 3), and how to install modules (say with Python's "pip"), etc.  But do not be scared: to compile and/or install a software package, we have the freedom to specify the hardware and to choose the operating system and the release we need.  This is a luxury in-house UNIX administrators dream of having.  Furthermore, we can take advantage of pre-packaged software containers (such as Docker containers) or even use a whole Amazon Machine Image (AMI) to provision a machine ecosystem.  This is not an AWS usage tutorial; AWS is best learned through hands-on lab sessions, which are not covered in this short blog.

Figure 6.  The high-level architecture of a company's virtual data center in AWS.

We have explained what Cloud Computing is, and I hope you can now appreciate that the Cloud is not simply a buzzword; it brings us a data engineering revolution, just like going from running a power generator to relying on an electrical grid.

Reference

This information is compiled from a few resources:
  1. Video: AWS Concepts: Understanding the Course Material & Features
    https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1
  2. https://cacm.acm.org/magazines/2010/5/87274-cloud-computing-and-electricity-beyond-the-utility-model/fulltext
  3. https://www.theatlantic.com/technology/archive/2011/09/clouds-the-most-useful-metaphor-of-all-time/245851/
  4. https://www.quora.com/How-is-cloud-computing-analogous-to-electricity-grid

Sunday, June 11, 2017

40 Machine Learning Questions

There is a good list of 40 ML interview questions [1], which can be used to check our general ML knowledge.  Please visit the original blog for the Q & As; here we just classify them and provide additional comments as we gain better understanding ourselves.

Contents covered by the DSB book: Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking (Foster Provost, Tom Fawcett):
  • Statistics: Q3, Q13, Q17, Q19, Q20, Q30
  • Machine Learning Techniques: Q4, Q5, Q11, Q12, Q21, Q22, Q26, Q28, Q29, Q33 (disagree with answer), Q37
Contents related to special topics not covered in DSB:
  • Linear Model and PCA: Q2, Q7, Q10, Q14, Q15, Q32
Contents overlapped with our notes:
  • What is Machine Learning: Q8, Q34
  • Learning Theory: Q16
  • Regularization & Cross Validation: Q6, Q9, Q23, Q24, Q27, Q31, Q36, Q38, Q39, Q40
  • Linear Models: Q1, Q18, Q35
  • SVM: Q25

Reference

Saturday, May 27, 2017

Chapter 6. The Path Forward

Review


For a machine to learn from data, it explores a set of hypotheses $\mathcal{H}$ provided by us and identifies the best hypothesis $g$ based on the training data set $\mathcal{D}$, which contains $N$ data records.  The upper confidence bound on the performance of the solution on test sets, $E_{out}$, differs from its within-sample performance $E_{in}$ (Figure 6.1):
$$\begin{equation} \begin{aligned}
E_{out} &= E_{in} + E_{reg} \\ &\approx E_{in} + \epsilon\left(\sqrt{\frac{\ln M}{N}} \right), \\
&\approx  E_{in} + E_{reg}(\lambda).
\end{aligned} \end{equation}$$
where $M$ is the effective number of independent hypotheses and $\lambda$ is employed to adjust this number.

Figure 6.1.  What we care about is $E_{out}$, i.e., the upper bound of the true performance at a high confidence level $1-\delta$, where $\delta$ is usually some small value.  To minimize $E_{out}$, we should no longer minimize only $E_{in}$ but also the confidence interval $\epsilon$, which depends on $N$ and $M$.  Increasing the sample size $N$ reduces $\epsilon$; shrinking the hypothesis space, i.e., reducing $d_{vc}$ and the effective hypothesis count $M$, also reduces $\epsilon$.  As the functional form of $\epsilon$ is generally unknown, we typically model $E_{reg}$ with a penalty term to continuously adjust the effective size of our hypothesis space.

To minimize $E_{in}$, we need to employ sophisticated models that are close to the target function $f$, which enlarges the hypothesis search space.  To minimize $E_{reg}$, we need to either reduce the effective model count $M$ or increase the sample size $N$, while the latter is often not an option.  From the Bayesian camp's point of view, $E_{in}$ is the evidence carried by the data set, $E_{reg}$ is our prior knowledge, and $\lambda$ tunes our stubbornness in learning from data.

To balance the two opposing requirements on model complexity, we modulate $M$ by adjusting a hyper-parameter $\lambda$.  The optimal value of $\lambda$, $\lambda^*$, can be determined empirically by a cross-validation protocol, where the estimated $E_{out}$ score is minimized.  $\lambda^*$ subsequently determines the model $g$, which is expected to achieve the optimal $E_{out}^*$.  This is basically the core principle of learning theory introduced in Chapters 2 & 3.

Standing on top of the learning theory, we showed how linear or non-linear models can be applied to both classification and regression problems, relying on a hyper-parameter $\gamma$ to modulate the complexity of the hypothesis space.  The resultant model space can be as simple as including only linear models or as sophisticated as including higher-order polynomial models.  SVM relies on only a few support vectors and therefore has far fewer effective parameters than other models, which gives it extremely robust generalization properties.

Bayesian and Graphical Models


We have covered a wide range of learning theory for ML in this note; nevertheless, there is an enormous amount of ML knowledge yet to be studied on our own.  For instance, we have not covered ensemble methods, where multiple ML models are pooled together to form a decision committee.  There are examples where a high-quality committee can be formed from rather crude individual models.  For this reason, people generally do not compare non-ensemble methods to ensemble methods directly [1].  We should read about additional ML topics, and one good resource is the excellent Machine Learning Techniques lecture [2] (in Chinese, though).

We also did not cover graphical models, or cases where the input is not a well-structured feature matrix $X$.  For example, when the ${\rm Xbox\ Kinect}^{TM}$ classifies a player's movement as either a kick or a jump, the input is a series of images (poses) with a varying number of frames.  Using a hidden Markov model to describe the transitions between poses would be more effective [3].  For problems based on complex inputs such as computer vision, voice recognition, and natural language processing, techniques such as SVM no longer work well.  So one future direction is to tackle the learning problem from a Bayesian or graphical approach.  In the regularization discussion, we mentioned how the Bayesian approach sidesteps the over-fitting topic and replaces it with a prior belief, so ridge regularization is basically a Gaussian prior and Lasso regularization is basically a Laplace prior.  To proceed along this direction of Bayesian learning, a good book might be "Pattern Recognition and Machine Learning" by Christopher Bishop.  To study graphical models, a good but challenging online course might be the "Probabilistic Graphical Models" Coursera series by Daphne Koller (although I do not recommend the accompanying book; use the Bishop book instead).

Figure 6.2.  An action consists of a series of states described by a hidden Markov model (HMM).  In this example, a low-kick action consists of three state transitions: from stand to foot-lifted, to kick, and then back to stand.  At each time point, the body position $X(t)$ is observed as a sample from its underlying state $S(t)$.  The goal is to predict the action label $A$ (low-kick, high-kick, sit, jump, etc.) based on a time series of observations $X(t)$.  One would need to maximize the likelihood: $p(X|A=k) = \sum_{S_k(t_0), S_k(t_1), \ldots} \; \left\{ p(S_k(t_0))\,p(X(t_0)|S_k(t_0)) \prod_{i=1}^{N} \; p(S_k(t_i)|S_k(t_{i-1}))\,p(X(t_i)|S_k(t_i)) \right\}$.
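For readers curious how such a likelihood is evaluated without enumerating every state sequence, here is a small NumPy sketch of the standard forward algorithm for a discrete-observation HMM; the transition, emission, and initial probabilities below are made-up toy values, not parameters of any real Kinect model, and in practice one would compare this likelihood across the candidate actions and pick the largest.

```python
import numpy as np

def hmm_likelihood(obs, pi, A, B):
    """Forward algorithm: p(X | action) for one candidate action's HMM.

    obs: sequence of observed symbol indices X(t); pi: (S,) initial state probabilities;
    A: (S, S) transition matrix p(S_t | S_{t-1}); B: (S, V) emission matrix p(X_t | S_t).
    The sum over all state paths in the Figure 6.2 formula is done recursively.
    """
    alpha = pi * B[:, obs[0]]                    # p(S_0) p(X_0 | S_0)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]            # marginalize previous state, emit next symbol
    return alpha.sum()

# Toy 3-state model (stand, foot-lifted, kick) with 2 observable pose symbols.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
B = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.2, 0.8]])
print(hmm_likelihood([0, 1, 1, 0], pi, A, B))    # likelihood of this pose sequence
```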


Deep Learning for Feature Learning


The learning theory above also depends heavily on the quality of the features in the $X$ matrix.  Oftentimes, informative features are very hard to obtain.  In cellular imaging, one typically has to carry out a challenging segmentation task (a computer vision task) to identify individual cells, then apply programs such as CellProfiler [4] to compute several hundred features per cell.  These are custom-designed features describing the size, intensity, texture, and morphology of cells.  Some well-known image features, such as the Scale-Invariant Feature Transform (SIFT) [5], required decades of hard research and may still not be very effective.  Having the machine do feature learning, i.e., automatically construct informative feature vectors, is therefore a very useful area of research.  Feature learning can also be framed as the machine finding a mapping function $f: X \to Z$.  In this regard, Deep Learning (DL) has a great advantage; in fact, some experts consider DL equivalent to feature learning, because its Convolutional Neural Network (CNN) is particularly adept at extracting a hierarchical set of features ranging from rudimentary edges and corners to complex elements like wheels and faces.  So another future path is Deep Learning.  To that end, a good book is DLB, and a good course is probably "Neural Networks for Machine Learning" by the DL godfather Geoffrey Hinton on Coursera.
Figure 6.3. Deep Learning can automatically learn features, without experts going through a tedious manual feature construction process.  Here a DL network extracts lower-level features such as edges and corners, then gradually builds up higher-level features representing eyes, noses, and finally faces. [https://s3.amazonaws.com/datarobotblog/images/deepLearningIntro/013.png]

Data Engineering: Big Data and Cloud


The learning theory we covered here obviously has a very strong emphasis on over-fitting: "the ability to deal with over-fitting is what separates professionals from amateurs in the field of learning from data," according to LFD [6].  Based on the learning curve and VC theory, over-fitting is no longer a concern if one has lots of data.  When $N$ is 100 or 1000 times larger than the number of model parameters, $E_{in}$ and $E_{out}$ will be very close.  However, keep in mind that although we live in a big data era, only some problems have the luxury of big training data.  The majority of the problems we deal with as data scientists outside the Internet-data companies (Facebook, LinkedIn, Google, Amazon, etc.) are still very much under the constraint of limited $N$, and all the theories we have discussed in this note are crucial.  The word "big" data can be very misleading.  In biology, a next-generation sequencing (NGS) data set could be a few GB, but it is not big, due to its lack of complexity: such a data file can be post-processed into sequence counts for the twenty thousand human genes.  If we treat each NGS sample as a record and each of the twenty thousand genes as a feature, one thousand NGS samples (terabytes of data) still do not qualify as a big data problem.  You can image a cat at enormous resolution and generate a huge picture file, but the file size does not make this big data.  Big data is not just about the volume of data.

Nevertheless, let us pretend we are data scientists at Google for a moment, trying to solve a real big data problem with essentially infinite $N$ in our hands; our worry is then no longer over-fitting.  We have to worry about how to store our training data (a big data storage problem), how to carry out the computation to find the optimal hypothesis in $\mathcal H$ (a big data computation problem), and how to retrieve particular future records efficiently for prediction (a big data transaction problem).  ML then becomes a real engineering challenge, and we rely more and more on data engineering technologies and less and less on data science knowledge.  Infinite $N$ means the variance term in $E_{out}$ is no longer the issue; the performance bottleneck is the bias term.  To reduce bias, one should introduce more sophisticated models.  This is why DL, with tens of thousands, millions, or billions of model parameters, is only possible when, first, there is a nearly infinite amount of training data and, second, there is a vast computational infrastructure, enabled by both big data and GPU hardware.  Therefore, we see that big data has strong ties to Deep Learning and to data engineering.  Please bear in mind that when DL is applied to small data sizes, we will see jargon such as regularization and over-fitting mentioned again and again in the DLB; so all we have learned in this note will not be wasted in your DL journey.

Data engineers study the three big data problems raised above: storage, computation, and transaction.  Due to the engineering nature of such challenges, the result is not a new set of theory but a set of technical solutions, or more precisely many alternative technical systems, where the industry standard is practically determined by whichever has the dominant user base.  The answer to big data storage originated from the Hadoop File System (HDFS), an open source implementation of the Google File System (GFS).  The answer to big computation was also first proposed by Google: its Map-Reduce (MR) framework.  The answer to big data transaction is the numerous NoSQL systems (Google's Bigtable is a pioneer).  In the early days, one had to establish local big data infrastructures using these components, most of which are fortunately open source and free.  Nowadays, we are more likely to focus on their utility and let Cloud providers such as Amazon Web Services (AWS) take care of setting up the infrastructure for us.  In this respect, big data and the Cloud go hand-in-hand.  Amazon's solution to big storage is S3, to big computation is an EC2 cluster or EMR, to big transaction is DynamoDB, etc.  So our third future development path is data engineering/big data engineering/cloud engineering; I consider them near synonyms due to their tight connections.  They emphasize programming and automation, compared to data science's emphasis on statistical skill sets.  In reality, people tend to confuse data science with data engineering; hopefully there will be more clarity as the fields develop further.
Figure 6.4.  As $N$ grows to $\infty$, over-fitting is no longer a concern and $E_{out}$ basically equals $E_{in}$.  Our focus then switches from the science topics to practical engineering challenges: big data storage, computation, and transaction.  Currently Amazon Web Services (AWS) is the most popular big-data platform in the Cloud, so we do not have to set up such an environment on our own.

Summary


In this note, we focused on how a machine can select the best hypothesis from a hypothesis set $\mathcal H$, given a limited number $N$ of input training records $X$.  We need to take into account both the within-sample error $E_{in}$ and the generalization error (determined by the VC dimension) to avoid under-fitting and, more importantly, over-fitting.  An effective way to determine the optimal balance of the two error terms is hyper-parameter selection based on cross validation.  Support Vector Machines are a huge success of machine learning theory, combining all these concepts.  The Gaussian kernel provides a continuous hypothesis space ranging from linear to high-order polynomial hypotheses, while still preserving decent generalization capability.  Once we have a solid theoretical foundation, we are ready to understand many machine learning techniques.  An excellent online course is "Machine Learning Techniques" on YouTube [2].


Reference

  1. DLB, Page 251: Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.
  2. Machine learning technique video (in Chinese) by Hsuan-Tien Lin.
    https://www.youtube.com/watch?v=A-GxGCCAIrg&index=1&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
  3. Probabilistic Graphical Models, Coursera online course by Daphne Koller.  Part III. PGM Programming Assignment: Learning with Incomplete Data.
  4. http://cellprofiler.org
  5. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  6. LFD, page 119

Friday, May 26, 2017

Chapter 5. Support Vector Machines

We have talked about VC dimension theory and explained how to improve the regularization property by reducing the hypothesis space $\mathcal{H}$.  We have also talked about linear models and how to take advantage of the linear model framework to learn with non-linear models.  Building on the foundation of all the theories we have covered so far, a very impressive machine learning technique called Support Vector Machines (SVM) was invented.  For problems of limited data set size, SVM has routinely been shown to outperform other learning algorithms, so if you can only choose one machine learning model, you generally will not regret going with SVM.  Only in recent years has SVM been surpassed by deep learning models, and only when a large data set is available.  As this note only considers small data sets, SVM, for us, is the king of the castle.

Maximum Margin Linear Classifier


SVM has a built-in regularization property!  It relies on the intuitive idea of a maximum classification margin.  Let us only consider the case where the $N$ input data points in $\mathcal{D}$ are linearly separable, i.e., there exists a hyperplane that can classify all $N$ points without error.  Although there are many solutions that fit the bill (Figure 5.1a), SVM says we should go with the one that has the maximum margin, i.e., the fattest decision plane, with its two margin planes called boundaries (Figure 5.1b), as this one is most robust against measurement uncertainty.  Why is that?  Each data point $\mathbf{x}$ in the real physical world carries some degree of uncertainty.  When we measure the height of a person, uncertainty comes from the tool we use for the measurement, from the height itself varying during different times of the day, from whether the measuring protocol is shoes-on or shoes-off, etc.  So it is best to assume each $\mathbf{x}$ is only an estimate of its true value; in other words, the true value lies within a halo around $\mathbf{x}$.  If we let the halo expand, eventually the true value will cross the linear separating plane, meaning the point will be misclassified.  The plane with the maximum margin can handle the largest halo size and therefore accommodates the largest degree of uncertainty in $\mathbf{x}$.  Figures 5.1c & d compare the error tolerance of a small-margin classifier and the maximum-margin classifier; the latter is more robust against error in $\mathbf{x}$.
Figure 5.1. The blue decision plane is the fattest one we can find for this example (a & b).  It can tolerate the largest amount of uncertainty in input data points $\mathbf{x}$ (c & d).

The maximum margin concept is also supported by VC theory.  Given $N$ data points, suppose we try to place a separating plane with a certain margin size.  We can see in Figure 5.2 that the larger the margin, the smaller the possible space the separator can occupy, i.e., the smaller the hypothesis space and therefore the smaller the effective VC dimension.  If a few margins are sorted ascendingly, $m_1 \subset m_2 \subset \ldots \subset m_n$, then their VC dimensions satisfy $d_1^{vc} \ge d_2^{vc} \ge \ldots \ge d_n^{vc}$ [1].  Among all the linear separating planes, the one with the maximum margin has the best regularization property.
Figure 5.2. A smaller-margin decision plane, outlined in green, can occupy a larger hypothesis space, compared to the maximum-margin decision plane outlined in blue.


SVM Engine


We here only consider the case where $N$ data points are linearly separable.  A linear separating plane is the model:
$$\begin{equation} y = {\rm sign}(\mathbf{w} \cdot \mathbf{x} + b). \end{equation}$$
For $\mathbf{x}$ on the separating plane, $\mathbf{w} \cdot \mathbf{x} + b = 0$.  For other points, $y$ is either $+1$ or $-1$.  Notice that here $\mathbf{x}$ does not contain the constant element 1 in its first dimension; we explicitly write $\mathbf{w}_0$ as $b$, so the math is simpler this way.  If our boundaries yield the maximum margin, they must touch at least two data points, one on each side; otherwise, the margin cannot be the maximum.  Since we can always scale $b$ and $\mathbf{w}$ by a positive factor without changing the classification results given by Equation (1), we choose the scaling factor such that $\mathbf{w} \cdot \mathbf{x} + b = -1$ and $+1$ for the data points touching the two boundaries.

In Figure 5.3, $\mathbf{x}^+$ is on the +1 boundary plane, $\mathbf{x}^-$ is on the -1 boundary plane, we have:

$$\begin{equation} \left\{ \begin{aligned} \mathbf{w} \cdot \mathbf{x}^+ + b & = 1, \\ \mathbf{w} \cdot \mathbf{x}^- + b & = -1. \end{aligned} \right. \end{equation}$$
Figure 5.3.  Illustration on how to calculate the size of a margin.

The thickness of the margin equals the projection of $\mathbf{x}^+ - \mathbf{x}^-$ (Figure 5.3, green arrow) along the normal direction of the planes ($\mathbf{w}$ in Figure 5.3):
$$\begin{equation} \begin{aligned} {\rm margin} &= (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\Vert{\mathbf{w}}\Vert} \\ &= \frac{2}{\Vert{\mathbf{w}}\Vert} \end{aligned} \end{equation}. $$
The second step uses Equation (2).  Therefore, maximizing the margin is equivalent to minimizing $\mathbf{w}^\intercal\mathbf{w}$.   Recall that in linear models, $\mathbf{w}^\intercal\mathbf{w}$ has been used as a regularization term to describe the complexity of a linear hypothesis (with $w_0$ already excluded); therefore, we say SVM has built-in regularization, as it prefers simpler hypotheses.  Be aware that having anti-over-fitting built in does not mean SVM is free from over-fitting; it will over-fit if we run a 100-dimensional SVM on 10 data points!  SVM is simply over-fitting-aware and much more robust against over-fitting than other ML techniques.

To solve the maximum-margin decision plane problem is to solve:
$$\begin{equation} \min_\mathbf{w} \; \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} \end{equation},$$
with the constraints:
$$\begin{equation} y_i (\mathbf{w}^\intercal\mathbf{x}_i + b) \ge 1. \quad i = 1, 2, \ldots, N \end{equation}$$
We normally minimize $E_{out}$; however, Equation (4) only contains the regularization term, so where is $E_{in}$?  Here we consider the linearly separable case, and therefore we are only interested in solutions where all training data points are correctly classified, i.e., $E_{in} = 0$!  How do we make sure we only consider $\mathbf{w}$ and $b$ where all data points are correctly classified?  The answer is Equation (5), as the constraints mean all data points are on the correct side of the margin and all lie on or outside the boundaries.  The factor $\frac{1}{2}$ is just for convenience.  In summary, Equation (5) makes sure we only consider decision hyperplanes for which all training data are on the correct side, and Equation (4) makes sure we choose among such candidates the one that maximizes the margin.

The above constrained problem can be solved by introducing Lagrangian multipliers, i.e., the problem is equivalent to:
$$\begin{equation} \min_{\mathbf{w}, b} \; \max_{\alpha_i \ge 0} \;  \frac{1}{2}\mathbf{w}^\intercal\mathbf{w} + \sum_{i=1}^{N} { \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b)}.  \end{equation}$$
Let us try to understand why the Lagrangian multiplier method works.  Suppose we consider an invalid solution candidate $(\mathbf{w}, b)$, i.e., at least one data point, say $i$, falls inside the two boundaries, so $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \gt 0$; then $\alpha_i$ would go to $+\infty$ in order to maximize the new function.  The function value would then be $+\infty$ for such an invalid $(\mathbf{w}, b)$; therefore, invalid solution candidates have no chance to win, as there are other solutions that give finite function values, and invalid solutions are naturally excluded from the solution pool.  So the basic idea is to cast our original function into a new form, so that any $(\mathbf{w}, b)$ violating the constraints has zero chance to be our final solution (Figure 5.4).
Figure 5.4.  The basic idea of Lagrangian multipliers is to convert the original problem, minimizing the function on the left with a constraint $\mathbf{x} \in [lb, ub]$, into an unconstrained minimization problem for a new function on the right.  The introduction of $\alpha$ helps construct a new function whose value is $+\infty$ wherever $\mathbf{x}$ is outside the constrained region; therefore, there is no chance for a bad $\mathbf{x}$ to be the final solution.  In our example, the new function lives in a higher-dimensional space, including additional $\alpha$ dimensions.

The winning solution to Equation (6) naturally satisfies the following constraints:
$$\begin{equation} \alpha_i (1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) = 0. \end{equation}$$
Why is that?  We already proved that the solution to Equation (6) cannot violate Equation (5), so no data point can fall within the boundaries of our solution.  For data points outside the boundaries, $(1- y_i \mathbf{w}^\intercal\mathbf{x}_i - y_i b) \lt 0$; therefore, any $\alpha_i \gt 0$ would decrease the value of the function in Equation (6), so $\alpha_i = 0$ is the best-case scenario for the $\max_{\alpha_i}$ operator.  Therefore, in our final solution, only those data points that fall exactly on the decision boundaries can have non-zero $\alpha_i$ values; they are the ones supporting the decision plane and are therefore called support vectors.  All other, non-support-vector data points must have $\alpha_i = 0$.

To solve Equation (6), mathematicians have proven that we can swap the order of min and max; therefore, we take partial derivatives with respect to $b$ and $\mathbf{w}$ and set them to zero:
$$\begin{equation} \sum_{i=1}^N \; \alpha_i y_i = 0. \end{equation} $$
$$\begin{equation} \mathbf{w} = \sum_{i=1}^N \; \alpha_i y_i \mathbf{x}_i \end{equation}$$
Equation (8) implies that each positively-labelled support vector contributes a positive force $\alpha_i$ and each negatively-labelled support vector contributes a negative force; all forces add to zero and are in balance.  So there must be at least two support vectors, one positive and one negative.  Equation (9) implies that the optimal normal vector $\mathbf{w}^*$ is basically a linear combination of the $N$ data vectors, which is familiar from our discussion of linear models in Chapter 4, Equation (18).  The difference is that we only count the contributions from the support vectors, as all other points have $\alpha_i$ equal to zero.

Plugging Equations (8) and (9) back into Equation (6), we obtain a constrained minimization problem in the variables $\alpha_i$:
$$\begin{equation} \min_{\alpha_i \ge 0} \;  \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \mathbf{x}_i^\intercal\mathbf{x}_j - \sum_{i=1}^{N} { \alpha_i }, \\   \text{subject to} \quad \sum_{i=1}^N \; y_i \alpha_i = 0. \end{equation}$$
We will not go further into the math of solving for the exact values of the optimal $\alpha_i^*$; that is a Quadratic Programming (QP) problem already solved by great mathematical minds.  What is important is that our analysis above shows the solution of the maximum margin problem must rely on the identification of a few support vectors among the $N$ input data points.  These support vectors determine the separation boundary; only they have non-zero $\alpha_i^*$, and all other data points have no contribution to the solution.  We use Equation (10) to find $\alpha_i$ and Equation (9) to solve for $\mathbf{w}^*$.  $b^*$ can then be calculated using any support vector on a boundary, say point $\mathbf{x}_s$, as:
$$\begin{equation} \begin{aligned} b^* &= y_s - \mathbf{w}^{*\intercal}\mathbf{x}_s \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \mathbf{x}_i^{\intercal}\mathbf{x}_s. \end{aligned} \end{equation} $$
The second step comes from Equation (9), and $y_s$ is either +1 or -1, depending on which boundary $\mathbf{x}_s$ lies on.
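To make Equations (9), (10), and (11) concrete, here is a hedged sketch that feeds the dual problem to a generic quadratic-programming solver (cvxopt, assumed to be installed) and then recovers $\mathbf{w}^*$ and $b^*$; it assumes the data are truly linearly separable and uses an arbitrary $10^{-6}$ threshold to identify the support vectors.

```python
import numpy as np
from cvxopt import matrix, solvers   # generic QP solver, assumed installed

def hard_margin_svm(X, y):
    """Solve the dual of Equation (10) and recover w*, b* via Equations (9) and (11).

    X: (N, d) NumPy data matrix; y: (N,) labels in {-1, +1}; assumes linear separability.
    """
    N = X.shape[0]
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T)                            # P_ij = y_i y_j x_i . x_j
    q = matrix(-np.ones(N))                          # the -sum(alpha_i) term
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))   # alpha_i >= 0
    A, b = matrix(y.reshape(1, -1).astype(float)), matrix(0.0)  # sum_i y_i alpha_i = 0
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    sv = alpha > 1e-6                                # only support vectors have alpha_i > 0
    w = (alpha * y) @ X                              # Equation (9)
    b_star = y[sv][0] - X[sv][0] @ w                 # Equation (11), using any support vector
    return w, b_star, alpha
```

In practice one would simply use a library such as libsvm or scikit-learn, which wraps a specialized solver for exactly this problem; the sketch is only meant to show how the equations map onto a QP.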

Cross Validation of SVM


The observation that the SVM solution depends only on a few support vectors leads to a very interesting result regarding its accuracy.  Let us estimate the accuracy of SVM using leave-one-out cross validation (LOOCV): each time we remove one data point and use it as the validation set.  If there are $k$ support vectors, then as long as the removed data point $\mathbf{x}_i$ is not a support vector, i.e., $\alpha_i = 0$ in the solution of the original problem, the solution will not change.  We can also see this from Equation (10): the solution $\alpha$ of the original problem remains feasible for the new problem with point $\mathbf{x}_i$ removed.  Conversely, if the solution of the new problem were not the solution of the original problem, $\mathbf{x}_i$ would have to fall within the new decision boundary, and then $\mathbf{x}_i$ should have been a support vector of the old problem (the exact proof is left to be worked out).  Therefore, the optimal solution $\alpha^*$ of the original problem must also be the optimal solution when a non-support-vector data point $\mathbf{x}_i$ is omitted.  Since we are dealing with the separable case, $y_i$ is already correctly predicted by our decision plane, so the validation error for such cases is zero.  Only when the validation data point is a support vector might the decision plane change and the validation point be misclassified.  Therefore, the LOOCV error for a separable SVM model is bounded by:
$$\begin{equation} E_{cv} \le \frac{k}{N}. \end{equation}$$
In most problems, the number of support vectors $k \ll N$, so the performance of SVM is generally very impressive, even for high-dimensional data inputs.  Because even in high dimensions SVM minimizes $\Vert \mathbf w \Vert$ and thereby constrains the hypothesis search space to a small region, the resultant number of support vectors $k$ grows only slowly as the dimension increases; therefore, SVM is generally pretty robust.
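A quick way to see the bound of Equation (12) in practice is the hedged scikit-learn sketch below, which approximates the hard-margin case with a very large penalty parameter C on a toy separable data set and compares the leave-one-out error with the support-vector fraction $k/N$.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# Two well-separated Gaussian blobs: a toy, (almost surely) linearly separable problem.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=1e6)                 # huge C approximates hard-margin SVM
loo_error = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

clf.fit(X, y)
bound = clf.n_support_.sum() / len(y)             # k / N from Equation (12)
print(f"LOOCV error = {loo_error:.3f}, k/N = {bound:.3f}")
```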

The linearly separable case is called hard-margin SVM.  In the most general case, data points are not linearly separable, and we have to allow points to fall into the sandwiched margin space; this is the soft-margin SVM problem.  The scoring function includes a penalty term for these points.  We will not go into the details, as the logic and the form of the solution are quite similar to the hard-margin SVM problem, except that there is a penalty hyper-parameter.  All points that fall inside the margin become bounded support vectors, and they contribute equally, with the same $\alpha$ value.  Points on the boundaries are still margin support vectors that contribute varying $\alpha$ values.  Points beyond the margin have zero contribution.  In real applications, we always use soft-margin SVM, never hard-margin SVM.  Although we do not discuss soft-margin SVM in detail, because all the conclusions on hard-margin SVM apply to it as well (it is not clear whether the LOOCV result applies, though), it is an important topic to study, for example using LFD supplementary Chapter 8 [2].

Non-linear SVM and Kernel


Following the same non-linear idea outlined in Chapter 4, we can transform the $X$ space into a $Z$ space using a non-linear transformation:
$$\begin{equation} \mathbf{z} = \phi(\mathbf{x}) \end{equation}.$$
Then in the $Z$ space, we assume a maximum margin linear model would work well, as all non-linearity has been absorbed by the $\phi(\mathbf{x})$ transformation function.

We then run the SVM solution in $Z$ space and obtain our solution for the problem:
$$\begin{equation}
\min_{\alpha_i \ge 0} \;  \frac{1}{2} \sum_{i=1}^N \; \sum_{j=1}^N \; y_i y_j \alpha_i \alpha_j \phi(\mathbf{x}_i)^\intercal\phi(\mathbf{x}_j) - \sum_{i=1}^{N} { \alpha_i }, \\  \text{subject to} \quad \sum_{i=1}^N \; y_i \alpha_i = 0. \end{equation}.$$
The solution $\alpha^*$ enables us to find the linear separation plane in the $Z$ space as:
$$\begin{equation} \begin{aligned}
\mathbf{w}^* &= \sum_{i=1}^N \; \alpha_i^* y_i \mathbf{z}_i \\
&= \sum_{i=1}^N \; \alpha_i^* y_i \phi(\mathbf{x}_i)  \\
b^* &= y_s - \mathbf{w}^{*\intercal}\phi(\mathbf{x}_s) \\ &= y_s - \sum_{i=1}^N \; y_i \alpha_i^* \phi(\mathbf{x}_i)^{\intercal}\phi(\mathbf{x}_s).
\end{aligned} \end{equation}$$
Then for a given new data point $\mathbf{x}_{new}$, our prediction will be based on:
$$\begin{equation} \begin{aligned}
y &= {\rm sign } ( \mathbf{w}^{*\intercal}\phi(\mathbf{x}_{new}) + b^*) \\
&= {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new}) + y_s \; - \sum_{i=1}^N \; y_i \alpha_i \phi(\mathbf{x}_i)^{ \intercal } \phi(\mathbf{x}_s)\right).
\end{aligned} \end{equation}$$
We basically project all $\mathbf{x}$ and $\mathbf{x}_{new}$ into the $Z$ space, then take dot products in the $Z$ space to obtain the classification prediction.  In typical cases, if we need to project $X$ into a high-order polynomial space, the dimensionality of the $Z$ space grows very quickly, and computing the $Z$ coordinates and their dot products becomes very expensive.  However, the reason we listed the long Equations (14) and (16) is not the math itself, but rather to point out a very important observation: both Equation (14) for $\alpha$ and the final classifier in Equation (16) depend on $\mathbf{x}$ only through dot products: $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j)$, $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_{new})$, and $\phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_s)$.  If somehow we are able to efficiently compute dot products among all pairs of $\phi(\mathbf{x})$, we can keep the cost of setting up the SVM problem at $O(N^2)$ dot products regardless of the dimensionality of $Z$.   If we choose the transformation $\phi$ in such a special way that its dot product in the high-dimensional $Z$ space can be calculated by some function, called a "kernel function", evaluated in the lower-dimensional $X$ space, we are in business, i.e.,
$$\begin{equation} K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^{\intercal} \phi(\mathbf{x}_j) \end{equation}.$$
Be aware that $\phi(\cdot)$ is a vector, not a scalar.  Once we have $K(\cdot, \cdot)$ defined, we can use the kernel values to solve for $\alpha$ and calculate $y_{new}$ for $\mathbf{x}_{new}$ according to Equation (18) below, without ever leaving the $X$ space.  If you insist on carrying out the $X$ to $Z$ transform, solving for $\alpha$, and then doing dot products in the $Z$ space, you would end up with exactly the same results, only wasting more memory to hold the higher-dimensional coordinates and CPU cycles to calculate the dot products.
$$\begin{equation}
y = {\rm sign } \left(\sum_{i=1}^N \; \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_{new}) + y_s - \sum_{i=1}^N \; y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}_s)\right).
\end{equation}$$
The kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ is the dot product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$ in the $Z$ space, so it carries the meaning of the similarity between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, and it should be symmetric and positive semi-definite.  It is theoretically guaranteed that as long as $K$ is symmetric and positive semi-definite, there always exists a mapping $\phi: X \to Z$ whose dot product in $Z$ is exactly what $K$ computes.  We will not attempt to prove this (it is Mercer's theorem); the practical consequence is that we have great freedom in designing similarity measures (e.g., based on Pearson, Manhattan, or Tanimoto distances), as long as the resulting kernel matrix satisfies these conditions.  We will use the Gaussian kernel next to do a deep dive into how kernel tricks enable us to introduce non-linearity, while remaining computationally efficient.
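To make the kernel trick concrete, here is a minimal sketch (my own toy example, assuming NumPy and scikit-learn are available) that computes a Gaussian kernel matrix entirely in the $X$ space and hands it to an SVM solver as a precomputed kernel, so that no $Z$-space coordinates are ever materialized:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy 2-class data: +1 inside a circle, -1 outside (not linearly separable in X space).
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)

def rbf_kernel_matrix(A, B, gamma=2.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2), computed entirely in X space."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

K_train = rbf_kernel_matrix(X, X)        # N x N kernel matrix, an O(N^2) object
clf = SVC(kernel="precomputed").fit(K_train, y)

# Prediction needs only the kernel values of the new point against the training points.
x_new = np.array([[0.1, 0.2]])
K_new = rbf_kernel_matrix(x_new, X)      # shape (1, N)
print(clf.predict(K_new))                # expected to be [1]: the point lies inside the circle
```

Training touches the data only through the $N \times N$ kernel matrix, and prediction only through the new point's kernel values against the training set, mirroring Equations (14) and (18).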

RBF Kernel and One Kernel for All


A very popular kernel is the Gaussian kernel, or Radial Basis Function (RBF) kernel, where:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(- \gamma {\Vert \mathbf{x} - \mathbf{x}^\prime \Vert }^2) \\
&= \exp(- \gamma {\Vert \mathbf{x} \Vert}^2)\exp(- \gamma {\Vert \mathbf{x}^\prime \Vert}^2) \exp(2 \gamma \mathbf{x}^\intercal \mathbf{x}^\prime)
\end{aligned} \end{equation},$$
where $\gamma$ is a hyper-parameter determining how sensitive the similarity score is to the distance between two points.  Let us actually work out the mapping function $\phi$, as it will help us understand this kernel inside out.  Although the math below is a bit messy, the idea is simple, so do not let the face of the equations intimidate you.

First, let us just focus on the simple 1-$d$ case for $x$, as the math quickly gets complex as the dimensionality increases; we will explain later why the conclusions obtained in 1-$d$ hold for higher dimensions.

It is not hard to guess that $\phi$ should be of the form $\exp(- \gamma x^2) \cdot \psi(x)$, so that the dot product between $\phi(x)$ and $\phi(x^\prime)$ contains $\exp(-\gamma x^2)\exp(- \gamma x^{\prime 2})$ from the first factor.  $\psi(x)$ then needs to be chosen such that the sum of $\psi(x) \cdot \psi(x^{\prime})$ across all $Z$ dimensions equals $\exp(2 \gamma xx^\prime)$. How?

Remember the Taylor expansion:
$$\begin{equation} e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \ldots + \frac{x^n}{n!} + \ldots \end{equation}$$
We can expand:
$$\begin{equation} \exp(2 \gamma xx^\prime) = 1 \cdot 1 + \frac{(\sqrt{2 \gamma}x)(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x)}^2{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \frac{{(\sqrt{2 \gamma}x)}^3 {(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}\sqrt{3!}} \ldots \end{equation}$$
Now we should be able to write out $\phi$ as:
$$\begin{equation} \phi(x) = \exp(-\gamma x^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x)}^3}{\sqrt{3!}}, \ldots \right] \\ \phi(x^\prime) = \exp(-\gamma {x^\prime}^2) \cdot \left[ 1, \frac{(\sqrt{2 \gamma}x^\prime)}{\sqrt{1!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^2}{\sqrt{2!}}, \frac{{(\sqrt{2 \gamma}x^\prime)}^3}{\sqrt{3!}}, \ldots \right] \end{equation}$$
Therefore, $\phi(x)^{\intercal}\phi(x^\prime)$ using Equations (22) and (21) produces the same result as the kernel function!  The $Z$ space here is actually an infinite-dimensional space.  In this infinite space, if you diligently carry out the dot product between $\phi(x)$ and $\phi(x^\prime)$, you end up with $K(x, x^\prime)$, which can be easily computed as $\exp\left(-\gamma {(x-x^\prime)}^2\right)$.  So the $Z$ space is only a hypothetical space in which no actual computation has to take place; we can accomplish everything in the $X$ space, thanks to the kernel function.
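As a quick numerical sanity check (my own sketch, with made-up values of $\gamma$, $x$ and $x^\prime$), we can truncate the infinite-dimensional map of Equation (22) after a few dozen terms and verify that its dot product matches the kernel value computed directly in the $X$ space:

```python
import numpy as np
from math import factorial

gamma, x, x_prime = 0.8, 0.6, 0.3

def phi(t, n_terms=30):
    """Truncated version of the infinite-dimensional map in Equation (22)."""
    coords = [(np.sqrt(2 * gamma) * t) ** n / np.sqrt(factorial(n)) for n in range(n_terms)]
    return np.exp(-gamma * t ** 2) * np.array(coords)

explicit = phi(x) @ phi(x_prime)               # dot product in the (truncated) Z space
direct = np.exp(-gamma * (x - x_prime) ** 2)   # the kernel, computed directly in X space
print(explicit, direct)                        # the two values agree to many decimal places
```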

Without loss of generality, let us assume our $x$ is within $[0, 1)$ (we can always arrange this through data normalization during the pre-processing step).  Then most of the coordinates in our infinite-dimensional map quickly approach zero, as $x^n$ converges to zero.  So effectively we only have a few informative lower-order dimensions; in most of the higher-order dimensions, all data points have coordinates near zero, therefore they are not informative.  Since a linear model can automatically do feature selection by weighting, those dimensions will get weights approximately zero.  So we are not really doing classification in an infinite-dimensional space; the effective VC dimension remains finite and SVM remains robust.

With the kernel function, solving the SVM problem in Equation (14) costs $O(N^2)$.  Then, to calculate $y$ for a new data point, we use Equation (18), which only depends on the number of support vectors, at most $O(N)$.  Therefore, the kernel enables the SVM computation to be bounded by $N$, not by the dimensionality.  The effective VC dimension is approximately proxied by the number of model parameters $\alpha$, i.e., the number of support vectors.  No matter how many features $X$ contains, SVM remains computationally bounded.

Now let us look at two extreme cases of $\gamma$ to understand the power of the RBF kernel.

If $\gamma \to 0$, the $Z$ space described by Equation (22) looks like:
$$\begin{equation} \exp(-\gamma x^2) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x, O(\gamma) \right] \end{equation}$$
We omit the higher dimensions, as their coordinates are close to zero due to the small $\gamma$.  Well, this is simply the original linear space $X$.  Therefore, the RBF kernel under very small $\gamma$ degenerates into a linear transformation, i.e., it is basically a linear SVM model in this extreme case.

Now consider the other extreme, where $\gamma$ is large.  Moving up one dimension, the coordinate changes by a factor of $\sqrt{2 \gamma}x/\sqrt{n}$, so the coordinates first increase and then eventually converge to zero.  The dimension where they peak depends on the value of $\gamma$: the larger $\gamma$ is, the more higher-order $x$ terms are included.  That is, we can control the effective polynomial order of the model by tuning $\gamma$.

By tuning $\gamma$ from very small values to larger values, our RBF kernel adjusts the hypothesis set $\mathcal H$ to range from linear models to higher-order polynomial models.  Therefore, we say the RBF kernel covers a broad spectrum of models; it is indeed one kernel for all!  For any given learning problem, we treat $\gamma$ as a regularization hyper-parameter and use CV to obtain its optimal value (Chapter 3), i.e., we tune our hypothesis space to identify the best model that contains enough complexity (a polynomial model) but is not too complex.  The RBF kernel is so powerful that it should be considered the default kernel for your SVM classifier [3].
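In practice, this tuning is just a cross-validated grid search over $\gamma$ (and usually the soft-margin penalty $C$ as well).  A minimal sketch using scikit-learn on a made-up toy dataset might look like this:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Log-spaced grid over gamma (and the soft-margin penalty C), scored by 5-fold CV.
param_grid = {"gamma": [0.01, 0.1, 1, 10, 100], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

# Small gamma behaves almost linearly; large gamma behaves like a high-order polynomial.
print(search.best_params_, search.best_score_)
```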

RBF Kernel for Higher-dimensional Input


Just to have peace of mind, let us examine the case where the input $\mathbf{x}$ is 2-$d$ and make sure everything we discussed for the RBF kernel above still holds true.  Since the complexity of the math expressions grows quickly, we will just do enough to convince you it is indeed the case, without carrying out all the details.

Following the same spirit of decomposing the kernel function using a Taylor expansion, our kernel for 2-$d$ input can be written as:
$$\begin{equation} \begin{aligned}
K(\mathbf{x}, \mathbf{x}^\prime) &= \exp(-\gamma ({(x_1-x_1^\prime)}^2 + {(x_2-x_2^\prime)}^2)) \\
&= \exp(-\gamma (x_1^2+x_2^2)) \cdot \exp(-\gamma (x_1^{\prime 2}+x_2^{\prime 2})) \cdot \exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime)
\end{aligned} \end{equation}$$
We have learned that the trick to constructing $\phi$ is to decouple $\mathbf{x}$ and $\mathbf{x^\prime}$ in the kernel function above; the first two factors have already taken care of themselves, so we need to rewrite the last two factors as a dot product.
$$\begin{equation}
\exp(2 \gamma x_1x_1^\prime) \cdot \exp(2 \gamma x_2x_2^\prime) = \\
\left[ 1 + \frac{(\sqrt{2 \gamma}x_1)(\sqrt{2 \gamma}x_1^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_1)}^2{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right] \cdot
\left[ 1 + \frac{(\sqrt{2 \gamma}x_2)(\sqrt{2 \gamma}x_2^\prime)}{\sqrt{1!}\sqrt{1!}} + \frac{{(\sqrt{2 \gamma}x_2)}^2{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}\sqrt{2!}} + \ldots \right]
\\ \end{equation}$$
Now we are ready to figure out $\phi$ by cross-multiplying the two series into a sum of products:
$$\begin{equation}
\phi(\mathbf{x}) = \exp(-\gamma (x_1^2+x_2^2)) \cdot
\left[
1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1,
\frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2,
\frac{{(\sqrt{2 \gamma}x_1)}^2}{\sqrt{2!}},
2 \gamma x_1x_2,
\frac{{(\sqrt{2 \gamma}x_2)}^2}{\sqrt{2!}},
o(x_1^2 + x_1x_2 + x_2^2) \right]
\\ \phi(\mathbf{x}^\prime) = \exp(-\gamma ({x_1^\prime}^2+{x_2^\prime}^2)) \cdot \left[ 1, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_1^\prime, \frac{\sqrt{2 \gamma}}{\sqrt{1!}}x_2^\prime, \frac{{(\sqrt{2 \gamma}x_1^\prime)}^2}{\sqrt{2!}}, 2 \gamma x_1^\prime x_2^\prime, \frac{{(\sqrt{2 \gamma}x_2^\prime)}^2}{\sqrt{2!}}, o({x_1^\prime}^2 + x_1^\prime x_2^\prime + {x_2^\prime}^2) \right] \end{equation}$$
As $\gamma$ approaches zero, the key $Z$ dimensions are $\left\{ 1, \sqrt{2 \gamma}x_1, \sqrt{2 \gamma}x_2 \right\}$, which is the original linear space.  As $\gamma$ gets larger, higher-order terms $x_1^2$, $x_2^2$, $x_1x_2$, and even higher-order polynomial terms come into play, and $\phi$ becomes a polynomial mapping.  This is the same conclusion as in the 1-$d$ case.  Also notice the cross terms such as $x_1x_2$: features from the input $X$ space are mingled together in many of the $Z$ dimensions.  If there is a noisy feature in the input space $X$, it will contaminate many dimensions in the $Z$ space.  A linear model is able to do interpretable feature weighting and feature selection, but that now happens in the $Z$ space; the weights cannot be interpreted in the original $X$ space.  Therefore, feature selection should be done in the $X$ space, before we use the kernel SVM.  Otherwise, one would have to rely on iteratively removing one feature at a time within $X$, repeating the SVM process, and discarding the features whose removal does not degrade the performance.  This feature selection process can be computationally expensive (see the sketch below).
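A minimal sketch of that greedy backward-elimination idea, under my own assumptions (scikit-learn's SVC, 5-fold cross-validation, a made-up tolerance, and a toy dataset with appended noise features), could look like this:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def backward_select(X, y, tol=0.005):
    """Greedy backward elimination in X space: drop a feature whenever removing it
    does not degrade cross-validated accuracy by more than `tol`."""
    keep = list(range(X.shape[1]))
    base = cross_val_score(SVC(kernel="rbf"), X[:, keep], y, cv=5).mean()
    dropped_something = True
    while dropped_something and len(keep) > 1:
        dropped_something = False
        for j in list(keep):
            trial = [k for k in keep if k != j]
            score = cross_val_score(SVC(kernel="rbf"), X[:, trial], y, cv=5).mean()
            if score >= base - tol:          # feature j is not pulling its weight
                keep, base, dropped_something = trial, max(base, score), True
                break
    return keep

# Two informative features plus three pure-noise features.
rng = np.random.default_rng(0)
X_info = rng.normal(size=(200, 2))
y = np.where(X_info.sum(axis=1) > 0, 1, -1)
X = np.hstack([X_info, rng.normal(size=(200, 3))])
print(backward_select(X, y))                 # typically keeps only features 0 and 1
```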

To summarize, we have explained why SVM with an RBF kernel is not only a very robust classifier with built-in generalization capability, but also a very powerful one, as its hyper-parameter $\gamma$ allows us to continuously cover linear models, higher-order polynomial models, and everything in between.  This is a beautiful ML technique.

Classification with Probability


One limitation of SVM is that it only predicts the label $y$ without providing a probability estimate.  We have seen in Chapter 5 that probability can be estimated using a linear model called the logistic regression (LR) model, which minimizes a cross-entropy error plus a regularization error.  The disadvantage of LR is that the solution depends on all $N$ input data points, whereas SVM depends on only a few support vectors.  Therefore, SVM has a much better generalization property.  Can we estimate class probability, while still taking advantage of the support vectors?

Yes.  The idea is to run SVM first, then take $d_i = \mathbf{w}^\intercal \phi(\mathbf{x}_i) + b$, i.e., the (signed) distance of $\mathbf{x}_i$ to the decision hyper-plane in the $Z$ space, and use the pairs $(d_i, y_i)$ to train a one-dimensional LR model:
$$\begin{equation} f(+1 \vert d) = \frac{e^{\alpha d + \beta}}{1+e^{\alpha d + \beta}}. \end{equation}$$
There are some nuances in exactly how to compute $d_i$ [4], e.g., using cross-validation to obtain $d_i$ in an unbiased manner, but the main idea is as outlined above.  For points within and near the margin, the predicted probability varies from $\sim 0$ to $\sim 1$; for most points further away from the decision boundary, the probability quickly goes to zero (for the $-1$ class) or one (for the $+1$ class).
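A minimal sketch of this calibration step (my own assumptions: scikit-learn, a made-up dataset, and cross-validated decision values standing in for the unbiased $d_i$) might look like the following:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

svm = SVC(kernel="rbf", gamma=0.5)

# Obtain each point's signed distance d_i in a roughly unbiased way via cross-validation,
# then fit a one-dimensional logistic regression on the pairs (d_i, y_i).
d = cross_val_predict(svm, X, y, cv=5, method="decision_function")
platt = LogisticRegression().fit(d.reshape(-1, 1), y)

svm.fit(X, y)                                       # final SVM trained on all the data
d_new = svm.decision_function(X[:3])                # scores for a few "new" points
print(platt.predict_proba(d_new.reshape(-1, 1)))    # calibrated class probabilities
```

For what it is worth, scikit-learn's `SVC(probability=True)` performs a similar Platt-style calibration internally.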

It is also worth mentioning that although most literature states multi-class classification problems can be solved by SVM using either $K$ one-vs-all SVMs or $\frac{K(K-1)}{2}$ one-vs-one SVMs (for $K$ classes), the problem can also be directly solved by minimizing a hinge loss function in the original $X$ space (the kernel trick cannot be used there) [5].  This enables linear SVM to be used in Deep Learning as an alternative to the multi-class Logistic Regression classifier.
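For reference, here is a small NumPy sketch of the multi-class hinge loss described in [5], computed on a made-up mini-batch; the margin $\Delta$ and the data are my own choices:

```python
import numpy as np

def multiclass_hinge_loss(W, X, y, delta=1.0):
    """Multi-class SVM (hinge) loss of [5]: for each sample, penalize every class whose
    score comes within `delta` of the true class score."""
    scores = X @ W                                     # (N, K) linear scores in the X space
    correct = scores[np.arange(len(y)), y][:, None]    # (N, 1) score of each true class
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0.0                # the true class itself contributes no loss
    return margins.sum() / len(y)

# Tiny made-up batch: 4 samples, 3 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 1])
W = rng.normal(size=(3, 3))
print(multiclass_hinge_loss(W, X, y))
```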

Reference


  1. KDS, page 173-175
  2. LFD, eChapter 8, page 40-45.
  3. https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, Page 4.
  4. http://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf
  5. http://cs231n.github.io/linear-classify (see multi-class support vector machine loss)

Thursday, May 25, 2017

Chapter 4. Linear Models

We have already seen linear models in previous chapters.  Here we are going to study linear models more systematically, as they are the foundation for understanding more complex models.  In many cases, non-linear models can be reduced to linear models, and we will see how that works at the end of the chapter.

Linear Classification – Accuracy


Consider a classification problem, where the inputs are $X$ and $\mathbf{y}$.  $X$ is an $N \times (d+1)$ matrix and $\mathbf{y}$ is an $N \times 1$ vector with elements $\in \left\{-1, +1\right\}$.  Notice we add the element 1 as the first dimension of each input vector, so that the math later can be written more concisely.  Please see the end of Chapter 1 for how $X$ and $\mathbf{y}$ are spelled out in matrix form (Chapter 1, Equations (1)-(4)).

A linear classifier uses the model:
$$\begin{equation} \mathbf{y} = {\rm sign}(X\mathbf{w}) \end{equation},$$
where $\mathbf{w}$ is the $(d + 1)$-element weight vector $[ w_0, w_1, w_2, \ldots, w_d ]^\intercal$.  We incorporated the extra first dimension into $X$, so that $w_0$ can be part of $\mathbf{w}$.

A natural choice of the cost function to maximize accuracy is:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert \end{equation}. $$
This is not quite right: now that we are ML-educated, we should also take into account the regularization error:
$$\begin{equation} \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \vert y_i - {\rm sign }(\mathbf{w}^\intercal \mathbf{x}_i) \vert + \lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}} \end{equation}, $$
where $\tilde{\mathbf{w}} = [ w_1, w_2, \ldots, w_d ]^\intercal$.  We use ridge regularization here for simplicity, but it can certainly be another regularization form you prefer.  Because the ${\rm sign}(\cdot)$ function is not differentiable, we can only solve this equation numerically and iteratively.

At a certain step $n$, our guess of the answer is $\mathbf{w}_n$.  Let us assume point $\mathbf{x}_1$ is classified correctly and $\mathbf{x}_2$ is misclassified.  We consider making the next guess $\mathbf{w}_{n+1}$ within a small neighborhood of $\mathbf{w}_n$.  Let us assume the change in the prediction for $\mathbf{x}_1$ is small and ignore that part for now, as $\mathbf{w}_n$ is already good from its perspective.  We need to make $\mathbf{w}_{n+1}$ a better predictor for $\mathbf{x}_2$.  If $y_2$ is $+1$, then $\mathbf{w}_n^\intercal \mathbf{x}_2$ is currently negative (as it is misclassified), so it makes sense for $\mathbf{w}_{n+1}$ to incorporate a positive component along $\mathbf{x}_2$, i.e., $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 > 0$.  Otherwise, if $y_2$ is $-1$, we would like $(\mathbf{w}_{n+1}-\mathbf{w}_n)^\intercal \mathbf{x}_2 < 0$, so that we have a better chance of correctly predicting $\mathbf{x}_2$ in the next iteration.

Considering both cases, we can design:
$$\begin{equation} \mathbf{w}_{n+1} = \mathbf{w}_n + \eta y_2 \mathbf{x}_2, \end{equation}$$
so that $\mathbf{w}_{n+1}^\intercal \mathbf{x}_2$ has an additional term $\eta y_2 \mathbf{x}_2^\intercal \mathbf{x}_2$, which always moves $\mathbf{w}$ in a way that makes the classification outcome for $\mathbf{x}_2$ more favorable.  $\eta$ is a positive parameter called the learning rate.  $\eta$ should be large enough that we can explore the $\mathbf{w}$ space quickly, while also small enough that we do not jump over and miss the minimum.  Empirically, let us assume $\eta$ is 0.1 without going off topic [1].

So the learning algorithm goes as follows: it starts with $\mathbf{w}_0$; if there are points that are misclassified, it randomly picks one of them and uses it to alter $\mathbf{w}$ according to Equation (4); then it loops this process.  If the data points are indeed linearly separable and there is no regularization error term, it can be proven that the algorithm will eventually converge with all data points $100\%$ correctly classified, so the minimization algorithm works.  In reality, the algorithm may never stop, as not all data points can be correctly classified; we can memorize the solution associated with the minimum cost identified so far and use that solution as our final answer $\mathbf{w}^*$ (see the sketch after Figure 4.1).

What would be a good starting guess for $\mathbf{w}_0$?  If the data are reasonably separable, we expect the data points to form two clusters: points with the same label are expected to be closer together, and further away from points with the opposite label (Figure 4.1).  Therefore, a reasonable decision plane has its normal vector $\mathbf{w}$ pointing from the center of the $-1$ cluster to the center of the $+1$ cluster.  A good initial guess is:
$$\begin{equation} \mathbf{w}_0 = \frac{1}{N^{+1}} \sum \mathbf{x} \vert_{y = +1} - \frac{1}{N^{-1}} \sum \mathbf{x} \vert_{ y = -1}. \end{equation}$$
$\mathbf{w}_0$ can certainly be scaled to adjust the initial regularization cost.  The optimal hyper-parameter $\lambda$ will be determined through a cross-validation routine described in the previous chapter.
Figure 4.1.  A reasonable choice of $\mathbf{w}_0$ is the one that points from $\mu_{-}$ to $\mu_{+}$.
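Below is a minimal sketch of the iterative scheme above, under my own simplifications: the regularization term is dropped, the update rule of Equation (4) is applied to a randomly chosen misclassified point, the best $\mathbf{w}$ seen so far is memorized, and $\mathbf{w}_0$ is initialized with the mean difference of Equation (5):

```python
import numpy as np

def train_linear_classifier(X, y, eta=0.1, n_iters=1000, rng=None):
    """Start from the mean-difference w_0 of Equation (5), then repeatedly pick a
    misclassified point and update w <- w + eta * y_i * x_i (Equation (4)),
    keeping the best w seen so far.  The regularization term is omitted for brevity."""
    if rng is None:
        rng = np.random.default_rng(0)
    Xb = np.hstack([np.ones((len(X), 1)), X])                 # prepend the constant-1 feature
    w = np.r_[0.0, X[y == +1].mean(axis=0) - X[y == -1].mean(axis=0)]
    best_w, best_err = w.copy(), np.inf
    for _ in range(n_iters):
        wrong = np.where(np.sign(Xb @ w) != y)[0]
        if len(wrong) < best_err:
            best_w, best_err = w.copy(), len(wrong)
        if len(wrong) == 0:
            break
        i = rng.choice(wrong)                                 # a random misclassified point
        w = w + eta * y[i] * Xb[i]
    return best_w

# Toy, (almost surely) linearly separable data around (+1, +1) and (-1, -1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 0.5, size=(50, 2)), rng.normal(-1, 0.5, size=(50, 2))])
y = np.r_[np.ones(50), -np.ones(50)]
print(train_linear_classifier(X, y))
```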

Logistic Regression – Cross Entropy Loss Function


The truth is not always black and white in classification problems.  Applicants with the same salary, age, and other feature values could behave differently: some turn out to be money makers for the bank and some to be money losers.  Among those who are credit worthy, for some we are extremely confident they are money makers, and the bank could offer more attractive terms and start with a generous credit limit; others are marginal and should be either deprioritized or approved under somewhat more stringent terms, including a lower credit limit.  Therefore, it would be nice to model the probability of a data point being $+1$ or $-1$.  The probability prediction could be handy, e.g., for a cell phone company to retain customers: those predicted to churn with high probability should be prioritized for the retention department to work on.  Even if we do not care about probability, minimizing a probability cost function could lead to better classification results, as it encourages classes to be pushed further apart and results in a more robust classifier [2].

This probabilistic classification problem can be formulated as follows: the truth is $p(y \vert \mathbf{x})$.  We are given $N$ records sampled from the distribution $p(\mathbf{x}, y)$ and we would like to model $p(y = +1 \vert \mathbf{x})$ with a family of hypotheses $f(\mathbf{x})$.  The probability of the $-1$ class is then $p(y = -1 \vert \mathbf{x}) = 1-f(\mathbf{x})$.  The optimal hypothesis $g$ is the one that gives the maximum likelihood based on the given data, subject to the constraint of the regularization error.  We can use a linear model, called Logistic Regression (LR), to solve this problem.

Our LR hypothesis space for modeling $p(+1 |\mathbf{x})$ is:
$$\begin{equation}
f(s) = {e^s}/(1+e^s),
\end{equation}$$
where
$$\begin{equation}
s = \mathbf{w}^\intercal \mathbf{x}.
\end{equation}$$
The logistic function is convenient, because the probability of obtaining the $-1$ label is:
$$\begin{equation} \begin{aligned}
1-f(s) &= 1/(1+e^s) \\ & = f(-s)
\end{aligned} \end{equation}.$$
In other words, the probability of a data point with label $y$ is:
$$\begin{equation}
p(y_i \vert \mathbf{x}_i) = f(y_i \mathbf{w}^\intercal \mathbf{x}_i).
\end{equation}$$

That is, we first calculate a linear score $s$, and hope that the more positive the score $s$ is, the more likely the label is $+1$, and the more negative the score, the more likely the label is $-1$.  We then somewhat arbitrarily convert this linear score from the range $(-\infty, + \infty)$ into the range $(0, 1)$ using Equation (6), so that it has a chance to be interpreted as a probability score.

Let us use the Bayesian approach to find $\mathbf{w}^*$:
$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\max_{\mathbf{w}} \; p(\mathbf{w} \vert X, \mathbf{y}) \\
&= \arg\max_{\mathbf{w}} \; \prod_{i=1}^N \; \frac{p(y_i \vert \mathbf{x}_i, \mathbf{w})p(\mathbf{w})}{p(y_i|\mathbf{x}_i)p(\mathbf{x}_i)}. \end{aligned} \end{equation}$$
The denominator is independent of $\mathbf{w}$ and can be ignored.  We can then minimize the negative log likelihood of the numerator:
$$\begin{equation}
\mathbf{w}^* =  \arg\min_{\mathbf{w}} \; - \sum_{y_i=+1}\log(f(\mathbf{w}^\intercal \mathbf{x}_i)) - \sum_{y_i=-1}\log(1-f(\mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
Using Equation (9), it can be simplified into:
$$\begin{equation}
\mathbf{w}^* = \arg\min_{\mathbf{w}} \; - \sum_{i=1}^N \; \log(f(y_i \mathbf{w}^\intercal \mathbf{x}_i)) - \log p(\mathbf{w}).
\end{equation}$$
The first term is the so-called cross-entropy loss function, which is an important error measure when we would like to model a probability distribution.  This term is basically the sum of the negative log of the correct-class probabilities over all data points, i.e., we want the probability of each data point belonging to its true class to be maximized.  The second term basically gives a regularization error term, containing a parameter $\lambda$, which we have already discussed in Chapter 3.  In addition, $\mathbf{w}$ should be replaced by $\tilde{\mathbf{w}}$ in the second term.

Equation (12) has no analytical solution, so we have to solve it numerically.  Let us introduce the idea of gradient descent.  To minimize a general function $f(\mathbf{x})$, we start with an initial guess $\mathbf{x}_0.$  The gradient at $\mathbf{x}_0$ is $\nabla f(\mathbf{x}_0)$, pointing in the direction where $f(\mathbf{x})$ increases fastest.  Therefore, we should search in the opposite direction.  Our next guess therefore would be:
$$\begin{equation} \mathbf{x}_1 = \mathbf{x}_0 - \eta \nabla f(\mathbf{x}_0) \end{equation}.$$
A Taylor expansion shows $f(\mathbf{x}_1) - f(\mathbf{x}_0)$ would be negative (for a small enough $\eta$).  We can iterate this process to find $\mathbf{x}_2$, $\mathbf{x}_3$, …, until the process converges or we use up our computational quota.  Searching for smaller $f$ values along the negative gradient direction explains the name gradient descent (GD).  Among the many optimization algorithms out there, GD is popular in machine learning because a variant called Stochastic Gradient Descent (SGD) enables us to scale the algorithm to large training data sets [3].  It has been proven that Equation (12) has only one minimum, which makes it very straightforward to solve, and we will not get into the proof.
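As an illustration, here is a minimal gradient-descent sketch for Equation (12) (my own toy data and hyper-parameters; the ridge penalty excludes $w_0$ as discussed above):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_logistic_gd(X, y, lam=0.1, eta=0.05, n_iters=2000):
    """Plain gradient descent on Equation (12): cross-entropy loss plus a ridge penalty
    that excludes w_0.  X already carries the constant-1 first column; y is in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        z = y * (X @ w)                              # y_i * w^T x_i
        grad = -((1.0 - sigmoid(z)) * y) @ X         # gradient of the cross-entropy term
        grad += 2.0 * lam * np.r_[0.0, w[1:]]        # ridge gradient, skipping w_0
        w -= eta * grad / n
    return w

# Toy data: two Gaussian blobs labeled -1/+1, with the constant-1 column prepended.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1.0, size=(100, 2)), rng.normal(-1, 1.0, size=(100, 2))])
X = np.hstack([np.ones((200, 1)), X])
y = np.r_[np.ones(100), -np.ones(100)]
w = fit_logistic_gd(X, y)
print(w, sigmoid(X[:3] @ w))                         # weights and P(+1|x) for a few points
```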

A question we may have is this: we simply coin a function $e^s/(1+e^s)$ and do the parameter fitting on $\mathbf{w}$, then we argue the value of this function has the meaning of the probability $p(+1|\mathbf{x})$.  To us it is a huge leap of faith to associate a function with range $(0, 1)$ with a probability interpretation; the range $(0, 1)$ is only a necessary condition for a function to carry a probability meaning, not a sufficient one.  At least this is something that bothered me for quite a while; indeed some authors argue the $p$ values given by logistic regression should not be treated as probabilities, but I now believe they should be.  We could agree $f(\mathbf{w}^\intercal \mathbf{x})$ can be used for ranking data points, but why is its value an estimate of probability?  The answer lies in the fact that the first two terms in Equation (11) are cross-entropy terms.

Imagine there are plenty of data points within the neighborhood of $\mathbf{x}_i$, of which $n_{+}$ and $n_{-}$ are generated from the two underlying classes, respectively, i.e., the probability of being positive is $p_{+} = \frac{n_{+}}{n_{+}+n_{-}}$.  For this subset of data points, the first two terms in Equation (11) become:

$$\begin{equation} - n_{+} \log f(\mathbf{x}_i) - n_{-}  \log(1-f(\mathbf{x}_i)) \end{equation}. $$

It is straightforward to prove this cross-entropy term reaches its minimum when $f = p_{+}$ (see the derivation below); therefore, our optimal solution $f$ is molded to be as close to $p_{+}$ as possible.  The logistic model is one naive guess to mimic $p_{+}$, but if someone comes up with a better guess $f^\prime$, that model would be scored more favorably.  So by minimizing our error function in Equation (11), $f$, the winning model, approaches $p_{+}$ and can therefore be interpreted as a proxy for $p_{+}$.  Certainly $f$ might not be an accurate proxy, if the true shape of $p_{+}$ is far from a logistic shape.
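For completeness, the calculus behind that claim is one line: treating the value of $f$ at $\mathbf{x}_i$ as a free variable and setting the derivative of the expression above to zero gives

$$\frac{d}{df}\left[ -n_{+} \log f - n_{-} \log(1-f) \right] = -\frac{n_{+}}{f} + \frac{n_{-}}{1-f} = 0 \quad\Longrightarrow\quad f = \frac{n_{+}}{n_{+}+n_{-}} = p_{+},$$

and the second derivative $n_{+}/f^2 + n_{-}/{(1-f)}^2 > 0$ confirms this is indeed a minimum.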

Linear Regression – Maximum Likelihood


The linear regression problem is to model an unknown true target function $y = f(\mathbf{x})$ using a linear model.  That is, we aim to search within the hypothesis set containing hyperplanes described by $y = \mathbf{w}^\intercal \mathbf{x}$ for a hypothesis $g$, so that $g(\mathbf{x}) \approx f(\mathbf{x})$.

The quality of $g$ can be measured by least square error (discussed in Chapter 2):
$$\begin{equation} \begin{aligned}
e(\mathbf{w}) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} - \mathbf{y})^\intercal (X\mathbf{w} - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w}.
\end{aligned} \end{equation}$$
Again, the second term is a regularization term; although we use the ridge regression form, feel free to use Lasso or other forms.  The ridge form, however, gives a beautiful analytical solution, obtained by setting the derivative with respect to $\mathbf{w}$ to zero in Equation (15):
$$\begin{equation} (X^\intercal X + \lambda I)\mathbf{w} = X^\intercal \mathbf{y} \end{equation},$$
$$\begin{equation} \mathbf{w}^* = {(X^\intercal X + \lambda I)}^{-1}X^\intercal \mathbf{y} \end{equation}.$$
Notice:
$$\begin{equation} X^\intercal \mathbf{y} = \sum_i \; y_i \mathbf{x}_i \end{equation},$$
the final solution is a linear combination of all input vectors, i.e., all data points contribute.

In practice, we do not use Equation (17).  This is because the regularization term is actually $\lambda \tilde{\mathbf{w}}^\intercal \tilde{\mathbf{w}}$; we either have to use the GD method, or need to center the input vectors in order for the analytical solution to work [4].  Let us resort to the form where $w_0$ is taken out of the vector $\mathbf{w}$ and written as $b$:
$$\begin{equation} \begin{aligned}
e(\mathbf{w}, b) &= \sum_{i=1}^N \; (\mathbf{w}^\intercal \mathbf{x}_i + b - y_i)^2 + \lambda \mathbf{w}^\intercal \mathbf{w} \\
&= (X\mathbf{w} + b - \mathbf{y})^\intercal (X\mathbf{w} + b - \mathbf{y}) + \lambda \mathbf{w}^\intercal \mathbf{w}.
\end{aligned}. \end{equation}$$
To minimize $e$, we set the partial derivatives with respect to $\mathbf{w}$ and $b$ to zero, and obtain the following:
$$\begin{equation} \begin{aligned} b &= \frac{1}{N} \sum(y_i - \mathbf{w}^\intercal \mathbf{x}_i) \\ &= \frac{1}{N} \sum y_i \end{aligned} \end{equation}$$
$$\begin{equation} \begin{aligned} (X^\intercal X + \lambda I)\mathbf{w} &= X^\intercal (\mathbf{y} - b). \end{aligned} \end{equation}$$
The value of $b$ in Equation (20) takes advantage of the assumption that the $\mathbf{x}$ have been zero-centered.  Equation (21) is basically what we saw before, with $\mathbf{y}$ replaced by $\mathbf{y} - b$.  Therefore, linear regression can still be solved analytically with the regularization term.
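A minimal sketch of this closed-form solution (my own toy data; Equations (20) and (21) implemented directly with NumPy, with the centering folded back into the intercept at the end):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression with the bias handled by centering (Equations (20)-(21)):
    b is the mean of y, and w solves (X^T X + lam*I) w = X^T (y - b) on zero-centered X."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                                   # zero-center the inputs
    b = y.mean()
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - b))
    return w, b - w @ x_mean                          # fold the centering back into the intercept

# Toy data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=0.5))                       # roughly [2, -1, 0.5] and 3
```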
One interesting question: since we can find the optimal hypothesis $g$ from the above analytical form without evaluating all the other infinitely many linear hypotheses, does it mean our hypothesis set $\mathcal H$ only contains one hypothesis $g$, and we therefore do not have to worry about over-fitting?  That is obviously too good to be true.

Imagine there are three algorithms: the first one painstakingly goes through a rather tedious grid search in the parameter space, explores a vast set of hypotheses in $\mathcal H$, and eventually finds $\mathbf{w}^*$.  The second one uses the GD method in Equation (13) and identifies $\mathbf{w}^*$ after multiple iterations.  The third one takes advantage of Equation (17) and lands on $\mathbf{w}^*$ in a single step.  They basically explore the same hypothesis space, except the third one is the smartest in avoiding the searches it knows will not come out favorably (or consider that the third one has explored all hypotheses virtually).  Our hypothesis set still contains infinitely many linear hypotheses; the reason we are not testing them individually is that we happen to be able to foresee that their performance is not going to be better than the analytical solution.  This "super vision" only comes after we see $\mathcal{D}$.  In other words, all linear hypotheses in $\mathcal H$ have a chance to be our analytical solution; the best hypothesis $g$ is not known beforehand, but only after $\mathcal D$ is given.  Therefore, what determines the VC dimension here is not how we reach the final answer, but the effective total hypothesis candidate space.

For more complex models, where we do not have an analytical solution, we rely on the GD algorithm to explore the hypothesis space like the second algorithm.  We are not just evaluating one hypothesis per iteration; we have effectively explored many hypotheses in the vicinity of the trial hypothesis under evaluation.  It is just that we have the knowledge to believe that hypotheses not along the negative gradient direction are unlikely to perform better than the next hypothesis along that direction.  So even though our solution is found after $n$ iteration steps, our hypothesis space is much larger than $n$.  In those cases, the effective hypothesis space explored by GD is indeed smaller than the total hypothesis space, as there are probably other independent local minima missed by GD.  In the linear regression example, the space that GD does not explore happens to contain no local minimum.  The bottom line is that having an analytical solution does not eliminate the need for regularization.
Figure 4.2.  The effective hypothesis space explored by the Gradient Descent algorithm is more than the individual red hypothesis points.

Feature Selection Capability of Linear Models


Often only a subset of the features is informative; the rest are irrelevant to the learning problem.  Many algorithms cannot distinguish informative features from noisy ones and are confused by the latter.  For example, if only a small portion of the features is informative and we use the $k$-nearest neighbor algorithm with Euclidean distance as the similarity metric, where all features are weighted equally, then the distance would be dominated by noise and the performance would be poor.  The goal of feature selection is to identify and eliminate a subset of features without sacrificing too much $E_{in}$.  By using fewer features, our model requires fewer parameters, therefore a smaller VC complexity, therefore a smaller $E_{reg}$, which eventually leads to an overall smaller $E_{out}$.

A linear model is interesting in that it automatically does feature prioritization.  Consider the two-dimensional two-class example: the weight vector $\mathbf{w}^*$ approximately points from the center of the $-1$ cluster to the center of the $+1$ cluster (Figure 4.1).  In this example, $x_1$ is more informative than $x_2$, and we have $w_1 \gg w_2$, because:
$$\begin{equation} w_i = \frac{\partial{y}}{\partial{x_i}} \end{equation}.$$
So the linear plane points in the direction where the gradient of $y$ tends to be the largest, and the weight vector, the normal of the hyperplane, basically describes how important each feature dimension is and weights them accordingly.

Non-linear Classification


In Figure 4.3, the boundary between the two classes is obviously non-linear.  One way is to use a hypothesis set that includes non-linear boundary models.  Another way, which we prefer, is to transform this problem into a linear problem!

The general idea is to introduce a mapping function from the original $X$ space into a new space $Z$:
$$\begin{equation} X \to Z: Z = \phi(X).\end{equation}$$
The mapping is non-linear, e.g., $Z$ is a second-order polynomial space, $Z = \left[ 1, x_1, x_2, x_1^2, x_1x_2, x_2^2 \right]$.  We here project a three-dimensional data point $[1, x_1, x_2]$ into six dimensions.  Hopefully, the non-linear transformation has absorbed the non-linearity in the data, so that when we examine data points $[z_0, z_1, z_2, \ldots , z_m]$ in the $Z$ space (with dimensionality $m$), a linear model works well there:
$$\begin{equation} \begin{aligned} y &= {\rm sign} (\mathbf{w}^\intercal \mathbf{z}) \\ &= {\rm sign} (\mathbf{w}^\intercal \phi(\mathbf{x})). \end{aligned} \end{equation}$$
In $Z$ space, we use the linear solvers discussed above to find the optimal weights $\mathbf{w}^* = \left[ w_0, w_1, \ldots,  w_m \right]$.  Let us assume in our particular example, the optimal solution is $\mathbf{w}^* = (1, 0, 0, -1, 0, -1)$. The decision function in the $X$ space will be non-linear:
$$\begin{equation} \begin{aligned} y &= {\rm sign}(\mathbf{w}^{*\intercal}\mathbf{z}) \\ &={\rm sign}(1-z_3-z_5) \\ &= {\rm sign}(1-x_1^2 - x_2^2). \end{aligned} \end{equation}$$
We have shown how to obtain a good non-linear model, where the core computation of our learning algorithm takes place in a linear space $Z$ (a minimal sketch follows Figure 4.3).  We will discuss more on how to automatically include non-linearity using a Support Vector Machine classifier in the next chapter.  Please read the corresponding LFD chapters for more examples [5].

Figure 4.3. Relying on a non-linear mapping function, linear hyper-plane works well in the $Z$ space.  The decision plane corresponds to a non-linear decision boundary in the original $X$ space.
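A minimal sketch of this idea (my own toy dataset with a circular boundary, using scikit-learn's logistic regression as the linear solver in both spaces):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi(X):
    """Second-order polynomial map X -> Z = [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# Toy data with a circular boundary: +1 inside the unit circle, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1.2, 1.2, size=(400, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

# A plain linear classifier struggles in the X space but does well in the Z space.
acc_x = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
acc_z = LogisticRegression(max_iter=1000, fit_intercept=False).fit(phi(X), y).score(phi(X), y)
print(acc_x, acc_z)                                   # roughly 0.5-0.6 versus close to 1.0
```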



Regularization Again


A puzzling question regarding the regularization term is why models with larger $\mathbf{w}$ are considered more complex.  If our model is non-linear, or our linear model lives in the transformed $Z$ space, some weight parameters represent higher-order $\mathbf{x}$ terms, i.e., more complex hypotheses; reducing those weights does reduce the model complexity, which is not hard to understand.  The challenge is: if we are in the original linear $X$ space, why is a hyperplane with larger $\mathbf{w}$ considered more complex?

For a linear regression problem, if both $\mathbf{x}$ and $\mathbf{y}$ are normalized so that they are centered around zero, we could appreciate that lines with $\mathbf{w}$ closer to zero might be more likely, as they are rather straightforward hypotheses.  A better way to think about this is: given two hypotheses of similar $E_{in}$, i.e., their predictions are about the same, $y\approx  \mathbf{w_1}^\intercal \mathbf{x} \approx \mathbf{w_2}^\intercal \mathbf{x}$, since the gradient is $\nabla_{\mathbf{x}} y = \mathbf{w}$, the hypothesis with smaller $\mathbf{w}$ has a smaller gradient with respect to $\mathbf{x}$.  This means we can tolerate larger input noise $\sigma_{\mathbf{x}}$ while the prediction remains essentially unchanged; such hypotheses should be more generalizable.  Also notice this argument naturally excludes $w_0$ from regularization, as it is not present in the gradient expression.

For a linear classification problem, one can scale $\mathbf{w}$ by a positive value and retain the same accuracy, as $y = \rm{sign}(\mathbf{w}^\intercal \mathbf{x})$ remains unchanged for all $c\cdot \mathbf{w}$.  On one hand, this means our hypothesis space is redundant; reducing the redundancy by introducing regularization makes the hypothesis search easier.  On the other hand, similar to the regression case, smaller $\mathbf{w}$ means the transition from $-1$ to $+1$, as we move from one class of $\mathbf{x}$ to the opposite class, is smoother.  When the data points are linearly separable in the logistic regression case, the optimal solution without regularization is $\mathbf{w} \to \infty$, where the probability function is a sharp step function; this basically is an overfit [6].  Constraining $\mathbf{w}$ to be small essentially excludes such step functions.  A small $\mathbf{w}$ means $y$ changes slowly, which corresponds to a maximum margin of the decision boundary, as we will see in the next chapter.
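A quick way to see this effect (my own sketch, using scikit-learn's logistic regression, whose parameter $C$ is the inverse of the regularization strength) is to fit a linearly separable toy dataset with progressively weaker regularization and watch $\Vert \mathbf{w} \Vert$ grow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Linearly separable toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 0.3, size=(50, 2)), rng.normal(-2, 0.3, size=(50, 2))])
y = np.r_[np.ones(50), -np.ones(50)]

for C in [1e6, 1.0, 0.01]:                     # C is the inverse of the regularization strength
    w = LogisticRegression(C=C, max_iter=10000).fit(X, y).coef_.ravel()
    print(C, np.linalg.norm(w))                # ||w|| shrinks as the regularization gets stronger
```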

Extra - Multiclass Classifier

What we described here are binary classifiers; in reality, many problems contain multiple target labels, such as identifying the object in a given picture, where the object can be one of 1000 candidates.  Popular approaches for multiclass classifiers are described in another blog post.

Reference


  1. LFD, page 94-95
  2. DLB, page 270
  3. LFD, page 97-98
  4. The Elements of Statistical Learning. Trevor Hastie, Robert Tibshirani, Jerome Friedman. Second Edition. page 63-64.
  5. LFD, page 99-104
  6. Christopher M. Bishop. Pattern Recognition and Machine Learning. p.206.
    It is worth noting that maximum likelihood can exhibit severe over-fitting for data sets that are linearly separable. This arises because the maximum likelihood solution occurs when the hyperplane corresponding to $\sigma = 0.5$, equivalent to $\mathbf{w}^\intercal \mathbf{x} = 0$, separates the two classes and the magnitude of $\mathbf{w}$ goes to infinity. $\cdots$ Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data set is linearly separable.  The singularity can be avoided by inclusion of a prior and finding a MAP solution for $\mathbf{w}$, or equivalently by adding a regularization term to the error function.