Saturday, May 27, 2017

Chapter 6. The Path Forward

Review


For a machine to learn from data, it explores a set of hypotheses $\mathcal{H}$ provided by us and identifies the best hypothesis $g$ based on the training data set $\mathcal{D}$, which contains $N$ data records.  With high confidence, the out-of-sample performance of the solution, $E_{out}$, differs from its within-sample performance $E_{in}$ by a generalization penalty (Figure 6.1):
$$\begin{equation} \begin{aligned}
E_{out} &= E_{in} + E_{reg} \\
&\approx E_{in} + \epsilon\left(\sqrt{\frac{\ln M}{N}} \right) \\
&\approx  E_{in} + E_{reg}(\lambda),
\end{aligned} \end{equation}$$
where $M$ is the effective number of independent hypotheses and $\lambda$ is employed to adjust this number.
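
To make the $\epsilon$ term concrete, one standard form (the Hoeffding plus union-bound argument for a finite hypothesis set, restated here as a reminder rather than as part of the earlier chapters) is

$$P\left[\, E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}} \,\right] \ge 1-\delta,$$

which is where the $\sqrt{\ln M / N}$ dependence comes from: more data ($N\uparrow$) tightens the bound, while a richer hypothesis set ($M\uparrow$) loosens it.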

Figure 6.1.  What we care about is $E_{out}$, i.e., the upper bound on the true performance that holds with high confidence $1-\delta$, where $\delta$ is usually some small value.  To minimize $E_{out}$, we should minimize not only $E_{in}$ but also the confidence interval $\epsilon$, which depends on $N$ and $M$.  Increasing the sample size $N$ reduces $\epsilon$; shrinking the hypothesis space, i.e., reducing $d_{vc}$ and the effective hypothesis count $M$, also reduces $\epsilon$.  As the functional form of $\epsilon$ is generally unknown, we typically model $E_{reg}$ with a penalty term that continuously adjusts the effective size of our hypothesis space.

To minimize $E_{in}$, we need to employ sophisticated models that are close to the target function $f$, which means enlarging the hypothesis search space.  To minimize $E_{reg}$, we need to either reduce the effective hypothesis count $M$ or increase the sample size $N$, and the latter is often not an option.  From the Bayesian perspective, $E_{in}$ is the evidence carried by the data set, $E_{reg}$ is our prior knowledge, and $\lambda$ tunes our stubbornness in learning from data.

To balance the two opposing requirements on model complexity, we modulate $M$ by adjusting a hyper-parameter $\lambda$.  The optimal value of $\lambda$, $\lambda^*$, can be determined empirically by a cross-validation protocol in which the estimated $E_{out}$ score is minimized.  $\lambda^*$ then determines the model $g$, which is expected to achieve the optimal $E_{out}^*$.  This is the core principle of the learning theory introduced in Chapters 2 and 3.
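
As a concrete illustration of this protocol, here is a minimal sketch (assuming scikit-learn and using ridge regression, where $\lambda$ appears as the penalty strength alpha; the data are a made-up toy set, not an example from the earlier chapters):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV

    # Toy data: N = 100 records, 20 features, noisy linear target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = X @ rng.normal(size=20) + 0.5 * rng.normal(size=100)

    # Scan a grid of lambda (alpha) values; 5-fold CV estimates E_out for each.
    grid = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5,
                        scoring="neg_mean_squared_error")
    grid.fit(X, y)

    print("lambda* =", grid.best_params_["alpha"])   # the chosen regularization strength
    print("estimated E_out* =", -grid.best_score_)   # CV estimate of out-of-sample error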

Standing on top of this learning theory, we showed how linear and non-linear models can be applied to both classification and regression problems, relying on a hyper-parameter $\gamma$ to modulate the complexity of the hypothesis space.  The resulting model space can be as simple as a set of linear models or as sophisticated as one including high-order polynomial models.  An SVM relies on only a few support vectors and therefore has far fewer effective parameters than comparable models, which gives it extremely robust generalization properties.
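
A minimal sketch of this complexity knob (assuming scikit-learn, whose RBF-kernel SVC exposes $\gamma$ directly as gamma; the moons data set is just a stand-in toy problem):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # A small, non-linearly-separable toy problem.
    X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

    # Small gamma -> nearly linear boundary; large gamma -> highly flexible boundary.
    for gamma in [0.01, 1.0, 100.0]:
        clf = SVC(kernel="rbf", gamma=gamma, C=1.0)
        cv_acc = cross_val_score(clf, X, y, cv=5).mean()    # CV estimate of E_out
        n_sv = clf.fit(X, y).support_vectors_.shape[0]      # model "size" in support vectors
        print(f"gamma={gamma:6}: CV accuracy={cv_acc:.2f}, support vectors={n_sv}")

Note how a very large gamma drives up the number of support vectors, i.e., the effective complexity, and typically hurts the cross-validated accuracy.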

Bayesian and Graphical Models


We have covered a wide range of learning theory for ML in this note; nevertheless, there is an enormous amount of ML knowledge yet to be studied on our own.  For instance, we have not covered ensemble methods, where multiple ML models are pooled together to form a decision committee.  There are examples where a high-quality committee can be formed from rather crude individual models.  For this reason, people generally do not compare non-ensemble methods to ensemble methods directly [1].  We should read about additional ML topics, and one good resource is the excellent Machine Learning Techniques lectures [2] (in Chinese, though).
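
As a quick, hedged illustration of the committee idea (assuming scikit-learn; a random forest, i.e., bagged shallow trees, is used here as the committee, rather than any specific recipe from the lectures):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # A toy classification problem: 500 records, 20 features.
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    # A single crude model: a shallow depth-3 decision tree.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    print("single tree:", cross_val_score(tree, X, y, cv=5).mean())

    # A committee of 200 equally shallow trees, each grown on a bootstrap sample.
    forest = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
    print("committee  :", cross_val_score(forest, X, y, cv=5).mean())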

We also did not cover graphical models, or cases where the input is not a well-structured feature matrix $X$.  For example, when ${\rm Xbox\ Kinect}^{TM}$ classifies a player’s movement into either a kick or a jump, the input is a series of images (or poses), with a varying number of frames serving as the input.  Using a hidden Markov model to describe the transitions between poses is more effective [3].  For problems built on complex inputs, such as computer vision, voice recognition, and natural language processing, techniques such as SVM no longer work well.  So one future direction is to tackle the learning problem from the Bayesian or graphical-model approach.  In the regularization discussion, we mentioned how the Bayesian approach sidesteps the over-fitting topic and replaces it with a prior belief: ridge regularization is basically a Gaussian prior and lasso regularization is basically a Laplace prior.  To proceed along this direction of Bayesian learning, a good book is "Pattern Recognition and Machine Learning" by Christopher Bishop.  To study graphical models, a good but challenging online course is the “Probabilistic Graphical Models” Coursera series by Daphne Koller (although I do not recommend the accompanying book; use the Bishop book instead).
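
The prior-penalty correspondence can be made explicit with a standard MAP argument (restated here as a reminder, not part of the earlier regularization chapter).  For a likelihood $p(\mathcal{D}|w)$ and prior $p(w)$, the MAP estimate is

$$\hat{w}_{MAP} = \arg\max_{w}\; \left[\, \ln p(\mathcal{D}|w) + \ln p(w) \,\right],$$

so a Gaussian prior $p(w) \propto \exp\left(-\frac{\|w\|_2^2}{2\sigma^2}\right)$ contributes a ridge penalty $\lambda \|w\|_2^2$, while a Laplace prior $p(w) \propto \exp\left(-\frac{\|w\|_1}{b}\right)$ contributes a lasso penalty $\lambda \|w\|_1$, with $\lambda$ determined by the prior width.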

Figure 6.2.  An action consists of a series of states described by a hidden Markov model (HMM).  In this example, a low-kick action consists of three state transitions: from stand to foot-lifted, to kick, and then back to stand.  At each time point, the body position $X(t)$ is observed as a sample from its underlying state $S(t)$.  The goal is to predict the action label $A$ (low-kick, high-kick, sit, jump, etc.) based on a time series of observations $X(t)$.  One would need to maximize the likelihood: $p(X|A=k) = \sum_{S_k(t_0), S_k(t_1), \ldots} \; \left\{ p(S_k(t_0))\,p(X(t_0)|S_k(t_0)) \prod_{i=1}^{N} \; p(S_k(t_i)|S_k(t_{i-1}))\,p(X(t_i)|S_k(t_i)) \right\}$.
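
Summing over all state sequences naively is exponential in the number of frames; the forward algorithm computes the same likelihood in time linear in the sequence length.  A minimal numpy sketch, with hypothetical transition and emission matrices for one action class (made up for illustration, not taken from [3]):

    import numpy as np

    def hmm_likelihood(pi, A, B, obs):
        # Forward algorithm: p(observations | HMM) for discrete observations.
        #   pi  : (S,)   initial state probabilities p(S(t0))
        #   A   : (S, S) transition matrix, A[i, j] = p(next state j | current state i)
        #   B   : (S, O) emission matrix,  B[i, o] = p(observation o | state i)
        alpha = pi * B[:, obs[0]]             # joint p(X(t0), S(t0))
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]     # propagate one time step
        return alpha.sum()                    # marginalize over the final state

    # Hypothetical 3-state "low-kick" model: stand -> foot-lifted -> kick -> stand.
    pi = np.array([1.0, 0.0, 0.0])
    A = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    B = np.array([[0.8, 0.1, 0.1],   # each state mostly emits its own pose index
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

    print(hmm_likelihood(pi, A, B, obs=[0, 1, 2, 0]))

In practice, one would fit one HMM per action class and classify a new clip by the class whose model assigns the highest likelihood.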


Deep Learning for Feature Learning


The learning theory above also depends heavily on the quality of the features in the $X$ matrix.  Oftentimes, informative features are very hard to obtain.  In cellular imaging, one typically has to carry out a challenging segmentation task (a computer vision task) to identify individual cells, then apply programs such as CellProfiler [4] to compute several hundred features for each cell.  These are custom-designed features describing the size, intensity, texture, and morphology of cells.  Some well-known image features, such as the Scale-Invariant Feature Transform (SIFT) [5], required decades of hard research and still may not be very effective.  Feature learning, i.e., having the machine automatically construct informative feature vectors, is therefore a very useful area of research.  Feature learning can also be framed as having the machine find a mapping function $f: X \to Z$.  In this regard, Deep Learning (DL) has a great advantage; in fact, some experts consider DL equivalent to feature learning, because its Convolutional Neural Network (CNN) is particularly adept at extracting a hierarchical set of features ranging from rudimentary edges and corners to complex elements like wheels and faces.  So another future path is Deep Learning.  To that end, a good book is the DLB, and a good course is probably “Neural Networks for Machine Learning” by the DL godfather Geoffrey Hinton on Coursera.
Figure 6.3. Deep Learning can automatically learn features, without experts going through a tedious manual feature-construction process.  Here a DL network extracts lower-level features such as edges and corners, then gradually builds up higher-level features representing eyes, noses, and finally faces. [https://s3.amazonaws.com/datarobotblog/images/deepLearningIntro/013.png]
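
A minimal sketch of such a feature hierarchy (assuming PyTorch; this is an untrained toy network, shown only to make the idea of stacked, increasingly abstract feature maps concrete):

    import torch
    import torch.nn as nn

    # Three convolution + pooling stages: each maps the previous feature maps
    # to a new, more abstract set of feature maps on a coarser spatial grid.
    stages = nn.ModuleList([
        nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        nn.Sequential(nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        nn.Sequential(nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    ])

    x = torch.randn(1, 3, 64, 64)     # one 64x64 RGB image (random stand-in)
    for i, stage in enumerate(stages, 1):
        x = stage(x)
        print(f"stage {i}: feature maps {tuple(x.shape)}")
    # After training, stage 1 tends to respond to edges/corners, stage 2 to parts,
    # and stage 3 to object-level structure; x can then be flattened into Z for a classifier.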

Data Engineering: Big Data and Cloud


The learning theory we covered here obviously places a very strong emphasis on over-fitting: “the ability to deal with over-fitting is what separates professionals from amateurs in the field of learning from data,” according to LFD [6].  Based on the learning curve and VC theory, over-fitting is no longer a concern if one has lots of data.  When $N$ is 100 or 1000 times larger than the number of model parameters, $E_{in}$ and $E_{out}$ will be very close.  However, keep in mind that although we live in a big data era, only some problems have the luxury of big training data.  The majority of the problems we deal with as data scientists outside the Internet-data companies (Facebook, LinkedIn, Google, Amazon, etc.) are still very much under the constraint of limited $N$, and all the theory we have discussed in this note remains crucial.  The word “big” in big data can be very misleading.  In biology, a next-generation sequencing (NGS) data set can be a few GB, yet it is not big, due to its lack of complexity.  Such a data file can be post-processed into sequence counts for the roughly twenty thousand human genes.  If we treat each NGS sample as a record and each of the twenty thousand genes as a feature, one thousand NGS samples (terabytes of raw data) still do not make this a big data problem.  You can image a cat at enormous resolution and thereby generate a huge picture file, but the size of the file does not make this big data.  Big data is not just about the volume of data.
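
The learning-curve claim can be checked empirically.  A sketch assuming scikit-learn (the in-sample and cross-validated error of a fixed model converge as $N$ grows):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    # A fixed ~20-parameter model trained on progressively larger samples.
    X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
    sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.array([100, 300, 1000, 3000, 10000]), cv=5)

    for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
        # The gap between in-sample and estimated out-of-sample error shrinks with N.
        print(f"N={n:6d}  E_in~{1 - tr:.3f}  E_out~{1 - te:.3f}  gap~{tr - te:.3f}")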

Nevertheless, let us pretend for a moment that we are data scientists at Google, trying to solve a real big data problem with essentially infinite $N$ in our hands; our worry is then no longer over-fitting.  We have to worry about how to store our training data (a big data storage problem), how to carry out the computation to find the optimal hypothesis in $\mathcal H$ (a big data computation problem), and how to retrieve particular future records efficiently for prediction (a big data transaction problem).  ML then becomes a real engineering challenge, and we rely more and more on data engineering technologies and less and less on data science knowledge.  Infinite $N$ means the variance term in $E_{out}$ is no longer the issue; the performance bottleneck is the bias term.  To reduce bias, one should introduce more sophisticated models.  This is why DL, with tens of thousands, millions, or even billions of model parameters, is only possible when, first, there is a nearly infinite amount of training data and, second, there is a vast computational infrastructure, enabled by both big data technology and GPU hardware.  Therefore, we see that big data has strong ties to Deep Learning and to data engineering.  Please bear in mind that when DL is applied to small data sets, jargon such as regularization and over-fitting is mentioned again and again in the DLB; so everything we have learned in this note will not be wasted in your DL journey.
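
For squared-error regression with a noiseless target, the bias-variance decomposition makes this precise (standard material, restated here; $\bar{g} = \mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}]$ is the hypothesis averaged over training sets):

$$\mathbb{E}_{\mathcal{D}}\left[E_{out}\left(g^{(\mathcal{D})}\right)\right] = \underbrace{\mathbb{E}_{x}\left[\left(\bar{g}(x)-f(x)\right)^2\right]}_{\text{bias}} + \underbrace{\mathbb{E}_{x}\,\mathbb{E}_{\mathcal{D}}\left[\left(g^{(\mathcal{D})}(x)-\bar{g}(x)\right)^2\right]}_{\text{variance}}.$$

As $N \to \infty$ the variance term vanishes, leaving bias, i.e., the expressiveness of the model, as the only remaining lever.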

Data engineers study the three big data problems raised above: storage, computation, and transaction.  Due to the engineering nature of these challenges, the result is not a set of new theories but a set of technical solutions, or more precisely, many alternative technical systems, where the industry standard is practically determined by whichever has the dominant user base.  The answer to big data storage originated from the Hadoop Distributed File System (HDFS), an open-source implementation of the Google File System (GFS).  The answer to big computation was also first proposed by Google: its MapReduce (MR) framework.  The answer to big data transaction is the numerous NoSQL systems (Google's Bigtable is a pioneer).  In the early days, one had to set up a local big data infrastructure from these components, fortunately mostly open source and free.  Nowadays, we are more likely to focus on their utility and let Cloud providers such as Amazon Web Services (AWS) take care of setting up the infrastructure for us.  In this respect, big data and the Cloud do go hand-in-hand.  Amazon's solution to big storage is S3, to big computation is EC2 clusters or EMR, to big transaction is DynamoDB, etc.  So our third future development path is data engineering/big data engineering/cloud engineering; I consider them nearly synonyms due to their tight connections.  They emphasize the programming and automation aspects, compared to data science's emphasis on statistical skill sets.  In reality, people tend to confuse data science with data engineering; hopefully there will be more clarification as the fields develop further.
Figure 6.4.  As $N$ grows to $\infty$, over-fitting is no longer a concern and $E_{out}$ basically equals $E_{in}$.  Our focus then switches from data science topics to practical engineering challenges: big data storage, computation, and transaction.  Currently Amazon Web Services (AWS) is the most popular big-data platform in the Cloud, so that we do not have to set up such an environment on our own.
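
To make the MR programming model concrete, here is a toy, single-machine word-count sketch in plain Python (the classic MapReduce example; it illustrates only the map/shuffle/reduce pattern, not a distributed implementation):

    from collections import defaultdict

    def map_phase(doc):
        # map: emit (key, value) pairs; here, (word, 1) for every word in a document.
        for word in doc.lower().split():
            yield word, 1

    def reduce_phase(key, values):
        # reduce: combine all values emitted for one key; here, sum the counts.
        return key, sum(values)

    docs = ["big data is not just volume",
            "big data and cloud go hand in hand"]

    # shuffle: group intermediate pairs by key (handled by the framework in real MR).
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_phase(doc):
            groups[key].append(value)

    counts = dict(reduce_phase(k, v) for k, v in groups.items())
    print(counts["big"], counts["data"])   # 2 2

In a real MR job, HDFS supplies the distributed storage for the inputs and outputs, and the framework schedules the map and reduce phases across many machines.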

Summary


In this note, we focused on how a machine can select the best hypothesis from a hypothesis set $\mathcal H$, given a limited number $N$ of training records $X$.  We need to take into account both the within-sample error $E_{in}$ and the generalization error (determined by the VC dimension) to avoid under-fitting and, more importantly, over-fitting.  An effective way to determine the optimal balance of the two error terms is hyper-parameter selection based on cross validation.  The Support Vector Machine is a huge success of machine learning theory, combining all of these concepts.  The Gaussian kernel provides a continuous hypothesis space ranging from linear to high-order polynomial hypotheses, while still preserving decent generalization capability.  Once we have a solid theoretical foundation, we are ready to understand many machine learning techniques.  An excellent online course is “Machine Learning Techniques” on YouTube [2].


Reference

  1. DLB, page 251: Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.
  2. Machine Learning Techniques video lectures (in Chinese) by Hsuan-Tien Lin.
    https://www.youtube.com/watch?v=A-GxGCCAIrg&index=1&list=PLXVfgk9fNX2IQOYPmqjqWsNUFl2kpk1U2
  3. Probabilistic Graphical Models, Coursera online course by Daphne Koller.  Part III. PGM Programming Assignment: Learning with Incomplete Data.
  4. http://cellprofiler.org
  5. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  6. LFD, page 119
