Monday, May 22, 2017

Chapter 1. What is Machine Learning?

Introduction


There is no question that we are living in the middle of a data revolution, and data science is reshaping the way we live and work.  According to the Gartner hype cycle analysis (Figure 1.1) [1], machine learning is at roughly the apex of its hype cycle.  This makes it a particularly important time for data scientists to look beyond the face value of various machine learning tools and understand the core components that are likely to outlast whatever bumps lie ahead before the field reaches maturity.

Figure 1.1.  Gartner Hype Cycle for Emerging Technologies, where Machine Learning is considered to be at the apex of its hype cycle.

Remember the early days of the Internet hype cycle, when nearly any web company could attract loads of venture capital.  A programmer who believed that working in the Internet sector was all about learning HTML and web site design, or about PHP scripting, or about email spamming and anti-spamming, did not keep that job for long.  Which technologies developed during a hype cycle will survive the test of time is extremely hard to predict; for example, few of us expected JavaScript to emerge as a winner of web programming.  Fortunately, regardless of which Internet programming technology won and which lost, those with solid computer science training had no problem adapting to the new Internet-based programming landscape, because the science behind web programming has remained largely unchanged over the past decade.  Our heavy reliance on expertise in data structures, computer algorithms, and parallel computing theory has not lessened a bit.  These core foundational theories will remain strong for the foreseeable future.  By the same logic, the best way to prepare for the data science era is to learn the “science” within data science, as the science is the part that will stay, and it will eventually determine how fast one can navigate the turbulent seas of data science.  As Michael Nielsen elegantly put it at the beginning of his open book on neural networks: "although learning through using some hot software libraries has an immediate problem-solving payoff, we want durable and lasting insights that will still be relevant years from now. Technologies come and technologies go, but insight is forever." (rephrased here) [2]

This note is not about the usage of specific technologies or tools, such as deep learning, support vector machines, random forests, etc.; it is about why machine learning is feasible and what principles data scientists should follow in order to enable our machines to successfully “learn from data”.  That is, we focus on theory, not on technology.

It is also important to understand that this is a note about “machine learning”, not about “artificial intelligence” or “deep learning”.  All three are buzzwords these days, together with other high-frequency and often confusing terms such as “big data” and “cloud computing”.  Nobody can define these terms precisely, and citing politically-correct Wikipedia definitions does not really help, so we do not attempt to define them here.  However, we should be able to lay out our own interpretation, one that can reduce, if not eliminate, ambiguity when we exchange machine learning ideas among ourselves.  NVIDIA has a nice blog article explaining the difference among the three concepts well [3] (Figure 1.2), which is in line with the explanation given by some well-known professionals [4].

Figure 1.2. NVIDIA’s explanation of AI, ML, and DL.  The example for AI is chess playing: IBM’s Deep Blue defeated the human champion Kasparov in 1997.  The example for ML is email SPAM detection, where a Naïve Bayesian classifier is often used to classify emails into SPAM (junk emails) and HAM (valuable emails).  The example for DL is Google’s accomplishment in detecting cats from video with 75% accuracy back in 2012, a project Andrew Ng participated in.  Andrew Ng is well known for his extremely popular online course "Machine Learning" on Coursera.

Artificial intelligence (AI) emphasizes machines acquiring human-like intelligence, machines that can pass the Turing test (such as the robot in the movie Ex Machina or R2D2 in Star Wars) (Figure 1.3).  Although this is not entirely science fiction, it currently remains largely in the future tense.  Many impressive artificial intelligence examples may not even be driven by the machine learning (ML) algorithms that we typically practice; one valid path is achieving AI through a knowledge base, built on rules discovered by humans rather than by the machine itself.  For example, the Mathematica software can do extremely impressive symbolic computations, including difficult calculus (Figure 1.4); however, it achieves this by following mathematical rules coded by us, and those rules were not discovered by the machine through learning from many calculus examples [5].  So AI is bigger than machine learning (ML).  The term AI is often reserved to specifically emphasize "intelligence"; thus we normally do not call an ML technique such as one-nearest neighbor an AI, as it simply performs similarity lookup.  While AI is largely the future, ML is a reality: a scientific framework has emerged after decades of research, and that is why we can now use the term “Data Science”, which mostly refers to the science of ML.

Figure 1.3. Human-like AI robots: the robot in the movie Ex Machina, and R2D2 in Star Wars.
Figure 1.4. Although Mathematica can perform very impressive symbolic operations, it does so based on existing mathematical rules, not on strategies it discovered by itself through learning from many examples; therefore, it achieves AI not through ML.

Over the past decades, many ML techniques have contributed to the development of learning theory, including neural networks, random forests, support vector machines, and deep learning.  This is why we say deep learning (DL) is one of the many ML techniques out there; it is a subset, or a special case, of ML: DL $\subset$ ML $\subset$ AI.  However, DL is a very powerful one, because it has outperformed other ML techniques in many applications, especially those where traditional ML techniques failed miserably.  DL has rekindled hopes that ML can match or even surpass human intelligence in some areas in the near future.  DL is one of several key technologies behind AlphaGo.  AlphaGo’s defeat of Lee Se-dol caught most of us by surprise, but the victory of “Master” (AlphaGo 2.0) over Ke Jie in late May was highly anticipated.  As DL is taking the lead in some of the most impactful AI applications, such as Tesla’s self-driving cars, it is no wonder that for journalists DL is everything: it is AI, it is ML.  All three terms are being used interchangeably, as most news about AI is about DL, the most shining piece of ML technology.

Is Machine Learning Really a Hype?


DL is the source of the recent AI shock waves and gives people an over-optimistic feeling that AI is becoming a reality.  This explains why Gartner placed ML at the top of the hype cycle; Gartner here most likely mistakenly equates ML with DL and with AI.  The recent success of DL in AlphaGo and Tesla's self-driving system sets a high expectation for AI.  On one hand, thanks to DL, AI has never been so close to our lives.  The general population hopes the success of AlphaGo can be easily replicated in other AI domains, and all the heated debates on the sociological impact of AI make AI appear to be something ready for prime time at any moment.  This over-optimistic view of AI is the root of the hype: if ML is a hype, it is actually the AI hype.  On the other hand, even among professionals, the success of DL creates a misconception that DL is the best ML technology and is good for all ML and AI applications.  We will discuss in Chapter 6 why this is not true.  If ML is a hype, it is actually the DL hype.

ML itself, as we will learn in this note, is a mature science.  ML has solid theories, some of which will be discussed in this note; it is the science of data science.  In the past two decades, ML has had numerous proven applications in many research fields, including computer vision, bioinformatics, and cheminformatics, and it is still on a solid path of further expansion.  So the ML we understand is many things, but not a hype.  What we will learn about ML here is going to last!

Figure 1.5.  For data scientists, ML is a mature field (left).  For others, ML is often mistakenly equated with DL or AI (right).  DL is still under heavy development and AI largely remains a dream.  When the expectations for DL and AI run high, hype arises.

What is Learning?


When a bank-employed programmer writes a credit card review application that only approves applicants with a salary higher than $30k, that computer program merely “automates” the salary-checking rule without learning.  The approval criterion, which is the intelligent piece, is provided by a human.  Computer automation has improved our efficiency and eliminated many jobs, such as cashiers, where not much intelligence is required.  Many jobs were unaffected as computers became widespread, because they require human intelligence, and automation does not guarantee intelligence.  If the bank instead feeds a computer program 1000 past approved applications made by its best money-making credit reviewer, together with the outcome of whether the bank made or lost money from those approvals, and the program then somehow updates its internal parameters and becomes able to review future applications and make decisions very similar to those of that star reviewer, learning definitely happened.  Based on this example, ML means machines developing the decision rules themselves, driven by given examples, without rules supplied by humans.  The criterion for successful ML is its prediction performance on future input data (Figure 1.6).
Figure 1.6.  Machine learning means a machine capable of learning from training data by itself.  The resultant machine is able to make predictions on unseen future test data.
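To make the contrast concrete, here is a minimal sketch (with entirely made-up features, thresholds, and a synthetic stand-in for the 1000 past applications) of the difference between automating a human-given rule and letting a model learn its own criterion from labeled outcomes.  The use of scikit-learn's logistic regression is just one convenient choice for illustration, not the bank's actual method.

```python
# Hypothetical illustration: hard-coded rule vs. a rule learned from past outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Computer automation: the intelligence (the $30k threshold) comes from a human.
def human_rule(salary):
    return 1 if salary > 30_000 else -1

# Synthetic stand-in for 1000 past applications reviewed by the star reviewer:
# features = [salary, existing debt]; label = +1 (bank made money) / -1 (bank lost money).
salary = rng.uniform(15_000, 120_000, size=1000)
debt = rng.uniform(0, 60_000, size=1000)
X = np.column_stack([salary, debt])
y = np.where(salary - 0.8 * debt > 25_000, 1, -1)   # the unknown "target" behind the outcomes

# Machine learning: the program adjusts its internal parameters from the examples.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Review a future, unseen application both ways.
print(human_rule(55_000), model.predict([[55_000, 40_000]]))
```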

When a computer (machine) can learn and gain some intelligence, it will reshape the job market in a much more impactful manner.  In this example, not only can the bank review more applications faster and more accurately, but the machine will also take over the job of mediocre-performing credit reviewers in the bank and in the banking industry generally; I believe this is probably already true.  Many jobs we perform nowadays require only marginal intelligence; therefore, data science/ML will significantly reshape the job market.  According to McKinsey’s analysis, “currently demonstrated technologies could automate 45 percent of the activities people are paid to perform and that about 60 percent of all occupations could see 30 percent or more of their constituent activities automated, again with technologies available today” [6].

The credit card approval example is a classic textbook machine learning problem called classification (classify an application into an approve or deny action), and this is the type of problem we will use to explain the ML theories.  In fact, we will study such two-label classification problems most of the time.  Such a classification problem might appear special, but it is nevertheless the one most often encountered in real applications.  Even self-driving cars have to carry out many classification decisions, such as whether the object ahead is a pedestrian, whether the object on the side is a road sign, and then whether it is a stop sign or a speed-limit sign, etc.  Often a classification task trivial to a human can be ridiculously difficult for machines, e.g., deciding whether the object on the road ahead is a rock or a brown paper bag! [7]  So do not underestimate the challenge ahead.

Learning Resource


This note is heavily based on a few resources I found extremely useful during my “human” learning of the “machine learning” subject.  In particular, this note should be credited to the book “Learning from Data”, including the online Caltech lecture taught by one of the book’s authors, Yaser Abu-Mostafa.  The selection of topics and the clarity of explanations are superb in every way.  The online lecture “Machine Learning Foundations” taught by another co-author, Hsuan-Tien Lin, is more elaborate and enlightening (in Chinese, though).  Since we may cite various book/video sources, let us list a few of them and label them with acronyms:

The books are:

I recommend DSB as an introductory book; it helps orient you to some ML concepts and popular ML techniques.  KDS is an intermediate book, with a bit more math but not too heavy; it uses SVM to touch the surface of ML theories, but is not as systematic as LFD.  LFD is an excellent book, on which this note is based.  LFD may be a bit heavy on math in some places; however, the online videos help a lot.  DLB is destined to be a classic book on DL; I am still in the process of reading it and might quote some small parts.

The two online videos are:

The Learning Problem


Let us formulate the machine learning problem to be studied in this note [8].

Figure 1.7 Formulation of the learning problem.
The data source generates data records $(\mathbf{x}, y)$ according to a target function $f(y \vert \mathbf{x})$.  Let us consider the simple case where $\mathbf{x}$ is a $d$-dimensional vector $(x_1, x_2, \ldots, x_d)$, i.e., $\mathbf{x} \in \mathbb{R}^d$, and $y$ in this note is either a binary class variable in $\{-1, +1\}$ (for classification problems) or a real number in $\mathbb{R}$ (for regression problems).  $\mathbf{x}$ follows an unknown but fixed distribution $p(\mathbf{x})$, from which both the training data set and the test data set are sampled.  The exact form of $p(\mathbf{x})$ does not matter, as long as training and test samples share the same $p(\mathbf{x})$.  The hidden process samples $N$ training data points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ according to $p(\mathbf{x})$, then generates outputs $y_1, y_2, \ldots, y_N$ according to the target function $f(y \vert \mathbf{x})$.  That is, the training data set $\mathcal{D}$ contains $N$ examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)$.  The machine is supposed to use these $N$ input points to make its best guess of $f$.  How?
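As an illustration, here is a minimal sketch of the data-generating process in Figure 1.7, with an arbitrarily chosen $p(\mathbf{x})$ and a made-up noiseless target $f$ (both purely hypothetical): training and test inputs are drawn from the same fixed distribution, and the target function supplies the labels.

```python
# Toy data source: x ~ p(x), y = f(x); training and test sets share the same p(x).
import numpy as np

rng = np.random.default_rng(1)
d, N = 2, 100                          # input dimension and training set size

def sample_x(n):
    """The unknown but fixed input distribution p(x) (a standard normal here)."""
    return rng.normal(size=(n, d))

def target_f(X):
    """A hypothetical target f: a linear separator we pretend not to know."""
    return np.sign(1.5 * X[:, 0] - X[:, 1] + 0.2)

X_train = sample_x(N)                  # the hidden process draws N training points ...
y_train = target_f(X_train)            # ... and labels them with the target f
X_test = sample_x(1000)                # test inputs come from the same p(x)
y_test = target_f(X_test)              # test labels are used only to evaluate the final g
```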

Before seeing any training data, the computer decides to model the unknown $f$ with a collection of hypotheses $\mathcal{H}$; e.g., $\mathcal{H}$ can contain all possible linear class-separation hyperplanes as hypothesis candidates: $y = w_0 + w_1x_1 + w_2x_2 + \ldots + w_dx_d$.  Each hypothesis here contains $d+1$ parameters $(w_0, w_1, \ldots, w_d)$, and each $\mathbf{w}$ vector corresponds to one hypothesis $h_{\mathbf{w}}$ in $\mathcal{H}$; $\mathcal{H}$ here contains an infinite number of hypotheses.  Given the $N$ input points, we apply a learning algorithm $\mathcal{A}$, which picks one hypothesis $g$ within $\mathcal{H}$ to be the best guess of $f$ (picking $g$ basically means picking a specific value of $\mathbf{w}$).  $g$ is typically identified by minimizing a function that includes an error measure $e(\mathcal{D}, h)$, using some optimization algorithm.  How well does our $g$ truly model $f$?  This should not be judged by the error measured within $\mathcal{D}$, but by its performance in future applications; otherwise, a $g$ that memorizes the $N$ answers would be the best one.  To evaluate $g$, we sample an independent test data set $\mathcal{D}^\prime$ that the computer has never seen, on which the error of $g$ in mimicking $f$ is measured.  Notice that human knowledge is incorporated in the choice of both $\mathcal{H}$ and $\mathcal{A}$, although we do not define $g$ ourselves.  The exact $g$ is learned from $\mathcal{D}$ without human intervention, i.e., the best model is learned from data, not provided by a human; that is where learning happens.
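The following sketch puts this workflow together under toy assumptions: $\mathcal{H}$ is the set of linear separators $h_{\mathbf{w}}$, $\mathcal{A}$ is the classic perceptron learning algorithm (one possible choice, not the only one), and the final hypothesis $g$ is evaluated on an independent test set $\mathcal{D}^\prime$ rather than on $\mathcal{D}$.  The data come from the same made-up target used in the sketch above.

```python
# Toy learning workflow: H = linear separators, A = perceptron updates, evaluate g on D'.
import numpy as np

rng = np.random.default_rng(2)
d, N = 2, 100
X = rng.normal(size=(N, d))
y = np.sign(1.5 * X[:, 0] - X[:, 1] + 0.2)          # same hypothetical target as above

X1 = np.hstack([np.ones((N, 1)), X])                # prepend the dummy x0 = 1
w = np.zeros(d + 1)                                 # one w vector = one hypothesis h_w

for _ in range(1000):                               # learning algorithm A (perceptron updates)
    mis = np.where(np.sign(X1 @ w) != y)[0]
    if mis.size == 0:
        break                                       # no training errors left: this w is our g
    i = rng.choice(mis)
    w += y[i] * X1[i]                               # nudge w toward the misclassified point

# Evaluate g on an independent test set D' drawn from the same p(x).
X_test = rng.normal(size=(1000, d))
y_test = np.sign(1.5 * X_test[:, 0] - X_test[:, 1] + 0.2)
X1_test = np.hstack([np.ones((1000, 1)), X_test])
test_error = np.mean(np.sign(X1_test @ w) != y_test)
print(f"out-of-sample error estimate: {test_error:.3f}")
```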

A few notations used in this note: $\mathbf{x}$ refers to a $d$-dimensional input data point.  We typically add $1$ as its first element for the convenience of linear modeling.  $X$ is the matrix of all $N$ input data points, and $\mathbf{y}$ is the vector of corresponding response variables for the $N$ data points.
$$\begin{equation}
\mathbf{x} = \begin{bmatrix} 1 \; x_1 \; x_2 \; \ldots \; x_d \end{bmatrix}^\intercal \end{equation}$$
$$\begin{equation}
X = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1d} \\
1 & x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nd} \\
\end{bmatrix}
\end{equation}$$
$$\begin{equation}
X = [ \mathbf{x}_1 \; \mathbf{x}_2 \; \ldots \; \mathbf{x}_N \; ]^\intercal
\end{equation}$$
$$\begin{equation}
\mathbf{y} = \begin{bmatrix} y_1 \; y_2 \; \ldots \; y_N \end{bmatrix}^\intercal
\end{equation}$$

Let us further assume each feature dimension, except the first dummy dimension $x_0 = 1$, has already been standardized, i.e., each is zero-centered and scaled to unit variance.  This assumption will help us understand the popular regularization forms to be introduced in Chapter 2:
$$\begin{equation} \sum_{i=1}^N \; {x}_{ij}=0, \; {\rm for} \; j = 1,2,\ldots ,d, \end{equation}$$
$$\begin{equation} \frac{1}{N} \sum_{i=1}^{N} \; {({x}_{ij}-\bar{x}_{j})}^2=1. \end{equation}$$
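A small NumPy sketch of this standardization step (with a random matrix standing in for real features) shows the prepended dummy column and the zero-mean, unit-variance feature columns:

```python
# Standardize each feature column, then prepend the dummy dimension x0 = 1.
import numpy as np

rng = np.random.default_rng(3)
N, d = 200, 3
X_raw = rng.normal(loc=5.0, scale=2.0, size=(N, d))        # un-standardized features

X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)   # zero-center and scale each column
X = np.hstack([np.ones((N, 1)), X_std])                    # prepend the dummy column x0 = 1

print(X[:, 1:].mean(axis=0).round(6), X[:, 1:].std(axis=0).round(6))  # ~0 and ~1 per feature
```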

In this note, we will study different hypothesis sets $\mathcal{H}$, containing linear and non-linear hypotheses, alone or in combination.  We will study different learning algorithms $\mathcal{A}$, including one-step analytical solutions and multi-step gradient descent optimization.  The error measure could be accuracy (for classification problems), cross-entropy error (for classification with probability), or least-squares error (for regression problems).  Although we only study the specific machine learning problem depicted in Figure 1.7, the learning workflow contains enough variations that it can lead to a very rich manifestation of different machine learning applications.
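To illustrate the two kinds of learning algorithms mentioned above, here is a toy sketch comparing the one-step analytical least-squares solution with multi-step gradient descent on the same regression data; both should recover nearly the same $\mathbf{w}$.  The data, noise level, and learning rate are arbitrary choices made only for illustration.

```python
# One-step analytical solution vs. multi-step gradient descent for least-squares regression.
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # dummy column plus d features
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=N)                  # noisy linear target (regression)

# One-step analytical solution: solve the normal equations (X^T X) w = X^T y.
w_analytic = np.linalg.solve(X.T @ X, X.T @ y)

# Multi-step gradient descent on the same least-squares error.
w_gd = np.zeros(d + 1)
eta = 0.1                                                  # learning rate (arbitrary choice)
for _ in range(500):
    grad = 2.0 / N * X.T @ (X @ w_gd - y)                  # gradient of the mean squared error
    w_gd -= eta * grad

print(np.round(w_analytic, 3), np.round(w_gd, 3))          # the two w's should nearly agree
```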

Reference

  1. http://www.gartner.com/newsroom/id/3412017
  2. http://neuralnetworksanddeeplearning.com/about.html
  3. https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
  4. Deep learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville. http://www.deeplearningbook.org/. p9 Figure 1.4. 
  5. https://www.quora.com/How-does-Mathematica-work
  6. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/where-machines-could-replace-humans-and-where-they-cant-yet
  7. https://www.youtube.com/watch?v=40riCqvRoMs
  8. LFD, page 30.



