Monday, February 19, 2018

Gene Expression Inference with Deep Learning

Bioinformatics Application of Deep Learning Technology

Resource


We will discuss the following paper: Chen et al., Gene expression inference with deep learning. Bioinformatics. 2016 Jun 15;32(12):1832-9.  The paper is available here.

All figures shown here should be credited to the paper.

Introduction


The main goal of deep learning (DL) (or deep neural networks - DNN) is feature learning, i.e., finding a non-linear transformation function $\mathbf{z} = \phi(\mathbf{x})$ that projects the raw input feature $\mathbf{x}$ into a new feature $\mathbf{z}$, in the hope that the feature vectors in the $Z$ space behave more linearly, so that linear regression or logistic regression performs better.  As a DNN can theoretically model any function (see the DL Introduction blog), we can train a DNN; the parameters that give better predictions arguably encode a better approximation of $\phi$.  Chen et al. used a classical fully-connected DNN (basically the black-box DNN model we imagined previously) for feature transformation, coupled with a traditional linear regression output layer.  I like this study because its design is very clean, the paper is very readable, and the conclusions are convincing.

The human genome contains about 20,000 genes.  Genes are expressed at different levels within cells.  Different cell types, different life-cycle stages, and different treatments all lead to changes in gene expression; the expression of genes controls which proteins are produced and at what level, which subsequently determines all the biological activities in and among cells.  So the expression levels of the ~20k genes form a feature vector fingerprinting the cell status.  For example, to understand the effects of 1000 compounds, we would need to treat cells with each compound and then measure 1000 gene expression profiles, one per compound.  As each expression profiling run costs about \$500, this is quite expensive.  In previous work by NIH LINCS, it was proposed that gene expression data are highly redundant; therefore, one only needs to profile the expression values of a subset of 1000 landmark genes, called L1000, and the expression values for the remaining genes can be computationally predicted.  Previously, for a given target gene $t$ to be predicted, linear regression was used and its expression was predicted as:

$$\begin{equation} y_{i(t)} = \sum_{j=1}^{1000} w_j x_j + b_i, \end{equation}$$

where $x$ are the measured expression levels of the L1000 genes.  However, in reality, gene expression is regulated by a non-linear system through complicated interactions of molecular circuits.  Therefore the dependency of $y_{i(t)}$ on the L1000 expressions must be a complicated non-linear function, and this is where DNN shines compared to traditional machine learning methods.  The authors did exactly this and showcased the power of DNN for feature transformation.

Results

D-GEX


A fully connected DNN, called D-GEX (Deep Gene EXpression), is used as illustrated in the figure below.  A few details: first, there are 943 (not 1000) landmark genes, because the study uses three different data sets and some landmark and target genes are excluded, as they are not contained in all three sets.  This also explains why there are 9520 (not ~20,000) target genes in the output, as the others do not have test data available for error/accuracy evaluation.  Second, there seem to be two DNNs; this is because the memory requirement for predicting all 9520 genes is too large for a single GPU, so two GPUs had to be used, each responsible for predicting 4760 (randomly allocated) genes.  For our purpose, we can simply consider this one fully connected DNN with 943 inputs and 9520 outputs.

The authors tested many different DNN hyperparameters, including the number of hidden layers [1, 2, 3] and the number of neurons within each layer [3000, 6000, 9000].  The network uses dropout regularization with rates [0%, 10%, 25%], momentum optimization, and Glorot weight initialization (all these concepts are explained in the Introduction blog).  These are the key ingredients, and the authors performed a full grid search of the hyperparameter space formed by the combinations of layer count, layer size, and dropout rate.
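To make the search space concrete, here is a minimal Keras sketch of how such a grid search could be set up.  This is not the authors' code: only the grid of layer counts, layer sizes, and dropout rates follows the paper; the hidden activation, optimizer settings, epochs, batch size, and the data arrays `X_train`, `Y_train`, `X_val`, `Y_val` are placeholders/assumptions.

```python
from itertools import product
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD

def build_model(n_layers, n_units, dropout_rate, n_in=943, n_out=9520):
    """A D-GEX-style fully connected network: hidden layers plus a linear output layer."""
    model = Sequential()
    for i in range(n_layers):
        if i == 0:
            # Keras Dense layers use Glorot (Xavier) initialization by default
            model.add(Dense(n_units, activation='tanh', input_shape=(n_in,)))
        else:
            model.add(Dense(n_units, activation='tanh'))
        model.add(Dropout(dropout_rate))
    model.add(Dense(n_out, activation='linear'))           # plain linear regression on top
    model.compile(optimizer=SGD(lr=0.0005, momentum=0.5),  # momentum optimization
                  loss='mse', metrics=['mae'])
    return model

best = None
for n_layers, n_units, rate in product([1, 2, 3], [3000, 6000, 9000], [0.0, 0.1, 0.25]):
    model = build_model(n_layers, n_units, rate)
    model.fit(X_train, Y_train, epochs=200, batch_size=200,
              validation_data=(X_val, Y_val), verbose=0)
    _, val_mae = model.evaluate(X_val, Y_val, verbose=0)   # mean absolute error on validation
    if best is None or val_mae < best[0]:
        best = (val_mae, n_layers, n_units, rate)
```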


About 100k gene expression samples were retrieved from the Gene Expression Omnibus (GEO) database and were split into 80-10-10 training-validation-test datasets.  So the training matrix is $943 \times 88807$, the validation matrix is $943 \times 11101$, and the test matrix is $943 \times 11101$.  The loss function is defined as the least-squares error, as this is a regression problem:

$$\begin{equation} L= \sum_{t=1}^{T} \left[ \frac{1}{N} \sum_{i=1}^{N}{\left( y_{i(t)} - \hat{y}_{i(t)} \right)}^2 \right], \end{equation}$$

where the errors between the true values $y$ and the predicted values $\hat{y}$ are calculated, averaged over all $N$ training samples within a batch, and summed over all $T = 9520$ target genes.  Typically $L$ is used to optimize the network parameters $\theta$, while one defines a separate accuracy function to measure the network performance; the latter is more interpretable and is used to calculate validation and test errors (for the purpose of cross validation or early termination).  The accuracy function typically is not suitable for parameter optimization, where the loss function has better gradient properties.  Here the error is defined intuitively as the mean absolute error:

$$\begin{equation} MAE_{(t)}= \frac{1}{N^\prime} \sum_{i=1}^{N^\prime} \left\vert \left( y_{i(t)} - \hat{y}_{i(t)} \right)\right\vert. \end{equation}$$
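In code, Equation 3 is a one-liner; a minimal numpy sketch, where `y_true` and `y_pred` are hypothetical arrays of shape (N', T) holding measured and predicted expression values:

```python
import numpy as np

def per_gene_mae(y_true, y_pred):
    """MAE_(t): mean absolute error over the N' samples, one value per target gene."""
    return np.abs(y_true - y_pred).mean(axis=0)   # shape (T,)

# The single overall error reported (e.g. 0.32) is then the average across the T genes:
# overall_error = per_gene_mae(y_true, y_pred).mean()
```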

Comparison


D-GEX was optimized and its performance under various hyperparameters is shown in the figure below.  The best performance (the smallest overall error as determined by Equation 3) was achieved with three 9000-neuron hidden layers and a 10% dropout rate, i.e., the most complicated DNN structure (D-GEX-10%-9000X3).  The error of 0.32 is significantly lower (a 15.33% improvement) than that of the traditional linear regression (LR) method (the horizontal dashed line).  (The figure is a bit misleading; its Y axis should have started from 0 instead of 0.3.)


More performance data are shown in the table below.  Even a single-hidden-layer NN performs better than LR.  Linear regression achieves an error of 0.378, and regularization (either L1 or L2) did not really boost its performance.  This is understandable: with 88807 training samples and only 944 parameters to fit for a given target gene, overfitting should not be a concern.  The k-nearest-neighbor (KNN) method, a non-parametric method, performed the worst.



In the figure below, each point represents a gene, where the X axis is its prediction error based on D-GEX and the Y axis is its prediction error based on either LR (left) or KNN (right).  We can see that for almost all target genes, D-GEX provides more accurate expression predictions.  Based on the data in Table 1, the improvement of D-GEX over LR does not seem statistically significant, a 0.04 improvement against a 0.08 standard deviation.  However, I believe the authors should have used the standard error of the mean instead, or a paired t-test based on the reduction of error per gene, as the improvement shown in the figure below is rather obviously statistically significant.


Why Does D-GEX Work Better?


We hypothesize that it is because D-GEX transforms the input features non-linearly into a set of new features corresponding to the last hidden layer (3000, 6000, or 9000 new features), and those new features behave much more linearly, therefore leading to better LR performance.  Bear in mind that the last layer of D-GEX is simply an LR method; all the magic happens before that.

To verify that the above hypothesis is indeed the reason, let us consider the two linear regressions below, one based on the raw input $\mathbf{x}$ and another based on the D-GEX-transformed feature $\mathbf{z}$ (the 10%-9000X3 configuration is used here):

$$\begin{equation} y_{i(t)} = \sum_{j=1}^{943} w_j x_j + b_i \\
y_{i(t)} = \sum_{j=1}^{9000} w_j z_j + b_i
\end{equation}$$

The authors compared the two groups of goodness-of-fit $R^2$ (after correcting the degrees of freedom) and obtained the figure below:



It shows that for all genes, the transformed features are indeed more linearly correlated with the target expression, achieving a higher adjusted $R^2$.  This clearly demonstrates the feature learning capability of DNN.
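A minimal sketch (my own, not the authors' code) of how such an adjusted $R^2$ comparison could be computed for one target gene; `x` would hold the 943 raw landmark expressions, `z` the 9000 last-hidden-layer activations of the trained D-GEX, and `y` the target gene, all hypothetical arrays here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(features, y):
    """R^2 of an ordinary least-squares fit, corrected for the degrees of freedom."""
    n, p = features.shape
    r2 = LinearRegression().fit(features, y).score(features, y)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# print(adjusted_r2(x, y))   # linear fit on the raw landmark features
# print(adjusted_r2(z, y))   # linear fit on the D-GEX-transformed features (should be higher)
```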

Inside D-GEX


A DNN is generally a black box; it works magically better in most cases, but it is extremely hard to understand how it works.  Our brain works similarly: we often do not reason by rules.  A GO player has his own intuition about why a certain game layout seems better, based on gut feeling instead of rules of reasoning.  Nevertheless, the authors made a decent attempt to look into the network.

The figure below visualizes the connections in D-GEX-10%-3000X1, where only connections with weights four or five standard deviations away from the mean were retained, colored by sign (red for positive and blue for negative).



The observation is that nearly all input features are useful for prediction (each input neuron has lines retained); however, only a few neurons in the hidden layer are important (called hubs).  These hub neurons also tend to be the ones that are useful for predicting the outputs.  Each input gene tends to be connected to very few hubs, and the same can be said of each output gene.  Biologically, this might be explained by each input gene being involved in one gene expression regulation module; the expression levels of module members are combined to determine whether the module is activated or deactivated.  The activation pattern of the key regulation modules can then control whether and how much each target gene is expressed.  The paper stopped here; however, I think it would have been interesting to take the genes associated with a certain large hub and run gene enrichment analysis to see if we can indeed associate known gene regulation functions with these hubs.  Furthermore, we could run the same enrichment analysis for target genes associated with the same hub, or even take the combination of landmark and target genes associated with the same hub.  It appears the activities of these regulation hubs tend to be more decoupled from each other; therefore they can be linearly combined for more accurate target prediction without worrying as much about non-linear interactions as before.

D-GEX achieves a best performance of 0.32; is this sufficient to say L1000+D-GEX can replace experimental whole-genome expression profiling?  Since the expression of each gene was normalized so that its standard deviation is 1.0, 0.32 roughly translates into a 32% relative error in expression prediction.  Biologists mostly care about the fold change $FC_{ij} = \frac{y_i}{y_j}$ of gene expression under two different perturbations, $i$ and $j$, which implies:

$$ \begin{equation} \frac{\delta FC_{ij}}{FC_{ij}} = \sqrt{{\left(\frac{\delta y_{i}}{y_{i}}\right)}^2 + {\left(\frac{\delta y_{j}}{y_{j}}\right)}^2 } = 45\%. \end{equation}$$

This accuracy (a 45% relative error in fold change) is not very high, but could be decent for many biological studies.

Discussions

More Realistic Applications


The real L1000 prediction problem is actually more challenging than what is described above.  The reason is that the largest training data set comes from data acquired with microarrays in the GEO database, where gene expression for both landmark and target genes is available from published studies.  Our real application, however, is to predict target gene expression on a non-microarray platform, where only the expression of the landmark genes is measured via a different technology - Next Generation Sequencing (NGS) (microarray is a much less used expression profiling platform nowadays).  Although microarray and NGS are supposed to measure the same underlying expression, due to the differences between the two technologies, the measurements are correlated yet not always consistent.  Therefore, the prediction performance of a D-GEX trained on the GEO microarray dataset but applied to landmark-gene input data acquired with NGS would be much lower.

The authors performed another study using the GEO dataset for training but using 2921 samples from an NGS dataset, the GTEx expression data, as the validation and test set, where data are available for both landmark and target genes (the paper does not state whether GTEx was used as the validation set; let us just assume so).  Indeed the performance got much worse, and the best DNN configuration is D-GEX-25%-9000X2, as shown in the table below.  This implies roughly a 62% relative error in fold-change prediction, so it becomes questionable whether the in silico predictions for the real L1000 setting are sufficiently accurate.

The figure below shows how an increased dropout rate reduces the error by regularizing the training more aggressively.






Two observations from the table above: (1) The optimal dropout rate increases from 10% to 25%.  The explanation is that the expression distributions of the training set and the test set are now noticeably different; therefore, the influence of the training data should be attenuated, which requires more aggressive regularization to prevent overfitting, hence the higher dropout rate. (2) At the individual gene level, D-GEX still outperforms LR and KNN; however, there are certain target genes where KNN does a much better job (see the right scatter plot below).



This is because KNN is a similarity-based method; it can only use the NGS data itself as the training data to predict the NGS target data.  D-GEX has to use microarray data as the training data to predict NGS target data; therefore, it tackles a much harder problem.  Nevertheless, it is still very impressive to see D-GEX perform better than LR for over 81% of genes, and better than KNN for over 96% of genes.


Comparison with SVM Regression

This study compared D-GEX to LR, a linear method.  It would have been even more interesting if the authors had compared it to traditional non-linear regression methods, such as SVM regression, to see whether the advantage of D-GEX still holds.  Compared with SVM, which relies on a particular kernel, i.e., a fixed form of non-linear transformation, DNN can theoretically identify better non-linear transformations.  I admit it would be hard to conduct SVM regression on this large dataset in practice, as there are 88807 training data points and kernel methods are computationally expensive.  Also, in the kernel method each landmark feature is equally weighted, which could be problematic.  One can imagine DNN still has a conceptual advantage over SVM, even though the two were not directly compared in this study.

Conclusion


A fully connected DNN can be applied to effectively transform features non-linearly, so that the new features are more linear-regression friendly.  Its application to gene expression datasets shows that DNN consistently outperforms linear regression, which lacks the feature learning component.



Friday, February 2, 2018

Introduction to Deep Learning

Resource


In the past year or so, several high-quality deep learning materials have appeared.  If you start learning deep learning now, you are super lucky.

Recently my favorite deep learning book has become Francois Chollet's "Deep Learning with Python".  Chollet is also the author of the Keras package.  The example code in the book is significantly simpler to read, thanks to the Keras library.  Other books use code written in TensorFlow, the underlying library of Keras, and can be much harder to understand.  Chollet is not shy about providing his own insights on the field, especially the overview (Chapter 1) and the forward-looking opinions (Chapter 9), which I found extremely helpful.

My second favorite is Aurelien Geron's Hands-On Machine Learning with Scikit-Learn and TensorFlow.  I also read Learning TensorFlow: A Guide to Building Deep Learning Systems by Tom Hope, Yehezkel S. Resheff, and Itay Lieder, which I liked.  I also like the book Deep Learning with Keras by Antonio Gulli and Sujit Pal, especially its first chapter, where the authors carefully tested and compared how different hyperparameters affect the performance of the MNIST classification task.  Last, I cannot fail to mention the deep learning bible: Ian Goodfellow, Yoshua Bengio, and Aaron Courville's Deep Learning.  However, it is too heavy to be a good introduction; save it as an advanced-level book after you have gone through other introductory material.  If you can only read one book, definitely go with Deep Learning with Python.

I recommend the excellent video lectures: Stanford's CS231n (up to lecture 10)
https://www.youtube.com/watch?v=NfnWJUyUJYU&t=2s (2016)
Some newer topics can be found in http://cs231n.stanford.edu/syllabus.html (2017)

For Chinese readers, an excellent video course is offered by Hung-yi Lee.  Distinct from other lectures, Dr. Lee offers many of his own insights; this is extremely helpful, as a systematic theoretical framework is still unavailable for this field.
https://www.youtube.com/watch?v=CXgbekl66jc&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

If you happen to have a subscription to https://www.safaribooksonline.com (not free), I recommend an excellent video course by Jon Krohn, named "Deep Learning with TensorFlow: Applications of Deep Learning to Machine Learning Tasks".

Feature Learning


First, a quick review.  Artificial Intelligence (AI) is "the effort to automate intellectual tasks normally performed by humans" [1].  So strictly speaking Excel is an AI program, although we don't normally think of it that way.  The concept of AI is very broad; in many cases we come up with rules based on our human intelligence and then code these rules into a program.  The program can then convert data into output, as if our brain were behind it.  We are the intelligent piece here.  Machine learning (ML) is a type of AI where the machine writes the rules.  As demonstrated in Figure 1, human intelligence goes into classical programming, which is the hardest part of the job.  In machine learning, we only provide example data and answers, and ML writes the rules for us; the machine provides the intelligence (i.e., artificial intelligence), and the output of ML is rules.  These rules can then be used in the classical programming paradigm to process future data.  This blog series is dedicated to the machine learning portion of AI.

Figure 1.  Taken from Figure 1.2 in Chollet "Deep Learning with Python", page 5.


We discussed some core machine learning theories in our previous blogs.  Machine learning actually consists of two parts, "feature learning" and "model optimization".  Previously we mostly applied linear analysis tools such as linear regression, logistic regression, or SVM to raw input feature vectors $\mathbf{x}$.  We have been assuming the input features are already well designed, e.g., they may be Scale-Invariant Feature Transform (SIFT) features for images [2], ECFP fingerprints or physicochemical properties for compounds [3], or sequence motifs (position weight profiles) for genes or proteins [4], etc.  Typically the performance of our machines heavily depends on the quality of such features; good features such as SIFT required decades of hard work by domain experts.  If we ignore the feature learning portion, the model optimization portion is really just a numerical global parameter optimization problem; therefore, people sometimes jokingly call machine learning "machine optimization", as there is normally no "learning" component.  Where is the "learning" in machine learning?  The answer is "feature learning": projecting raw data into meaningful features that make the task easy has been the most difficult part of machine learning, the part that requires serious human intelligence.  Previously, smart features such as SIFT or ECFP were learned predominantly by human beings instead of by machines.  The traditional machine learning field has focused on the optimization part of the equation, while doing very little on the feature learning part.  This situation has now been fundamentally changed by deep learning.  As we will learn later, deep learning devises a strategy for feature learning, which is new, while it mostly inherits the model optimization portion, which we have already learned.

Feature learning basically means we take the raw input feature vector $\mathbf{x}$, apply some non-linear transform, and turn it into a new feature vector $\mathbf{z}$; the goal is to find the optimal transformation function so that the best overall machine learning performance can be achieved.  The non-linear kernels in SVM took a nice step toward addressing the feature learning challenge, where the kernel effectively introduces a transformation function $\mathbf{z} = \phi(\mathbf{x}, \gamma)$, projecting $\mathbf{x}$ into $\mathbf{z}$.  $\phi$ is determined by which kernel is chosen, and its exact form can be optimized through its hyperparameter $\gamma$ (see previous blog).  There are still two limitations to the SVM kernel approach.  First, we only explore a family of transformation functions given a specified kernel, such as a Gaussian kernel (e.g., Equation 22 in blog).  That is, this feature learning strategy is not very flexible: if the optimal transformation function does not look like what the Gaussian kernel implies, feature learning will not perform well.  In reality, we can imagine $\phi$ can take many forms, so ideally we need a learning mechanism that is capable of exploring all possible functional forms.  Second, the kernel strategy requires us to calculate similarities among all pairs of training data points, as well as similarities between a future data input and all training data points; therefore, kernel-based machines are too computationally intensive, in both memory and CPU, to be applicable to large data sets.

To summarize, a real machine learning process should include a feature learning capability, where we can explore the function space $\left\{ \phi \right\}$ to find an arbitrary optimal transformation through tuning model parameters.  That is, we want a model that can represent a Gaussian function, a polynomial function, a sine function, or basically any functional form!  The neural network (NN) is an answer to this requirement.  Figure 2 shows a single-layer neural network, where the input is our raw feature $\mathbf{x}$ and the output is the learned feature $\mathbf{z}$, and we assume:

$$\begin{equation} z_i= \sigma(\sum_{j} w_{ij}x_j+b_i) \end{equation}.$$


Figure 2. One layer of neural network, where each node is a linear combination of the previous layer, then subjected to a non-linear transformation to produce an activation for output.
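In code, one such layer is just a matrix-vector product followed by an element-wise non-linearity; a minimal numpy sketch (a sigmoid is chosen arbitrarily for $\sigma$ here):

```python
import numpy as np

def layer_forward(x, W, b):
    """z_i = sigma(sum_j w_ij * x_j + b_i) for one layer.
    x: (n_in,), W: (n_out, n_in), b: (n_out,) -> returns z: (n_out,)."""
    sigma = lambda a: 1.0 / (1.0 + np.exp(-a))   # the element-wise non-linear activation
    return sigma(W @ x + b)
```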

For simplicity, we will assume the offset term $b_i$ is already included in the linear transformation $\mathbf{w}_i\mathbf{x}$, so $\mathbf{w}_i\mathbf{x}$ implies $\mathbf{w}_i\mathbf{x}+b_i$.  Let us assume the non-linear function $\sigma$ is the sign function for now (i.e., a step function); we can then show that a large single-layer neural network can approximate any arbitrary transformation function $\phi$.  Consider the special case where the input contains only one feature -- a scalar $x$.  We have two neurons $z_1$ and $z_2$ defined as (Figure 3):

$$\begin{equation} \begin{aligned} z_1 &= \text{sign}(x-a) \\ z_2 &= \text{sign}(a+\delta-x) \\ f(x) &= \frac{h}{2}z_1 + \frac{h}{2}z_2 \end{aligned} \end{equation},$$

The equation above implies our NN parameters are: $w_1 = 1$, $b_1 = -a$, $w_2=-1$, and $b_2=a+\delta$.  The output $f(x)$ is a weighted sum of $z_1$ and $z_2$, which corresponds to a function element starting at $a$ with width $\delta$ and height $h$ (Figure 3).

Figure 3. A pair of $z$ nodes combined with a non-linear sign function can produce a function element of width $\delta$ and height $h$, sitting at location $a$.

Now with many such $z$-pairs, we can construct multiple function elements centered at various locations with various heights, which can then be combined into a "bar graph" to approximate any arbitrary function $f(x)$ in the output node (Figure 4).  Notice that for each given $x$ value, only one pair of $z$ neurons is activated, which gives an output of $f(x)$.

Figure 4. Using many structure elements shown in Figure 3, we can approximate any $f(x)$ with a single layer of NN.
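This construction is easy to reproduce numerically; the sketch below is my own illustration, approximating $\sin(x)$ with a bar graph built from sign-neuron pairs exactly as in Figures 3 and 4:

```python
import numpy as np

def bump(x, a, delta, h):
    """One function element of Figure 3: a bar of height h on [a, a + delta],
    built from two sign neurons z1 = sign(x - a) and z2 = sign(a + delta - x)."""
    z1 = np.sign(x - a)
    z2 = np.sign(a + delta - x)
    return (h / 2.0) * z1 + (h / 2.0) * z2

def approximate(f, x, lo=0.0, hi=2 * np.pi, n_bars=100):
    """Sum of n_bars bumps whose heights follow f, as in Figure 4."""
    delta = (hi - lo) / n_bars
    starts = lo + delta * np.arange(n_bars)
    return sum(bump(x, a, delta, f(a + delta / 2)) for a in starts)

x = np.linspace(0, 2 * np.pi, 1000)
approx = approximate(np.sin, x)              # staircase approximation of sin(x)
print(np.max(np.abs(approx - np.sin(x))))    # the error shrinks as n_bars grows
```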

A more tedious proof can be found in [5], and there are more rigorous proofs that $\sigma$ can take other non-linear forms.  Based on this simple example, let us take it for granted that NN can approximate any transformation function, and that the exact shape of the function is encoded by the network parameters $\mathbf{w}$ and $\mathbf{b}$.  Certainly the ability of an NN to mimic a function depends on how many neurons it uses, but at least NN provides a framework to mimic any function.  Once the weights in an NN are fixed, $\mathbf{x}$ is transformed into $\mathbf{z}$ correspondingly, which solves the feature learning problem; then a traditional linear regression or logistic regression engine discussed previously can be hooked to the $\mathbf{z}$ output to complete the machine optimization part of the puzzle.  The combination of NN feature transformation (or feature learning) and the traditional regression/classification engine produces a mighty machine that can consume raw feature input directly, as it is now theoretically capable of learning any $\phi$ without the need for human learning.  The optimal parameters of the machine can be determined by minimizing the loss function based on the task at hand.  Popular loss functions are described in the multiclass classifier chapter.  From now on, we can approximately equate a neural network with an arbitrary unknown function $f$, i.e., $NN = f(x)$.  I cannot overemphasize how this view has helped me in understanding deep learning: as long as we can describe a problem as a function, we are able to apply deep learning to solve it by approximating that function.  We will use the GO game as an example later to further illustrate this philosophy.

The general NN architecture is outlined in Figure 5, where we use an NN layer for the $X \to Z$ transformation (from the input layer to a hidden layer), then use a classical ML layer (which may not contain $\sigma$) to regress or to classify (from the hidden layer to the final output layer).  The $X$ layer is called the input layer, the last ML part generating the output is called the output layer, and all the network structure between the input and the output layer (which can be multiple layers, although only one is shown in Figure 5) is called the hidden layer(s).  In practice, since the regression/classification components can also be represented as NN-style connections (Figure 5), all layers, from the input to the final output, are collectively called a neural network.  Compared to the traditional ML approach, NN goes one step further in its capability of exploring the feature transformation space.  Therefore, people sometimes rightfully call deep learning feature learning.  Chollet even defined traditional machine learning as: "searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal." [6]  Deep learning pushes the envelope of that "predefined" space (such as one defined by an SVM kernel) into something much less bounded; the "feedback signal" refers to the loss function and represents the machine optimization aspect.


Figure 5. An example NN-binary logistic regression classifier.  The raw input features are first transformed into a new set of features, represented by the hidden layer; then the new features are linearly combined and softmax-scaled into a two-element probability vector, as explained in the multiclass classifier chapter.  Although the main contribution of NN is feature learning, we call the complete network, including the logistic regression components, one NN.  There can be multiple hidden layers stacked one after another, which provides the depth that leads to the names deep neural network (DNN) and deep learning.
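As a concrete, made-up illustration of Figure 5, here is a minimal Keras sketch of such a network: one hidden layer that does the feature learning, followed by a softmax layer that is just the binary logistic regression classifier discussed before.  The input size, layer width, and training settings are arbitrary choices for illustration.

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(20,)))  # X -> Z: the feature-learning hidden layer
model.add(Dense(2, activation='softmax'))                   # Z -> output: softmax (binary logistic regression)
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train_onehot, epochs=10, batch_size=32)  # with one-hot encoded labels
```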

GO Game as a Function


Alpha GO (and its second generation, Alpha Zero) is a well-known example of what deep learning can do.  The GO board is a $19 \times 19$ grid, where each location can have one of three states: unoccupied, or occupied by a black or a white stone.  The number of possible GO layouts is unimaginably huge, $3^{361} \approx 10^{172}$.  Even after excluding non-sensible moves, there are still about $10^{170}$ possibilities [7].  For GO players, given the current board layout $L$ as the input, they identify some regions of interest as key battlegrounds or strategic territories, within which they need to consider the next move.  Then they evaluate those move candidates for their contributions to the final victory, oftentimes carrying out mental simulations for a number of virtual steps, and pick the move that feels optimal to them.  So making a GO move is equivalent to solving the following problem:

$$\begin{equation} m^* = \arg \max_{m} J(L^\prime(L, m)), \end{equation} $$
where $m$ is a move candidate, which changes the current layout $L$ into a new layout $L^\prime$, and $J$ is the scoring function for a layout regarding how likely it is to win at the end.  $m^*$ is the best move, the one that generates the best possible new layout.  $J$ obviously is an extremely complicated function; it has to integrate over all possible future moves by both players until the end of the game and weight each move by some probability function based on the opponent.  Each GO player has his/her own $J$: when they look at a board layout, they have a "gut feeling" of how promising the current layout $L$ is (taking all its future evolving trajectories into account as well); therefore, they are able to compare several moves, repeatedly apply gut feelings, and pick the best move at the end.

For comparably simpler games such as chess, $J$ may rely on heavy simulations and knowledge derived from past chess games and can be calculated reasonably well, given sufficient computing power like IBM's Deep Blue [8].  The scoring function $J$ in GO has to take a feature vector of 361 elements, each of which can take one of 3 values.  This function for sure is intractable.  Each GO player's years of training help him/her model this function as well as possible, but they are still limited by the number of games they managed to play and with whom they played.  As there is no viable way to construct $J$ explicitly, what Alpha GO does conceptually is use a deep neural network to approximate this function, and with the millions of training examples it receives and millions of artificial neurons, its approximation to $J$ is more accurate than any human player can achieve using biological neural networks.  Alpha GO certainly uses lots of other ingenious solutions, e.g., to reduce the sample space of $m$ so that it only evaluates those moves that are promising (where much previous research has focused), but the core component is its use of an NN to model the scoring function $J$.  In fact, theoretically we could even use an NN to model the $\arg \max$ operation as well, as outputting the best move $m^*$ given a layout input is also a function!

Real life is complex, and we have learned to develop intuitions or experience for many complex tasks.  It is very possible that these intuitions are basically a neural network approximation in our brain of the optimal true decision function in reality.  We continually reevaluate and improve the model based on the lessons life teaches us; as a result, some people have developed better models than others, but all are approximations to the ultimate truth.  The important concept here is to appreciate that an NN essentially does function approximation.

For traditional machine learning problems, the form of $J$ is a loss function, which can be precisely defined as previously described here.  However, $J$ in GO is largely unknown, besides knowing it is a function.  The training signal comes from the final outcome of a game, win or lose, which propagates back as a decaying reward for all moves in that game.  Therefore, the shape of $J$, i.e., the parameters of the NN, is optimized within a framework that maximizes the winning outcome over millions of games of playing experience; this type of learning problem is different from either supervised or unsupervised learning and is called reinforcement learning (I do not have experience in this area yet).


Deep Learning Implementation

"Deep"


The single-layer NN described above is not used in practice, as it is inefficient at function approximation: a decent approximation would require too many $z$ neurons in that layer.  Function modeling is much more efficient if we allow multiple layers of neurons.

In the example below, our target function is a red curve consisting of three similar peaks.  Say each peak can be reasonably approximated by ten blue neurons; it would then require 30 neurons to approximate this red function using a single-layer neural network.  On the right of Figure 6, a two-layer network architecture is used instead, where the ten neurons in the first layer construct the structure element of one peak; in the second layer, only three neurons are needed to place the peak three times at different locations and modulate their heights accordingly.  By reusing the structure element (peak) constructed in the earlier layer to model the repetitive components within the target function, we make much more efficient use of the network parameters.  There are often many repetitive patterns within real-life functions; therefore, a multi-layer neural network is more cost-effective and easier to train.

Figure 6. The red target function is modeled by a neural network.  On the left, a single layer of neurons is used, where it requires ten neurons to model each peak, so a total of 30 neurons are needed.  On the right, the first layer uses ten neurons to model one peak, and the second layer uses only three neurons.  Each neuron in the second layer simply reuses the feature (peak) already made available by the first layer, stretches/shrinks it, and places it at one of three locations.  This way, the two-layer architecture uses far fewer parameters.


Another argument for preferring DNN is hierarchical feature transformation: layer by layer, it allows simpler features to be constructed at earlier layers, and such simpler features are then combined and reused to construct more complex features at deeper layers.  It would require many neurons if one had to construct the complex features in only one step.  Figure 7 shows that many icons can be built at the third hidden layer by only allowing a horizontal and a vertical line-segment feature to be constructed in the first layer.  Typically, the number of possible features grows exponentially as the depth increases.  Another example: instead of drawing every leaf and branch to show a forest, we draw a leaf first, copy and paste the leaf into a branch, copy and paste the branch into a tree, and copy and paste the tree into a forest.  Therefore, many non-linear functions require a DNN to be modeled efficiently, i.e., a DNN can model more complex functions by allocating neurons into different layers than by placing them all in one layer.


Figure 7. By using multiple layers of NN, the complexity of features it can model grows exponentially.  Such complex features would require lots of nodes if only a single layer were used.

Non-linear Transformation


In the early days of deep learning, $\sigma$ took the sigmoid form, because it was believed to mimic biological neurons.  However, parameter optimization of such DNNs is hard.  To find the optimal weight parameters in a DNN, we typically use the stochastic gradient descent (SGD) method to descend down the surface of the loss function $J$ in the NN parameter space $\theta$.  The gradient $\frac{\partial J}{\partial \theta}$ must be calculated using a back propagation process.  We will skip the details of back propagation, as it is well described in the literature and is encapsulated as a built-in feature in the TensorFlow and Keras libraries.  Just think of it in a hugely simplified form:

$$\begin{equation} J(w, x) = J(\sigma(wx)), \end{equation}$$

where $w$ modulates $x$ and the result is then transformed nonlinearly by $\sigma$; the output of $\sigma$ is eventually used to calculate $J$.  Therefore, we have:

$$\begin{equation} \begin{aligned} \frac{\partial J(w, x)}{\partial w} &= \frac{\partial J}{\partial \sigma}\frac{\partial \sigma(w, x)}{\partial w} \\
&= \frac{\partial J}{\partial \sigma}\sigma(wx)(1-\sigma(wx))x. \end{aligned} \end{equation}$$

When an intermediate neuron carries a large value $wx$, either very positive or very negative, $\sigma(wx)$ is close to 1 or 0.  Therefore the gradient $w$ receives is very small (Equation 5).  How often does a neuron saturate?  Quite often, as explained later (weight initialization section): because one neuron receives inputs from many parent neurons, its input tends to be much larger than the output of a single parent neuron; when the weights are not very small, the signal sum tends to be large enough to fall into the plateau region of the sigmoid function.  This is called the vanishing gradient problem, and it explains why a deeper NN is hard to train: weights in earlier layers do not effectively receive feedback (gradients) from the loss function, so they cannot be tuned to improve NN performance [9].  People developed strategies such as training a DNN layer by layer, one layer at a time: only the parameters for the first layer are optimized and then fixed, then the parameters of the second layer are optimized and fixed, etc.  However, this strategy is no longer used, thanks to the developments described below.

The currently most popular practice is to replace the sigmoid function with the much simpler ReLU function of the form $\max(wx, 0)$.  This naive non-linear transformation keeps positive outputs and zeros out negative ones.  The gradient therefore no longer decays for positive $wx$, no matter how large its value is.  For negative $wx$, the gradient is zero and the corresponding weight receives no update from this particular training sample.  But when other training samples are used, $wx$ might become positive and the weight will receive a chance to update.  Even if the gradient is zero within a training batch, so that a $w$ skips an update iteration (the neuron is "deleted" temporarily), the neurons in the earlier layer do not depend on this short-circuited neuron alone.  They depend on all the neurons in the current layer; therefore, neurons in the earlier layer are still likely to receive non-zero gradient information, i.e., the short-circuiting does not propagate.  In practice, ReLU is one of the two main factors that make DNNs trainable.  The layer-by-layer training strategy may still help to initialize a DNN with a decent starting point, but it seems no longer necessary when there is a reasonable amount of training data.  Most DNNs can now be trained in one shot, as gradient information can be propagated many layers back thanks to the ReLU function.  With people understanding gradient propagation better, very deep DNNs, such as Microsoft's 152-layer ResNet, can be trained, as long as there is wiring in the DNN that helps pass the gradient to early layers [10].
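The contrast is easy to see numerically; a small sketch (my own illustration) comparing the local derivatives of a sigmoid unit and a ReLU unit as the pre-activation grows:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)      # saturates: approaches 0 for large |a|

def relu_grad(a):
    return float(a > 0)       # stays 1 for any positive a, no matter how large

for a in [0.5, 2.0, 5.0, 10.0]:
    print(a, sigmoid_grad(a), relu_grad(a))
# sigmoid_grad shrinks toward 0 while relu_grad stays 1,
# which is why gradients survive many layers with ReLU.
```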

We may still wonder why the biological system uses a sigmoid-like response function and performs well.  My thought is that the issue with sigmoid probably originates from our simplification of the NN into a layered structure.  A biological NN probably does not contain clear layer separations, i.e., there are neurons directly connecting the input layer to the output layer, others connecting them with one hidden layer, two hidden layers, etc.  So a real biological NN is a mixture of NNs with varying numbers of layers; it is a general network instead of nicely layered sandwiches.  This implies there exist shortcuts that pass signals by skipping some intermediate hidden layers, and they also serve as expressways to route gradient traffic to early layers during the back propagation stage, so the network remains trainable even when it is deep.  We have observed similar ideas used in ResNet, where "skip connections" were introduced [10]; ResNet can have more than one hundred layers and remain perfectly trainable.

Initialization


When the vanishing gradient problem was identified as being associated with the difficulties in training DNNs, it was realized that the initial values of the weights play an important role in the trainability of a DNN.  Taking the sigmoid function as our $\sigma$: if the weights are really small, we are basically using its middle linear region, which loses the benefit of non-linearity.  However, when the weights are not small, the summed input sent to the sigmoid function will be large (either positively or negatively).  Due to the many input neurons a neuron on the next layer is connected to, the weighted sum will easily lead to signal saturation and cause the vanishing gradient problem described above.  Therefore, the idea is to initialize the weight parameters in such a way that the variance of the input and the output of a neuron layer roughly maintains the same amplitude.

This is clearly explained in a blog here [11], where we aim to keep ${\rm var}(y) = {\rm var}(x)$ for:
$$\begin{equation} y = \sum_i w_i x_i. \end{equation}$$
Given:
$$\begin{equation} \begin{aligned} {\rm var}(y) &= \sum_i {\rm var}(w_i x_i) \\
&= \sum_i {\rm var}(w_i) {\rm var}(x_i) \\
&\approx n {\rm var}(x) {\rm var}(w) \\
&\approx n {\rm var}(y) {\rm var}(w).
 \end{aligned} \end{equation}$$
Here we assume ${\rm mean}(w)$ and ${\rm mean}(x)$ to be zero in the second step; we also assume the variances are the same across $i$ for ${\rm var}(w_i)$ and ${\rm var}(x_i)$, respectively.  By requiring ${\rm var}(y) = {\rm var}(x)$, we obtain ${\rm var}(w) = \frac{1}{n}$, where $n$ is the number of input neurons.

Currently it is popular to use Xavier initialization (a.k.a. Glorot initialization) or He initialization [10].  For the ReLU function, following He initialization, we can initialize the weights with a uniform distribution over $[-r, r]$, where:

$$\begin{equation} r = \sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}} \end{equation}$$

Here $n$ becomes the average of the input and output neuron counts, so that the initialization takes into account both forward and backward propagation.  Initialization can prevent signal saturation for some time, but the weights will be adjusted and this strategy may not last.  The ReLU solution mentioned above is a more fundamental solution (both should be used).
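A minimal numpy sketch of this initialization rule (my own illustration of the formula above; `fan_in` and `fan_out` are the input and output neuron counts of a layer):

```python
import numpy as np

def he_uniform(fan_in, fan_out):
    """He initialization for ReLU layers: uniform in [-r, r] with
    r = sqrt(2) * sqrt(6 / (fan_in + fan_out))."""
    r = np.sqrt(2.0) * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-r, r, size=(fan_out, fan_in))

W = he_uniform(943, 3000)
# empirical weight variance vs. the target r^2 / 3 = 4 / (fan_in + fan_out)
print(W.var(), 4.0 / (943 + 3000))
```

Keras ships comparable built-in initializers (glorot_uniform, he_uniform), although their exact fan-in/fan-out conventions differ slightly from the formula above.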


Batch Normalization


To continue to prevent signal saturation as the optimization iterations proceed, we cannot rely entirely on the initialization process.  There is another strategy called batch normalization that automatically scales the output of each layer into the right range empirically.  It works as follows:

$$ \begin{equation} \begin{aligned}
\mu_B &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; \mathbf{x}_i \\
{\sigma_B}^2 &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; (\mathbf{x}_i - \mu_{B})^2 \\
\hat{\mathbf{x}}_i &= \frac{\mathbf{x}_i - \mu_B}{\sqrt{ \sigma_B^2 + \epsilon}} \\
\hat{\mathbf{z}}_i &= \gamma \hat{\mathbf{x}}_i + \beta. \end{aligned} \end{equation}$$

For a given mini-batch containing ${m_B}$ data records, it calculates the mean $\mu_B$ and standard deviation $\sigma_B$ and uses them to first scale the input $\mathbf{x}$ into a normalized $\hat{\mathbf{x}}$.  However, since this normalized distribution may not be the optimal distribution for training, it then transforms $\hat{\mathbf{x}}$ further into $\hat{\mathbf{z}}$ through scaling and shifting with $\gamma$ and $\beta$, respectively.  $\gamma$ and $\beta$ are treated as model parameters, which are optimized as part of $\theta$.  At test time, we do not have a mini-batch, only one test sample.  In that case, $\mu$ and $\sigma$ come from estimates made during training, reflecting the average mean and standard deviation of the training data set.
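A minimal numpy sketch of the training-time computation above (the running averages of $\mu$ and $\sigma$ kept for test time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (m_B, n_features) mini-batch; gamma, beta: (n_features,) learned parameters."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to roughly zero mean, unit variance
    return gamma * x_hat + beta               # learned re-scale and shift
```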


Optimization

Traditional machine learning methods use vanilla gradient descent (GD) to minimize the loss function.  GD/SGD rolls a ball downhill; the speed of the ball is proportional to the slope, so the ball rolls slower and slower as it approaches the bottom.  GD works well for simple functions, but not for $J$ in deep learning.

A DNN contains tens of thousands of parameters.  For a three-layer DNN, say the input feature vector contains 10 elements and we have 100 neurons each in the first and the second hidden layers, followed by an output layer of one neuron.  The first hidden layer contains $10 \times 100 + 100 = 1100$ parameters, where 1000 are weight parameters associated with the connections between the 10 inputs and the 100 neurons and 100 are bias parameters, one per neuron.  The second hidden layer contains $100 \times 100 + 100 = 10100$ parameters; the last output layer contains $101$ parameters.  So there are 11301 parameters in total.  DNN systems can easily contain millions of parameters.  Some new challenges arise when the parameter space is so vast.

First, we need to calculate gradients based on exact gradient formulas instead of estimating gradients numerically.  To estimate gradients numerically, we would need to change one parameter $\theta_i$ slightly with all the remaining 11300 parameters fixed, then carry out a forward propagation with the training batch in order to calculate a finite difference for $\frac{\partial J}{\partial \theta_i}$.  Repeating this process for all 11301 parameters would be far too expensive.  We need all 11301 partial derivative formulas somehow known, so we can plug in values to get the 11301 gradient values.  We skip the details (you should read Appendix D: Autodiff in [12] to appreciate how it works), but this implies all deep learning packages have to represent the DNN as a graph, so that they know how all parameters contribute to the loss function and can compute derivatives exactly.  Programming with deep learning modules is therefore drastically different from traditional machine learning programming, as the statements are not executed sequentially, but rather as a graph.  (BTW, how do we compute the derivative of the ReLU function, as it is not differentiable at $x = 0$?  Do not let this bother you; we can, for example, use the average of the derivatives from both directions, so its derivative at zero can be set to 0.5.)  If one programs with lower-level libraries, such as TensorFlow, the logic can be scattered everywhere and the code is hard to read.  However, higher-level libraries, such as Keras, wrap much of that logic so that the code reads a lot like a traditional procedural programming workflow.  Therefore, I would recommend coding with Keras.
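To make the cost argument concrete, here is a toy gradient check (my own illustration) for the simplified loss $J(w) = (\sigma(wx) - y)^2$: the exact gradient from the chain rule of Equation 5 versus a finite-difference estimate, which needs extra forward passes for every single parameter:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, x, y):
    return (sigmoid(w * x) - y) ** 2

def analytic_grad(w, x, y):
    s = sigmoid(w * x)
    return 2.0 * (s - y) * s * (1.0 - s) * x   # chain rule, cf. Equation 5

def numerical_grad(w, x, y, h=1e-6):
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)  # two extra forward passes

w, x, y = 0.3, 1.5, 1.0
print(analytic_grad(w, x, y), numerical_grad(w, x, y))  # the two should agree closely
```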

Second, there are millions of global minima.  All $m$ neurons within a layer are basically equivalent in the architecture, so if we find a global minimum solution, we could permute the order of the neurons in $m!$ ways and still get an equivalent solution.  Considering $n$ layers, we will have $(m!)^n$ equivalent solutions.  This does not mean we can easily find one of the global minima; in fact, we very likely end up in a local minimum as our final solution, as there are even more local minima, and finding a global minimum among many local minima is a very hard problem.  But do not let this bother you too much.  Say each of our brains is a giant neural network; our weight parameters are drastically different from each other's, and the weights change every second as new connections are formed, old ones are destroyed, and existing ones are strengthened or weakened.  All these different DNN solutions have no trouble guiding us to walk, drive, speak, and listen effectively.  The me of today is still largely the me of yesterday.  None of us has the best DNN weights for walking; yes, some might walk slightly better than others, but we are nearly all perfect walkers for all practical needs.  Even if we cannot be Einstein, it does not mean we cannot be good physicists.  Our goal for a DNN application is to create a tool that performs well; the difference between well and the best may not be that big of a deal.

Third, it is actually hard to find even a local minimum.  A local minimum requires $\frac{\partial J}{\partial \theta} = 0$ for every parameter $\theta$.  However, when the partial derivative is zero, the function along that parameter dimension could be at a maximum, a minimum, or a saddle point.  With tens of thousands of parameters, when we first find a zero-gradient point, it is almost always the case that $J$ is at a saddle point along certain dimensions, i.e., we deal with saddle points, not minimum points, most of the time.  The biggest challenge for DNN optimization is to overcome saddle directions.  Under the vanilla GD strategy the ball moves slower and slower as it approaches a saddle point; therefore, it does not work well.  We need the ball to behave like a ball in the physical world, carrying some momentum and rolling over a saddle point in order to continue its downhill journey.  This can certainly also help the ball climb over a small bump and potentially land in a much better local minimum.  This observation has led to momentum optimization and other strategies.  Here we only explain a popular one called Adam optimization.

Adam optimization is based on two ideas.  The first is that the speed of the ball should not be totally determined by the current slope; it should largely inherit what it carried previously and modify it (the momentum optimization just described above).  The second is that vanilla GD uses a constant learning rate $\eta$ for all parameters, which should ideally be parameter-specific.  The learning should happen faster in low-gradient directions, as they are likely saddle directions, and slower in high-gradient directions (this is typically explained with the example of an elongated bowl, where a large gradient implies noise [13]); therefore, we want to adjust $\eta$ to reduce the impact of large gradients (this idea originated from AdaGrad optimization and was later improved by RMSProp optimization).  The Adam algorithm, combining research results in both areas, is outlined by the following three update rules:

$$ \begin{equation} \begin{aligned}
\mathbf{m}& \leftarrow \beta_1 \mathbf{m} + (1-\beta_1) \nabla_\theta J(\theta), \\
\mathbf{s}& \leftarrow\beta_2 \mathbf{s} + (1-\beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta), \\ \theta & \leftarrow \theta - \eta \mathbf{m} \oslash \sqrt{ \mathbf{s} }. \end{aligned} \end{equation}$$

Here, $\beta_1$ and $\beta_2$ are decay rates controlling how heavily information from past steps affects the current update, and $\eta$ is a learning rate parameter.  Smaller $\beta$ means less influence from history.  $\mathbf{m}$ is a history-averaged gradient, where gradients with consistent directions build up and gradients that frequently change direction cancel out (the momentum optimization idea).  This part is easy to appreciate.  $\mathbf{s}$ is the history-averaged squared amplitude of the gradient, ignoring the directions of the gradients.  The magnitude of the resulting smoothed-out gradient is then used to modulate $\eta$ for each individual parameter direction; it encourages a large step if $\mathbf{s}$ is small.  The $\otimes$ and $\oslash$ operators just mean the calculations are element-wise, not matrix/tensor operations.
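A minimal numpy sketch of these three update rules (the bias-correction terms of the full Adam algorithm are omitted to match the simplified equations above; the small `eps` in the denominator is my own addition to avoid division by zero):

```python
import numpy as np

def adam_step(theta, grad, m, s, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update; all arguments are numpy arrays of the same shape."""
    m = beta1 * m + (1 - beta1) * grad             # history-averaged gradient (momentum)
    s = beta2 * s + (1 - beta2) * grad * grad      # history-averaged squared gradient
    theta = theta - eta * m / (np.sqrt(s) + eps)   # per-parameter adaptive step
    return theta, m, s
```

In Keras this is available out of the box as `optimizer='adam'` (or `keras.optimizers.Adam`).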

I actually have not fully understood the part where $\mathbf{s}$ is used to modulate $\mathbf{m}$.  For the special example where $\nabla_\theta J(\theta)$ is a constant, $\mathbf{m}$ and $\mathbf{s}$ will converge to stable values:

$$ \begin{equation} \begin{aligned}
\mathbf{m}& = {\nabla_\theta J(\theta)} \\
\mathbf{s}&= \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
\theta & \leftarrow \theta - \eta  \end{aligned} \end{equation}$$

So the step no longer contains any gradient magnitude information, which does not make sense.  Probably our special case here does not happen often and can therefore be ignored.  An interesting explanation is based on the frequency with which a parameter receives gradient feedback [14].  If many training data are insensitive to a particular parameter, i.e., its gradient is about zero most of the time, but once in a while it receives some feedback from some interesting training samples, then the history-averaged $\mathbf{s}$ is small; without the adjustment we would miss the golden opportunity to tune that parameter, so we should increase its learning rate in order to seize such a rarely occurring training opportunity.  Here, one can argue that momentum averaging also hurts the infrequent parameters, but maybe that part can be alleviated by adjusting $\beta_1$.  Adam performs larger updates for infrequent parameters and smaller updates for frequent parameters.  Certainly this is a double-edged sword, as it magnifies noise as well.  In practice, Adam is currently among the most popular optimization algorithms; it is considered capable of automatically adjusting the learning rate instead of relying on some unknown learning rate scheduling protocol.  I guess we might just have to accept that Adam optimization somehow works in practice, even if we cannot fully explain its magic power at this point.  This also shows the current status of the deep learning field, which is still far from maturing into a science.


Dropout


We cannot run machine learning optimization without regularization; otherwise, we inevitably run into overfitting.  The more complex a DNN is, the more likely it will demonstrate unrealistically good performance on a limited training data set.  Our previous weapon for regularization was to include extra terms such as lasso or ridge penalties in the loss function in order to prevent the network parameters $\theta$ from taking huge values.  However, a new strategy called dropout has been discovered for DNN and has proven to be extremely effective against overfitting.

When dropout is applied to a layer with, for example, a rate of 0.4, the output from each neuron in the layer has a 40% chance of being zeroed out.  So at each training step, on average, only 60% of the layer output is piped into the next layer.  Since dropout is random during the learning process, each neuron receives gradient updates about 60% of the time.  The benefits of dropout might be explained in two ways.

First, since each parent neuron could be dropped, no single neuron can be critical to the network.  This encourages some feature redundancy.  For example, we can recognize a human face based on either the eyes or the nose, alone or in combination.  If the network already contains an eye feature, there is no motivation for it to discover the nose feature, as the eye feature already does a pretty good job.  With dropout, the eye feature might be removed; therefore, discovering an additional nose feature becomes advantageous.  So dropout encourages more independent features, such that a subset of them can be dropped without affecting the accuracy of face recognition.  If the network has already discovered both nose and eye features, dropout can also help tune the activation parameters, so that a face neuron can be correctly fired by the nose or the eyes alone, without requiring both to be present in a future test set.  The final network, after being tortured by randomly losing an eye neuron or a nose neuron, is more robust at dealing with photos where part of the face is covered or contaminated by noise (Figure 8).

Figure 8.  Dropout encourages a network containing an eye neuron to discover another nose neuron.  As a result, the presence of both neurons enables the network to recognize faces with only eyes or only a nose.


Second, dropout might effectively create an ensemble machine learning solution.  The random combination of neurons might be equivalent to the linear combination of many trimmed-down DNN models, each containing a random 60% subset of the original DNN (Figure 9).  An ensemble model has a theoretical advantage over a single model in that it reduces the variance of the prediction, i.e., the dropout-applied model is more generalizable.  Bear in mind that deep learning at this stage is more like a technology than a science, so many statements presented here are people's, or my own, interpretations to help us accept why certain techniques seem to work better than others.  They might not turn out to be exactly correct in the future.

Figure 9. Dropout effectively makes the neural network an ensemble of many smaller networks.  If the dropout rate is 50%, an M-neuron network could contain $2^M$ possible dropout network architectures.  An ensemble classifier typically improves performance by reducing variance.


Although dropout and regularization can be applied simultaneously, dropout alone often seems sufficient.  One can tell whether further regularization is needed by checking whether the training error and the validation error are close.  Dropout does not need to be applied to every layer, and the dropout rate should not be too high, as that might be too conservative and lead to underfitting.  Another point to be aware of is that dropout should be disabled at test (prediction) time: we want all nodes to contribute, and otherwise predictions would not be deterministic.  In our example, when all neurons contribute at test time, the expected input to the next layer is about $\frac{1}{0.6} = 1.67$ fold larger, so at test time the outputs from each neuron should be scaled down by a factor of 0.6.  We use a 40% dropout rate here only as an example; it certainly is a hyperparameter subject to optimization in applications.
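A minimal numpy sketch of the scheme just described (train-time masking, test-time scaling by the keep probability) might look like this:

    import numpy as np

    rng = np.random.default_rng(0)
    p_drop = 0.4                               # the example dropout rate used above

    def dropout_layer(a, train=True):
        if train:
            mask = rng.random(a.shape) >= p_drop   # each activation is kept with probability 0.6
            return a * mask                         # dropped activations are zeroed out
        # At prediction time every neuron contributes, so scale outputs down by the
        # keep probability (0.6) to match the expected magnitude seen during training.
        return a * (1.0 - p_drop)

Most frameworks actually implement the equivalent "inverted dropout", scaling the kept activations up by $1/0.6$ during training so that nothing needs to change at test time.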


Hyperparameter Optimization


There are still many unanswered questions about the DNN architecture, including but not limited to how many layers to use, how many neurons per layer, which non-linear function to apply, which parameter initialization strategy to adopt, which optimization strategy is the most effective, etc.  These are all called hyperparameters.  We learned previously that to answer these questions, one would explore all combinations in the hyperparameter space and then find the one that yields the best performance based on the training-testing cross-validation strategy as described here.  In practice, it is not realistic to explore this huge hyperparameter space due to the computational burden, therefore many heuristic exploration strategies have been seen in applications.  We often need to rely on gut feeling and assume certain parameters should be fixed, so that we can reduce the number and the range of the hyperparameters to tune.  It is typically considered more advantageous to adopt a random sampling approach rather than a grid search strategy when exploring the hyperparameter space [15] (Figure 10; a sketch is shown after Figure 10).  This is because, along each individual hyperparameter dimension, each random sample is likely to pick a different value, compared to the large redundancy embedded in the grid search strategy.  Conceptually, however, a seemingly more elegant approach is to rely on a Gaussian Process (GP).  A GP makes a smooth Bayesian interpolation of the loss function formed by the hyperparameters, based on the sample runs already available.  It then recommends a new sample point in the hyperparameter space, the one expected to produce the best gain in performance.  GP might be a topic for a separate blog.

Figure 10.  Not all hyperparameters are equally important.  In this example, the $X$ parameter is important and $Y$ is unimportant.  With the grid search strategy on the left, 1/3 of the searches are wasted, as exploring different $Y$ values with $X$ fixed is a waste of time.  On the right, a random search strategy allows us to explore more $X$ values, therefore leading to an $X$ value much closer to its optimum.
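The random search idea itself is trivial to implement.  Below is a hedged sketch; the search ranges, as well as the train_dnn and validation_error helpers in the comment, are hypothetical placeholders rather than values from any reference.

    import random

    random.seed(0)

    def sample_hyperparameters():
        # Hypothetical search ranges, for illustration only.
        return {
            "n_layers": random.choice([1, 2, 3]),
            "n_hidden": random.choice([1000, 3000, 6000, 9000]),
            "learning_rate": 10 ** random.uniform(-5, -2),   # sample the learning rate on a log scale
            "dropout_rate": random.uniform(0.0, 0.5),
        }

    trials = [sample_hyperparameters() for _ in range(20)]
    # For each trial we would train a DNN and keep the configuration with the
    # lowest validation error, e.g.:
    # best = min(trials, key=lambda hp: validation_error(train_dnn(hp)))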

A Black-Box DNN Solution


Our introduction to DNN here is brief; there are many training materials now available that cover fully-connected DNNs in great detail.  The focus of this blog is to explain the theoretical advantage of DNN and the key components of its implementation.  We can now mentally put these pieces together to form a general-purpose black-box DNN solution, which will serve as a generic machine for any typical regression/classification problem.

The thought exercise is to design a fully-connected DNN (all the NNs we have described so far are fully connected, i.e., all the neurons in the previous layer are connected to all neurons in the current layer, in contrast to the convolutional NN to be discussed in the future) consisting of $n$ layers, each containing the same number $m$ of neurons.  The last layer is either a linear layer without non-linear transformation (if the problem domain is regression) or a softmax layer (if the problem domain is classification).  The corresponding loss function is the cross-entropy function for classification problems and the least-square error function for regression problems.  For each given set of hyperparameters (i.e., a certain DNN structure), we split the input training samples into small batches and send them to the DNN batch by batch.  For each batch, we calculate the loss function and its gradient with respect to each network parameter $\theta$, then update the parameters based on the gradient information to hopefully minimize the loss function.  A complete iteration over all training samples is called an epoch.  At the end of each epoch, we evaluate the validation error on a held-out data set.  After multiple epochs, the process is stopped when the validation error starts to rise.  As resources permit, DNNs for other hyperparameter settings (possibly suggested by the GP process) are trained as well.  Then the DNN with the best expected performance is chosen.  A very brief outline of the code is shown here, and a minimal sketch follows below.
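A minimal Keras sketch of such a black-box regression DNN follows; the synthetic data, layer sizes, and early-stopping patience are arbitrary placeholders, not values taken from any particular study.

    import numpy as np
    from tensorflow import keras

    # Synthetic placeholder data standing in for real training/validation matrices.
    n_features, n_targets = 100, 10
    X_train, Y_train = np.random.rand(512, n_features), np.random.rand(512, n_targets)
    X_val, Y_val = np.random.rand(128, n_features), np.random.rand(128, n_targets)

    n_layers, n_hidden, dropout_rate = 3, 256, 0.1     # hyperparameters to be tuned

    layers = []
    for _ in range(n_layers):
        layers.append(keras.layers.Dense(n_hidden, activation="relu"))
        layers.append(keras.layers.Dropout(dropout_rate))
    layers.append(keras.layers.Dense(n_targets))       # linear output layer for regression
    model = keras.Sequential(layers)

    model.compile(optimizer="adam", loss="mse")        # least-square loss for regression
    model.fit(X_train, Y_train,
              batch_size=64, epochs=100,               # one epoch = one pass over all training samples
              validation_data=(X_val, Y_val),
              # stop once the validation error stops improving (starts to rise)
              callbacks=[keras.callbacks.EarlyStopping(patience=5)])

For a classification problem, the last layer would instead use a softmax activation and the loss would be the cross-entropy function.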

Compared to linear regression or logistic regression methods, this DNN solution has the conceptual advantage of first transforming the input features non-linearly into a set of new features, i.e., it can learn new features automatically.  In this regard, the traditional ML methods are just special cases of DNN that use the input features with an identity transformation. Therefore, we generally expect the performance of DNN to be better than, or at least as good as, traditional machine learning methods, as long as the DNN is appropriately constructed (i.e., can capture the true transformation function $\phi$).  This is confirmed by many published deep learning applications.

Let us look at a bioinformatics application next, where DNN is used to regress gene expression levels (next blog).


Reference


  1. Chollet. Deep Learning with Python, page 4.
  2. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  3. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50(5): 742-754.
  4. https://en.wikipedia.org/wiki/Position_weight_matrix
  5. http://neuralnetworksanddeeplearning.com/chap4.html
  6. Chollet. Deep Learning with Python, page 8.
  7. https://en.wikipedia.org/wiki/Go_(game)
  8. https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
  9. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
  10. https://arxiv.org/pdf/1512.03385v1.pdf
  11. https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
  12. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. Appendix D, page511-518.
  13. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. Page298-299.
  14. Ruder, https://arxiv.org/pdf/1609.04747
  15. http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf


Tuesday, December 19, 2017

Multiclass Classifier

In Chapter 4 we discussed how to solve a binary classification problem using either logistic regression (LR) or support vector machines (SVM).  More complex tasks, such as classifying an image from ImageNet into one of 1000 pre-defined object classes, are multiclass classification problems.  If the true class for a given data record $\mathbf{x}_{i}$ is $k$, we encode its true label $\mathbf{y}_i$ as a one-hot vector (all vector elements are zero, except the $k$-th element, which is one):

$$\begin{equation} y_{ji} = \delta_{kj}, \quad
\text{where}\; \delta_{kj} = 1 \;\text{iff}\; k = j. \end{equation}$$

Logistic Regression


LR is a linear classifier, i.e., we first seek linear functions, one per class (we omit the bias term for simplicity):

$$\begin{equation}s_{ki} = \mathbf{w}_{k}\mathbf{x}_i\end{equation},$$
where $k = 1, 2, \cdots, K$ and $i = 1, 2, \cdots, N$.

Since we aim to predict the probability of $\mathbf{x}_i$ belonging to class $k$, we need to convert the vector $\left[ s_{1i}, s_{2i}, \cdots, s_{Ki} \right]$ into a probability vector.  Borrowing the concept behind Equation (3.6), we can use the $softmax$ function to obtain the probability vector:

$$\begin{equation}f_{ki} =  \frac{e^{s_{ki}}}{\sum_{j=1}^{K} \; e^{s_{ji}}}\end{equation}$$

The above transformation guarantees that the resultant vector satisfies the definition of a probability distribution, i.e., all $K$ elements are in $[0, 1]$ and they sum to one (Figure 1).

Figure 1. A neural-network-style diagram of the multiclass LR.  The intermediate feature vector $\mathbf{s}$ is a linear combination of the raw input features $\mathbf{x}$; the $\mathbf{s}$ vector is then normalized by the $softmax$ function into a probability vector $\mathbf{f}$ for output.
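As a small numerical illustration of the softmax transformation (Equation 3), here is a hedged numpy sketch with made-up scores:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max())      # subtracting the max keeps exp() numerically stable
        return e / e.sum()

    f = softmax(np.array([2.0, 1.0, 0.1]))
    print(f, f.sum())                # all elements in [0, 1], summing to one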

Following the same Bayesian approach as described in Equation (4.10), we then seek to obtain the optimal $\mathbf{w}^*$ through minimizing a loss function $L$:

$$\begin{equation} \begin{aligned} \mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1}^K \;-y_{ji} \log f_{ji} \\ &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; -y_{ki} \log f_{ki} \end{aligned}\end{equation}, $$
which is basically a cross-entropy loss term, a.k.a. K-L divergence (Equation 3.14).  In Equation 4, we try to make our probability model $f_{ji}$ match the observations $y_{ji}$ as closely as possible, considering all data points across all possible classes.

The gradient descent (GD) method can be applied to minimize Equation 4.  The gradient with respect to the parameters $\mathbf{w}$ is well behaved:

$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}} \left( \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} \right) &= \sum_{i=1}^N \; -\frac{y_{ki}}{f_{ki}} \nabla_{\mathbf{w}}f_{ki}. \end{aligned} \end{equation}$$

Considering Equation 3 gives:
$$\begin{equation} \nabla_{\mathbf{w}_k} f_{ki} = f_{ki}(1-f_{ki}) \mathbf{x}_i \end{equation}$$

Equation 5 can be simplified into:

$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_k} \left( \sum_{i:\, y_i = k} \; -\log f_{ki} \right) &= \sum_{i:\, y_i = k} \; -(1-f_{ki})\mathbf{x}_i. \end{aligned} \end{equation}$$

When our prediction is very wrong, i.e., $f_{ki}$ approaches zero, $\mathbf{w}$ moves in the negative gradient direction in the next iteration, which leads to an increase in $f_{ki}$, exactly what we want.

Readers unfamiliar with the Bayesian framework might intuitively propose minimizing the least-square error as an alternative loss function, of the form:
$$\begin{equation} \begin{aligned} \sum_{i=1}^N \; \sum_{j=1}^K \; L_{ji} &= \sum_{i=1}^N \; \sum_{j=1}^K \; {( f_{ji} - y_{ji})}^2. \end{aligned} \end{equation}$$

However, this form is inferior and should be avoided, because the gradient of Equation 8 can be derived as:
$$\begin{equation} \begin{aligned} \nabla_{\mathbf{w}_j} \left( \sum_{i=1}^N \; L_{ji} \right) &= \sum_{i=1}^N \; 2(f_{ji}-y_{ji})f_{ji}(1-f_{ji}) \mathbf{x}_i. \end{aligned} \end{equation}$$

This poses a serious issue.  Imagine the true class label $y_{ji}$ is one, but our prediction $f_{ji}$ is very small.  The feedback $(f_{ji}-y_{ji})$ is attenuated by a very small factor $f_{ji}$, so the signal to tune $\mathbf{w}$ in the next round is too weak. The same is true when $y_{ji}$ is zero and our prediction is very wrong, with $f_{ji}$ close to one.  Basically, when we make a very wrong prediction, where $f_{ji}$ is close to zero or one, the gradient becomes too weak and $\mathbf{w}$ can get stuck at the wrong value.  Therefore, for classification problems, we always use cross-entropy as the loss function and never the least-square form.
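A tiny numerical check of this argument, plugging a confidently wrong prediction into the cross-entropy gradient and the least-square gradient derived above (the numbers are made up for illustration):

    # A confidently wrong prediction for the true class: y = 1 but f is nearly zero.
    f, y, x = 0.001, 1.0, 1.0

    grad_cross_entropy = -(1.0 - f) * x                    # cross-entropy form: about -0.999
    grad_least_square = 2 * (f - y) * f * (1.0 - f) * x    # least-square form: about -0.002

    print(grad_cross_entropy, grad_least_square)

The cross-entropy gradient stays near -1 and keeps pushing the parameters, while the least-square gradient has nearly vanished.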

Multitask Classifier


There is a related problem called multitask classification.  For instance, for a given compound, we would like to predict whether it is active in a panel of $K$ assays [1].  Here the status is non-exclusive, i.e., a compound can be active in multiple assays.  We basically predict a target vector of $K$ elements, where each target element can be either 0 or 1, independently.  This can be solved by treating the task as constructing $K$ independent binary LR classifiers (Figure 2), where for the $j$-th classifier:

$$\begin{equation} \begin{aligned} s_{ji} &= \mathbf{w}_{j}\mathbf{x}_i, \\
f_{ji} &= \frac{1}{1+ e^{-s_{ji}}}. \end{aligned} \end{equation}$$

Figure 2. A neural-network-style diagram of the multitask LR.  The intermediate feature vector $\mathbf{s}$ is a linear combination of the raw input features $\mathbf{x}$; each element of $\mathbf{s}$ is then individually normalized by the logistic transformation into a probability value for the corresponding class membership in the output.

The reason we mention the multitask classifier is that in many cases we do not apply LR to the initial input feature vectors $\mathbf{x}$, but instead first apply a non-linear transformation $\mathbf{x} \to \mathbf{z}$ and then do the linear classification (see Equation 4.23).  The multitask classification described here becomes a very effective learning solution when the $K$ tasks share the same underlying transformation function $\mathbf{z} = \phi(\mathbf{x})$, and $\phi$ is independent of the task.  In future deep learning chapters, we will learn that $\mathbf{z}$ is a new set of features learned from the raw input $\mathbf{x}$, and the $K$ tasks use the same feature learning approach for different classification purposes (Figure 3).  By sharing the $\phi$ function, the learning of the optimal feature transformation can take advantage of the training data sets of all $K$ classes, instead of training a separate $\phi$ function per task.  The aggregated training data provide a better learning curve and can lead to a more robust $\phi$, and therefore more robust new features $\mathbf{z}$.

Figure 3. A neural-network-style diagram of the multitask LR with feature sharing.  The raw input features $\mathbf{x}$ are first transformed, typically non-linearly, into a new feature space $\mathbf{z}$, which is then used for the multitask classification problem in exactly the same manner as shown in Figure 2.  The main difference is that the transformed feature $\mathbf{z}$ is shared by all tasks; the $\mathbf{x} \to \mathbf{z}$ transformation can therefore be trained using all training data points regardless of their task labels, instead of being trained independently, one per task.
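To make Figure 3 concrete, here is a minimal Keras-style sketch of a shared transformation followed by $K$ independent sigmoid outputs; the input size, trunk width, and task count are placeholder values, not taken from the reference.

    from tensorflow import keras

    n_features, n_tasks = 100, 5                       # placeholder sizes for illustration

    inputs = keras.Input(shape=(n_features,))
    z = keras.layers.Dense(64, activation="relu")(inputs)   # shared transformation x -> z
    z = keras.layers.Dense(64, activation="relu")(z)
    # K independent sigmoid outputs, one per task, all reading the same shared features z.
    outputs = keras.layers.Dense(n_tasks, activation="sigmoid")(z)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")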


Support Vector Machines


SVM is designed as a binary classifier, so there are typically two ways to apply SVM to multiclass classification [2].  The first choice is to build $K$ SVM classifiers, each trained with data from class $k$ against data from all the remaining classes, the so-called one-vs-all approach.  The class corresponding to the largest margin (i.e., where the data point is furthest away from the SVM boundary) is chosen as the predicted label.  The second choice is to build $K(K-1)/2$ pairwise SVMs between the data sets of every two classes; the class obtaining the largest number of winning votes is predicted, and this is called the one-vs-one approach.  Even if all $K$ training data sets are roughly of the same size, each one-vs-all classifier is trained on unbalanced data, so one-vs-one might be preferred.  In addition, one-vs-one is typically faster to train due to the much smaller training size involved.  The popular library libsvm uses one-vs-one [3] (a scikit-learn sketch of both strategies follows below).
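As a quick hedged illustration, scikit-learn provides wrappers for both strategies; the toy data set and the linear SVM base classifier below are arbitrary choices for demonstration only.

    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)                  # a small 3-class toy data set

    # One-vs-all: K binary SVMs, each class against the rest; the largest decision value wins.
    ova = OneVsRestClassifier(LinearSVC()).fit(X, y)

    # One-vs-one: K(K-1)/2 pairwise binary SVMs; the class with the most winning votes is predicted.
    ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

    print(ova.predict(X[:5]), ovo.predict(X[:5]))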

Here, instead, we apply SVM directly to the multiclass classification problem using a soft-margin SVM loss function.  The basic idea of soft-margin SVM is to penalize only those data points that are not far enough from a predefined margin $\Delta$.  Data points that are already correctly predicted and have a distance larger than $\Delta$ receive no penalty; data points that are even further away do not receive extra reward either.  With this in mind, we can write the loss function for a multiclass SVM classifier and train all classes in one shot, a.k.a. the "single machine" approach.  Assume $s_{ji}$ represents the distance of data point $i$ to the linear boundary of class $j$.  For data point $i$ with class label $y_i = k$, it is a clean prediction if its distance to class $k$, $s_{ki}$, is at least $\Delta$ greater than its distances to all the remaining $K-1$ classes; otherwise, a penalty occurs [4]:

$$\begin{equation} \begin{aligned}\mathbf{w}^* &= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \;  L_{i} \\
&= \arg\min_{\mathbf{w}} \; \sum_{i=1}^N \; \sum_{j=1, j \neq y_i}^K \;  \max(- (s_{y_i i} - s_{ji} - \Delta), 0) \end{aligned} \end{equation}. $$
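The soft-margin loss above can be evaluated in vectorized form; below is a hedged numpy sketch (the example scores are made up), noting that $\max(-(s_{y_i i} - s_{ji} - \Delta), 0) = \max(s_{ji} - s_{y_i i} + \Delta, 0)$.

    import numpy as np

    def multiclass_hinge_loss(S, y, delta=1.0):
        """Vectorized soft-margin multiclass SVM loss.

        S : (N, K) array of scores s_{ji} for sample i and class j.
        y : (N,) integer array of true class labels.
        """
        N = S.shape[0]
        correct = S[np.arange(N), y][:, None]              # s_{y_i i}, shape (N, 1)
        margins = np.maximum(0.0, S - correct + delta)     # max(s_{ji} - s_{y_i i} + delta, 0)
        margins[np.arange(N), y] = 0.0                     # exclude the j = y_i terms
        return margins.sum()

    # Example: 2 samples, 3 classes.
    S = np.array([[3.0, 1.0, 0.5], [0.2, 2.0, 2.5]])
    print(multiclass_hinge_loss(S, np.array([0, 1])))      # only the second sample is penalized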

It should be noted that the above is not the only loss form for SVM; there are other forms, such as the one proposed in the original multiclass SVM paper [5].  However, for deep learning applications we rarely see multiclass SVM used, probably because it does not provide probabilistic predictions and it is less well documented.  Multiclass LR has become the method of choice.

Reference



  1. https://arxiv.org/pdf/1406.1231.pdf
  2. http://www.mit.edu/~9.520/spring09/Classes/multiclass.pdf
  3. https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f419
  4. http://cs231n.github.io/linear-classify/
  5. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf

Thursday, November 2, 2017

What is Cloud Computing?

The Electricity Analogy

More than a century ago, if the acronym CEO had existed at the time, it would have stood for Chief Electricity Officer.   Electricity had by then become an essential ingredient of the manufacturing industry; however, because it existed in the form of direct current, electricity could not be transmitted far before its energy dissipated within metal wires.  Back in the 1800s, individual businesses built their own power generators. Sitting next to a company, whether it was a steel mill or a factory, was its own power plant (Figure 1).  Only after Nikola Tesla pioneered alternating-current systems at the end of the 1800s did long-distance electricity transmission become feasible, gradually leading to the modern electrical grid.  Today nobody worries about electricity generation; given a new electronic appliance, we simply plug and play.

Figure 1. Thomas Edison’s Pearl Street Station (New York City) electrified a one-square-mile area (yellow highlighted) of Lower Manhattan beginning September 4, 1882, providing power to illuminate his new electric light bulbs. The world’s first commercial electric power plant operated until it was destroyed by fire in 1890.  Upper-right is the Dynamo direct current generator used. (credit: https://www.embedded.com/print/4404090)
The analogy of Cloud Computing as the new Electrical Grid makes it far easier to understand. The power generator used to be local, part of the power consumption unit, and it gradually migrated to remote sites, enabled by electricity transmission technology.  Similarly, the CPU (computing resource) was mostly local, and it is now in the process of migrating to remote sites, enabled by the Internet.  Resources in the Cloud are forming a gigantic computing grid, very much tracing the historical trajectory of the formation of the electricity grid.  Here we will explain why Cloud Computing is different from your company's own data center and why it is not just a fancier hype word for remote computing.

As far as it relates to computers, the term "cloud" dates back to early network design, when engineers would map out all the various components of their networks, but then loosely sketch the unknown networks (like the Internet) that theirs was hooked into. What does a rough blob of undefined nodes look like? A cloud (Figure 2).  Imagine I am giving this Cloud Computing presentation in a company's conference room, with PowerPoint slides opened from a file hosted on a server sitting in the company's data center.  For the sake of understanding how the file was transferred to the presenting computer and how the video signal is sent to the overhead projector, the exact server and network architectures within the data center are irrelevant; therefore, we can simply represent our data center as a Cloud symbol.

Figure 2. The Cloud symbol is something that takes little skill to draw: a squiggly line formed into a rough ellipse. Over time, clouds were adopted as the stand-in image for the part of a computer or telephone network outside one's own.

High Availability & Fault Tolerance

By using a network resource instead of a local device, data is available on multiple devices and remains available when some device or cloud component goes down (thanks to built-in redundancy) (Figure 3).  However, these are just necessary conditions for the Cloud definition, not sufficient ones.  We mostly gain these two benefits by using our company's data center as well, yet our data center is not a Cloud.  What, then, are the unique characteristics of the Cloud?

Figure 3. By storing data remotely in the Cloud, we gain high availability and fault tolerance. (credit:https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1) 
Scalability

In this big data era, we are all facing an explosion of data; the only difference is the extent.  IT departments in all companies are busy managing constantly growing demands for larger storage capacity, more powerful databases, and subsequently more computing power.

Imagine we are a small startup company with 3 servers supporting 1,000 users in 2016.  In 2017, our user base grew to 5,000 and we doubled our servers to 6.  Our projection for 2018 is 20,000 users, therefore we propose a budget to buy 9 more servers.  We need to work out a convincing proposal, manage to get capital approval, obtain quotes from multiple vendors, go through the procurement process, find or build additional rack space in our data center, wire the hardware, and install and configure the software ecosystem on these servers.  Even for the most efficient companies, the process can easily take months, besides the large lump-sum capital investment.  Scalability is what a company's own data center cannot provide.  Using the electricity analogy, this headache exists because we are running our own power generator.

Figure 4. On-premise scale-up is costly both financially and time-wise.
(credit:https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1) 
What if we could go to a Cloud provider instead?  With a few clicks, we specify the CPU, memory, and storage requirements for our nine servers, as well as the operating systems we desire.  In a few minutes the servers are up and running, and better yet, we pay less than one dollar per hour for each of them; no capital budget is required.  That is what Cloud Computing promises, and it fundamentally changes the way we think about and use computing resources, moving towards the plug-and-play utility model.

Elasticity

Growth is always hard to predict accurately.  In the previous example, what if our user base does not grow to 20,000 but only to 7,000 next year?  Eight out of the nine servers would sit wasted, which is especially undesirable, as seed funding is particularly limited for our little startup company.  Even if we made an accurate prediction, the nine servers would only gradually come into use as we approach the end of 2018.  So on-premise computing infrastructure cannot be sized down.  As the scalability limitation previously mentioned is often addressed by overshooting the resource needs, waste is hard to avoid.
Figure 4. On-premise scale-down is costly due to the waste of resources.
(credit:https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1) 
With our Cloud provider, since the overhead of getting a new server is so light, our server farm only scales up when we really need it to, thus avoiding the need to size down.

Let us give another example to illustrate the importance of elasticity.  Retailers such as Walmart earn their revenues mostly in the last two months of the year.  If their in-house IT department purchases a battery of servers in order to accommodate the computing needs of the busy November and December, those resources will sit idle when January hits, until the next shopping season.  With a Cloud provider, they can shut down and return any extra computing resources in January and immediately stop paying for them.

The Cloud provider can absorb the risks of scalability and elasticity through economy of scale.  Just as power plants form a giant electrical grid to balance power consumption, the Cloud provider leverages the fluctuations among the needs of its tenants and is in a much better position to provide or take back resources on demand.

What is Cloud?

Therefore, the Cloud can be viewed as a virtually infinite storage and computing resource, where we can size up and down as we wish and only pay for what is really being used.  It is just like the plug-and-play model of electricity usage.  The Cloud brings efficiency by eliminating the operational overhead of scaling up, as well as significantly reducing cost by saving up-front investment and avoiding size-down waste.

However, Cloud Computing is not entirely plug-and-play yet.  The Electrical Grid has been polished over a century to provide a set of standard parameters: voltage and frequency.  But a computing resource is much more complicated to define.  Computing resources rely on a software ecosystem, where we do not yet have a standardized platform.  Linux, Windows, OSX, iOS, and Android all have their own advantages for specific applications, and there is no one-size-fits-all in the near future.  In addition, Cloud Computing is provided remotely via the Internet; network latency is rather significant, not because of the speed of light, but because of the relay of our data packets over dozens of servers sitting between us and the Cloud provider.  When Cloud computing needs to be fueled by our own data (which is not the case for electricity), it probably does not make sense to transfer terabytes of data into the Cloud just for a short computation.

Last but not least is the pace of innovation.  Innovation in electricity generation and distribution happens on the scale of decades or centuries. In contrast, Moore's Law is measured in months. As the Cloud is largely a technology, not a science, what survives is not necessarily the most correct solution; usually the most popular one eventually becomes the standard.  The Cloud is at its nascent stage, which means lots of new technologies come and go, and constant learning is the key.  So which Cloud platform is winning at the moment?  At this point, Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform are the three largest Cloud providers.  AWS pretty much dominates the market, especially in terms of providing raw computing resources (IaaS, Infrastructure as a Service), therefore we will briefly explain the main concepts of AWS.

Amazon Web Services (AWS)

AWS has data centers in many geo-locations (Figure 5).  Each location represents a "region", e.g., there is a region near San Francisco, one in Oregon, another in Virginia, etc. Each region is a circle on the world map.  For a company to use AWS, it should first choose a region and establish a virtual data center there, called a Virtual Private Cloud (VPC).  Within a region, say Oregon, there are multiple data centers called Availability Zones (AZs).  Oregon has three AZs, called a, b, and c.  Each AZ (data center) consists of racks and racks of computers, and AWS allocates and reclaims these computers to/from different tenants based on their demands.  Our VPC is a logical concept; when it is first created, it contains no computers at all.  Only when we request one does AWS take a free instance and virtually place it into our VPC (more precisely, it just labels it so in its resource management database).  To use a VPC, we first need to create a few data structures called subnets, in order to pre-allocate our IP address space and avoid future conflicts.  Each subnet has to be in one AZ (it cannot span AZs); although we draw an orange box in Figure 5 to represent a subnet, a subnet actually consists of computers allocated by AWS on demand, scattered around within that AZ.  This is the Cloud, so we do not care where our computers physically are or how they are networked; it is sufficient to imagine that we have one virtual data center consisting of a few sub-networks, on each of which we can apply different routing and network security settings.  E.g., a subnet for our database servers that do not need to talk to anything except the web servers, and another subnet hosting web servers that can talk to the database subnet and also to the Internet to process HTTP requests, etc.

In Figure 5, we create a VPC in Oregon and pre-allocate four subnets, each with a unique IP range holding up to 256 addresses (251 usable, to be precise, due to some reserved IPs), say 10.0.0.*/24, 10.0.1.*/24, 10.0.2.*/24 and 10.0.3.*/24.  Logically, we can imagine that any computers we launch within our VPC are on one giant local Ethernet (LAN) and can talk to each other as routing permits.
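For illustration only, here is a hypothetical boto3 sketch of provisioning such a VPC and subnets; it assumes AWS credentials are already configured, and the AZ assignments are placeholders following the example above, not a recommended layout.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")          # the Oregon region

    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")               # the VPC's overall address space
    vpc_id = vpc["Vpc"]["VpcId"]

    # Pre-allocate four /24 subnets; each subnet must live inside a single Availability Zone.
    subnets = [("10.0.0.0/24", "us-west-2a"), ("10.0.1.0/24", "us-west-2b"),
               ("10.0.2.0/24", "us-west-2c"), ("10.0.3.0/24", "us-west-2a")]
    for cidr, az in subnets:
        ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)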

For data scientists to use AWS, it does put a burden on us to learn some basic skills that we used to delegate to our UNIX administrators.  Cloud Computing is most effective when we use a self-service model; otherwise, relying on a ticketing system handled by a Cloud administrator, we will again have to wait hours or days for a resource we could have obtained within minutes.  Cloud Computing is here to stay and it is definitely the future, therefore it is worthwhile for data scientists to learn the basics of AWS.

Some basic tasks include how to spin up a computer (called an EC2 instance), how to "sudo" install the packages needed (say Python 3), how to install modules (say with Python's "pip"), etc.  But do not be scared: to compile and/or install a software package, we have the freedom to specify the hardware and to choose the operating system and release we need.  This is a luxury in-house UNIX administrators dream of having.  Furthermore, we can take advantage of pre-packaged software containers (such as Docker containers) or even use a whole Amazon Machine Image (AMI) to provision a machine ecosystem.  This is not an AWS usage tutorial; AWS is best learned through hands-on lab sessions, which are not covered in this short blog.

Figure 5.  The high-level architecture of a company's virtual data center in AWS.

We have explained here what Cloud Computing is, and I hope you can now appreciate that the Cloud is not simply a buzzword; it brings us a data engineering revolution, just like going from using a power generator to relying on an electrical grid.

Reference

The information is compiled from a few resources:
  1. Video: AWS Concepts: Understanding the Course Material & Features
    https://www.youtube.com/watch?v=LKStwibxbR0&list=PLv2a_5pNAko2Jl4Ks7V428ttvy-Fj4NKU&index=1
  2. https://cacm.acm.org/magazines/2010/5/87274-cloud-computing-and-electricity-beyond-the-utility-model/fulltext
  3. https://www.theatlantic.com/technology/archive/2011/09/clouds-the-most-useful-metaphor-of-all-time/245851/
  4. https://www.quora.com/How-is-cloud-computing-analogous-to-electricity-grid