Resources
In the past year or so, several high-quality deep learning materials have appeared. If you start learning deep learning now, you are super lucky.
Recently my favorite deep learning book has become Francois Chollet's "Deep Learning with Python". Chollet is also the author of the Keras package. The example code in the book is significantly easier to read, thanks to the Keras library. Other books use code written in TensorFlow, the lower-level library underlying Keras, which can be much harder to understand. Chollet is not shy about providing his own insights on the field, especially in the overview (Chapter 1) and the forward-looking opinions (Chapter 9), which I found extremely helpful.
My second favorite is Aurelien Geron's "Hands-On Machine Learning with Scikit-Learn and TensorFlow". I also read "Learning TensorFlow: A Guide to Building Deep Learning Systems" by Tom Hope, Yehezkel S. Resheff, and Itay Lieder, which I liked. I also like "Deep Learning with Keras" by Antonio Gulli and Sujit Pal, especially its first chapter, where the authors carefully tested and compared how different hyperparameters affect performance on the MNIST classification task. Last, I cannot fail to mention the deep learning bible: Ian Goodfellow, Yoshua Bengio, and Aaron Courville's "Deep Learning". However, it is too heavy to be a good introductory book; save it as an advanced-level book after you have gone through other introductory material. If you can only read one book, definitely go with "Deep Learning with Python".
I also recommend the excellent video lectures from Stanford's CS231n (up to lecture 10):
https://www.youtube.com/watch?v=NfnWJUyUJYU&t=2s (2016)
Some newer topics can be found in http://cs231n.stanford.edu/syllabus.html (2017)
For Chinese readers, an excellent video course is offered by Hung-yi Lee. Distinct from other lectures, Dr. Lee offers many of his own insights; this is extremely helpful, as a systematic theoretical framework is still unavailable for this field.
https://www.youtube.com/watch?v=CXgbekl66jc&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
If you happen to have a subscription to https://www.safaribooksonline.com (not free), I recommend an excellent video course by Jon Krohn, titled "Deep Learning with TensorFlow: Applications of Deep Learning to Machine Learning Tasks".
Feature Learning
First, a quick review. Artificial Intelligence (AI) is "the effort to automate intellectual tasks normally performed by humans" [1]. So strictly speaking, Excel is an AI program, although we don't normally think of it that way. The concept of AI is very broad: in many cases we come up with rules based on our human intelligence, then code these rules into a program. The program can then convert data into output, as if our brain were behind it. We are the intelligent piece here. Machine learning (ML) is a type of AI where the machine writes the rules. As demonstrated in Figure 1, human intelligence goes into classical programming, which is the hardest part of the job. In machine learning, however, we only provide example data and answers; ML then writes the rules for us, so the machine provides the intelligence (i.e., artificial intelligence), and the output of ML is a set of rules. These rules can then be used in the classical programming paradigm to answer future data. This blog series is dedicated to the machine learning portion of AI.
Figure 1. Taken from Figure 1.2 in Chollet "Deep Learning with Python", page 5.
We discussed some core machine learning theories in our previous blogs. Machine learning actually consists of two parts: "feature learning" and "model optimization". Previously, we mostly applied linear analysis tools such as linear regression, logistic regression, or SVM to raw input feature vectors $\mathbf{x}$. We have been assuming the input features are already well designed, e.g., they may be Scale-Invariant Feature Transform (SIFT) features for images [2], ECFP fingerprints or physiochemical properties for compounds [3], or sequence motifs (position weight profiles) for genes or proteins [4]. Typically, the performance of our machines depends heavily on the quality of such features; good features such as SIFT require decades of hard work by domain experts. If we ignore the feature learning portion, the model optimization portion is really just a numerical global parameter optimization problem; therefore, people sometimes jokingly call machine learning "machine optimization", as it normally has no "learning" component. Where is the "learning" in machine learning? The answer is feature learning, i.e., projecting raw data into meaningful features that make the task easy; this has been the most difficult part of machine learning and the part that requires serious human intelligence. Previously, smart features such as SIFT or ECFP were learned predominantly by human beings instead of by machines. The traditional machine learning field has focused on the optimization part of the equation, while doing very little on the feature learning part. This situation has now been fundamentally changed by deep learning. As we will learn later, deep learning devises a new strategy for feature learning, while it mostly inherits the model optimization portion, which we have already learned.
Feature learning basically means we take the raw input feature vector $\mathbf{x}$, apply some non-linear transformation, and turn it into a new feature vector $\mathbf{z}$; the goal is to find the optimal transformation function so that the best overall machine learning performance can be achieved. The non-linear kernels in SVM took a nice step toward addressing the feature learning challenge, where the kernel effectively introduces a transformation function $\mathbf{z} = \phi(\mathbf{x}, \gamma)$, projecting $\mathbf{x}$ into $\mathbf{z}$. $\phi$ is determined by which kernel is chosen, and its exact form can be optimized through its hyperparameter $\gamma$ (see previous blog). There are still two limitations to the SVM kernel approach. First, we only explore the family of transformation functions given by a specified kernel, such as a Gaussian kernel (e.g., Equation 22 in a previous blog). That is, this feature learning strategy is not very flexible: if the optimal transformation function does not look like what the Gaussian kernel implies, feature learning will not perform well. In reality, we can imagine $\phi$ may take many forms, so ideally we need a learning mechanism capable of exploring all possible functional forms. Second, the kernel strategy requires us to calculate similarities among all pairs of training data points, as well as similarities between a future data point and all training data points; therefore, kernel-based machines are too computationally intensive, in both memory and CPU, to be applicable to large data sets.
To summarize, a real machine learning process should include a feature learning capability, where we can explore the function space $\left\{ \phi \right\}$ to find an arbitrary optimal transformation by tuning model parameters. That is, we want a model that can represent a Gaussian function, a polynomial function, a sine function, or basically any functional form! The neural network (NN) is an answer to this requirement. Figure 2 shows a single-layer neural network, where the input is our raw feature $\mathbf{x}$, the output is the learned feature $\mathbf{z}$, and we assume:
$$\begin{equation} z_i= \sigma(\sum_{j} w_{ij}x_j+b_i) \end{equation}.$$
Figure 2. One layer of a neural network, where each node is a linear combination of the previous layer's outputs, then subjected to a non-linear transformation to produce an output activation.
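To make Equation 1 concrete, here is a minimal numpy sketch of one such layer, with the sigmoid chosen as $\sigma$ (other choices of $\sigma$ are discussed later); the layer sizes are arbitrary illustration values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nn_layer(x, W, b):
    """One NN layer (Equation 1): z_i = sigma(sum_j w_ij * x_j + b_i)."""
    return sigmoid(W @ x + b)

x = np.random.randn(4)        # raw input features
W = np.random.randn(3, 4)     # weights: 3 output neurons, 4 input features
b = np.random.randn(3)        # one bias per output neuron
z = nn_layer(x, W, b)         # learned (transformed) features
print(z.shape)                # (3,)
```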
For simplicity, we will assume the offset term $b_i$ is already included in the linear transformation $\mathbf{w}_i\mathbf{x}$, so $\mathbf{w}_i\mathbf{x}$ implies $\mathbf{w}_i\mathbf{x}+b_i$. Let us assume for now that the non-linear function $\sigma$ is the $\text{sign}$ function (i.e., a step function); we can then show that a sufficiently large single-layer neural network can approximate any transformation function $\phi$. Consider the special case where the input contains only one feature -- a scalar $x$. We have two neurons $z_1$ and $z_2$ defined as (Figure 3):
$$\begin{equation} \begin{aligned} z_1 &= \text{sign}(x-a) \\ z_2 &= \text{sign}(a+\delta-x) \\ f(x) &= \frac{h}{2}z_1 + \frac{h}{2}z_2 \end{aligned} \end{equation},$$
The equation above implies our NN parameters are $w_1 = 1$, $b_1 = -a$, $w_2=-1$, and $b_2=a+\delta$. The output $f(x)$ is the (scaled) sum of $z_1$ and $z_2$, which corresponds to a function element starting at $a$ with width $\delta$ and height $h$ (Figure 3).
Figure 3. A pair of $z$ nodes combined with a non-linear sign function can produce a function element of width $\delta$ and height $h$, sitting at location $a$.
Now, with many such $z$-pairs, we can construct multiple function elements centered at various locations with various heights, which can then be combined into a "bar graph" that approximates any arbitrary function $f(x)$ at the output node (Figure 4). Notice that for each given $x$ value, only one pair of $z$ neurons is activated, which gives an output of $f(x)$.
Figure 4. Using many structure elements of the kind shown in Figure 3, we can approximate any $f(x)$ with a single layer of NN.
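Here is a minimal numpy sketch of this bar-graph construction (Equation 2): each pair of sign neurons produces one bar of width $\delta$, and summing many bars approximates an arbitrary target function (the target below is just an illustrative example).

```python
import numpy as np

def target(x):
    return np.sin(x) + 0.5 * x                   # an arbitrary function to approximate

def bar_graph_nn(x, a_grid, delta, heights):
    """Sum of bar elements, each built from one pair of sign neurons (Equation 2)."""
    y = np.zeros_like(x)
    for a, h in zip(a_grid, heights):
        z1 = np.sign(x - a)                      # +1 to the right of a
        z2 = np.sign(a + delta - x)              # +1 to the left of a + delta
        y += 0.5 * h * (z1 + z2)                 # equals h only inside (a, a + delta)
    return y

delta = 0.1
a_grid = np.arange(0.0, 2 * np.pi, delta)        # bar locations
heights = target(a_grid + delta / 2)             # bar height = f at each bin center

x = np.linspace(0.0, 2 * np.pi, 1000)
print(np.abs(bar_graph_nn(x, a_grid, delta, heights) - target(x)).max())  # small; shrinks as delta shrinks
```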
A more detailed proof can be found in [5], and there are more rigorous proofs showing that $\sigma$ can take other non-linear forms. Based on this simple example, let us take it for granted that a NN can approximate any transformation function, with the exact shape of the function represented by the network parameters $\mathbf{w}$ and $\mathbf{b}$. Certainly, the ability of a NN to mimic a function depends on how many neurons it uses, but at least NN provides a framework to mimic any function. Once the weights in a NN are fixed, $\mathbf{x}$ is transformed into $\mathbf{z}$ correspondingly, which solves the feature learning problem; then a traditional linear regression or logistic regression engine discussed previously can be hooked to the $\mathbf{z}$ output to complete the machine optimization part of the puzzle. The combination of NN feature transformation (or feature learning) and the traditional regression/classification engine produces a mighty machine that can consume raw feature input directly, as it is now theoretically capable of learning any $\phi$ without the need for human learning. The optimal parameters of the machine can be determined by minimizing the loss function for the task at hand; popular loss functions are described in the multiclass classifier chapter. From now on, we can approximately equate a neural network to an arbitrary unknown function $f$, i.e., $NN = f(x)$. I cannot overemphasize how much this view has helped me in understanding deep learning: as long as we can describe a problem as a function, we are able to apply deep learning to solve it by approximating that function. We will use the GO game as an example later to further illustrate this philosophy.
The general NN architecture is outlined in Figure 5, where we use a NN layer for the $X \to Z$ transformation (from the input layer to a hidden layer), then use a classical ML layer (which may not contain $\sigma$) to regress or classify (from the hidden layer to the final output layer). The $X$ layer is called the input layer, the last ML part generating the output is called the output layer, and all the network structure between the input and the output layer (possibly multiple layers, although only one is shown in Figure 5) is called the hidden layer(s). In practice, since the regression/classification components can also be represented as NN-style connections (Figure 5), all layers, from the input to the final output, are collectively called a neural network. Compared to the traditional ML approach, NN goes one step further in its capability of exploring the feature transformation space. Therefore, people sometimes rightfully call deep learning "feature learning". Chollet even defined machine learning as "searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal" [6]. Deep learning pushes the envelope of that "predefined" space (such as the one defined by an SVM kernel) into something much less bounded, while the "feedback signal" refers to the loss function and represents the machine optimization aspect.
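As a minimal Keras sketch of this architecture (assuming the TensorFlow backend; the layer sizes and the binary-classification output are arbitrary choices for illustration): the hidden Dense layer performs the $X \to Z$ feature transformation, and the final layer plays the role of the classical logistic-regression engine.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='sigmoid', input_shape=(10,)),  # X -> Z: feature learning
    keras.layers.Dense(1, activation='sigmoid'),                      # Z -> y: logistic-regression-style output
])
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```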
GO Game as a Function
AlphaGo (and its second generation, AlphaZero) is a well-known example of what deep learning can do. The GO board is a $19 \times 19$ grid, where each location can have three statuses: unoccupied, or occupied by a black or a white stone. The number of possible GO layouts is unimaginably huge: $3^{361} \approx 10^{172}$. Even after excluding non-sensible layouts, there are still about $10^{170}$ possibilities [7]. Given the current board layout $L$ as the input, GO players identify some regions of interest as key battlegrounds or strategic territories, within which they consider the next move. They then evaluate those move candidates for their contributions to the final victory, oftentimes carrying out mental simulations for a number of virtual steps, and pick the move that feels optimal to them. So making a GO move is equivalent to solving the following problem:
$$\begin{equation} m^* = \arg \max_{m} J(L^\prime(L, m)), \end{equation} $$
where $L^\prime(L, m)$ is the new layout obtained by applying move $m$ to the current layout $L$, and $J$ is a scoring function measuring how favorable a layout is for the final victory.
For comparatively simpler games such as chess, $J$ may rely on heavy simulations and knowledge derived from past games, and it can be calculated reasonably well given sufficient computing power, as IBM's Deep Blue demonstrated [8]. The scoring function $J$ in GO has to take a feature vector of 361 elements, each of which can take one of three values. This function is for sure intractable. Each GO player's years of training help him/her model this function as well as possible, but they are still limited by the number of games they managed to play and by whom they played against. As there is no viable way to construct $J$ explicitly, what AlphaGo does conceptually is use a deep neural network to approximate this function; with the millions of training games it receives and millions of artificial neurons, its approximation of $J$ is more accurate than any human player can achieve using biological neural networks. AlphaGo certainly uses lots of other ingenious solutions, e.g., to reduce the sample space of ${m}$ so that it only evaluates promising moves (where lots of previous research had focused), but the core component is using a NN to model the scoring function $J$. In fact, theoretically we could even use a NN to model the $\arg \max$ operation as well, as outputting the best move $m^*$ given a layout input is also a function!
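A hedged sketch of Equation 3 in Python: `value_net` below stands in for a neural approximation of $J$, and `legal_moves`/`apply_move` are hypothetical helpers for illustration, not AlphaGo's actual implementation.

```python
import numpy as np

def best_move(layout, legal_moves, apply_move, value_net):
    """Equation 3: pick m* = argmax_m J(L'(L, m)) by scoring each candidate next layout."""
    candidate_layouts = np.stack([apply_move(layout, m) for m in legal_moves])  # L' = L'(L, m)
    scores = value_net(candidate_layouts)                                       # approximate J(L') for each move
    return legal_moves[int(np.argmax(scores))]
```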
Real life is complex, and we have learned to develop intuitions or experience for many complex tasks. It is very possible that these intuitions are basically a neural network approximation, in our brain, of the true optimal decision function in reality. We continually reevaluate and improve the model based on the lessons life teaches us; as a result, some people have developed better models than others, but all are approximations of the ultimate truth. The important concept here is to appreciate that a NN essentially does function approximation.
For traditional machine learning problems, the form of $J$ is a loss function, which can be precisely defined as previously described here. However, $J$ in GO is largely unknown, beyond the fact that it is a function. The training signal comes from the final outcome of a game, win or lose, which propagates back as a decaying reward for all moves in the game. Therefore, the shape of $J$, i.e., the parameters of the NN, is optimized within a framework that maximizes the winning outcome over millions of games of playing experience; this type of learning problem is different from either supervised or unsupervised learning, and it is called a reinforcement learning problem (I do not have experience in this area yet).
Deep Learning Implementation
"Deep"
The single-layer NN described above is not used in practice, as it is inefficient at function approximation: a decent approximation would require too many $z$ neurons on that layer. Function modeling becomes much more efficient if we allow multiple layers of neurons.
In the example below (Figure 6), our target function is a red curve consisting of three similar peaks. Say each peak can be reasonably simulated by ten blue neurons; it would then require 30 neurons to approximate this red function using a single-layer neural network. On the right of Figure 6, a two-layer network architecture is used instead, where the ten neurons in the first layer construct the structural element of one peak; in the second layer, only three neurons are needed to place the peak at three different locations and modulate the heights accordingly. By reusing the structural element (the peak) constructed in the earlier layer to model the repetitive components within the target function, we make a much more efficient use of network parameters. There are often many repetitive patterns within real-life functions; therefore, a multi-layer neural network is more cost-effective and easier to train.
Another argument for preferring DNN is hierarchical feature transformation: layer by layer, it allows simpler features to be constructed at earlier layers, and such simpler features are then combined and reused to construct more complex features at deeper layers. It would require many more neurons if one had to construct the complex features in a single step. Figure 7 shows that many icons can be built at the third hidden layer by allowing only a horizontal and a vertical line segment feature to be constructed in the first layer. Typically, the number of possible features grows exponentially as the depth increases. As another analogy, instead of drawing every leaf and branch to show a forest, we draw a leaf first, copy and paste the leaf into a branch, copy and paste the branch into a tree, and copy and paste the tree into a forest. Therefore, many non-linear functions require a DNN to be modeled efficiently, i.e., a DNN can model more complex functions by allocating neurons into different layers rather than placing them all onto one layer.
Figure 7. By using multiple layers of NN, the complexity of the features it can model grows exponentially. Such complex features would require many more nodes if only a single layer were used.
Non-linear Transformation
In the early days of deep learning, $\sigma$ took the sigmoid form, because it was believed to mimic biological neurons. However, parameter optimization of such DNNs is hard. To find the optimal weight parameters in a DNN, we typically use the stochastic gradient descent (SGD) method to descend the surface of the loss function $J$ in the NN parameter space $\theta$. The gradient $\frac{\partial J}{\partial \theta}$ must be calculated using a back-propagation process. We will skip the details of back propagation, as it is well described in the literature and is encapsulated as a built-in feature of the TensorFlow and Keras libraries. Just think of it in a hugely simplified form:
$$\begin{equation} J(w, x) = J(\sigma(wx)), \end{equation}$$
where $w$ modulates $x$, the result is transformed non-linearly by $\sigma$, and $\sigma$'s output eventually feeds into $J$. Therefore, we have:
$$\begin{equation} \begin{aligned} \frac{\partial J(w, x)}{\partial w} &= \frac{\partial J}{\partial \sigma}\frac{\partial \sigma(w, x)}{\partial w} \\
&= \frac{\partial J}{\partial \sigma}\sigma(wx)(1-\sigma(wx))x. \end{aligned} \end{equation}$$
When an intermediate neuron carries a large value of $wx$, either largely positive or largely negative, $\sigma(wx)$ is close to 1 or 0, and therefore the gradient $w$ receives is very small (Equation 5). How often does a neuron saturate? Quite often, as explained later (in the weight initialization section): since a neuron receives inputs from many parent neurons, its input tends to be much larger than the output of any single parent neuron, and when the weights are not very small, the summed signal tends to be large enough to fall into a plateau region of the sigmoid function. This is called the vanishing gradient problem, and it explains why a deeper NN is hard to train: because the $w$ in earlier layers do not effectively receive feedback (gradient) from the loss function, they cannot be tuned to improve NN performance [9]. People developed strategies such as training a DNN layer by layer, one layer at a time: only the parameters for the first layer are optimized and then fixed, the parameters of the second layer are then optimized and fixed, and so on. However, this strategy is no longer needed, thanks to the developments described below.
The currently most popular practice is to replace the sigmoid function with the much simpler ReLU function, $\max(wx, 0)$. This naive non-linear transformation basically keeps positive outputs and zeros out negative ones. The gradient therefore no longer decays for positive $wx$, no matter how large its value is. For negative $wx$, the gradient is zero and the corresponding weight receives no update from this particular training sample. But when other training samples are used, $wx$ might become positive and the weight then gets a chance to update. Even if the gradient is zero within a training batch, so that a $w$ skips an update iteration (the neuron is temporarily "deleted"), the neurons in the earlier layer do not depend on this short-circuited neuron alone; they depend on all the neurons in the current layer, so neurons in the earlier layer are still likely to receive non-zero gradient information, i.e., the short-circuiting does not propagate. In practice, ReLU is one of the two main factors that make DNNs trainable. The layer-by-layer training strategy may still help to initialize a DNN with a decent starting point, but it seems no longer necessary when there is a reasonable amount of training data. Most DNNs can now be trained in one shot, as gradient information can be propagated many layers back thanks to the ReLU function. As people have come to understand gradient propagation better, very deep DNNs, such as Microsoft's 152-layer ResNet, can be trained, as long as there is wiring in the DNN that helps pass the gradient to early layers [10].
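A small numpy sketch contrasting the two non-linearities: the sigmoid factor $\sigma(wx)(1-\sigma(wx))$ from Equation 5 vanishes for large $|wx|$, while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])       # pre-activation values w*x

sigmoid_grad = sigmoid(a) * (1.0 - sigmoid(a))    # Equation 5 factor: about 4.5e-5 at |a| = 10
relu_grad = (a > 0).astype(float)                 # 1 for positive inputs, 0 otherwise

print(sigmoid_grad)
print(relu_grad)
```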
We may still wonder why biological systems use a sigmoid-like response function and yet perform well. My thought is that the issue with the sigmoid probably originates from our simplification of the NN into a layered structure. A biological NN probably does not contain clear layer separations; i.e., there are neurons directly connecting the input layer to the output layer, others connecting them through one hidden layer, two hidden layers, etc. So a real biological NN is a mixture of NNs with varying numbers of layers: it is a general network instead of nicely layered sandwiches. This implies there exist shortcuts that pass signals by skipping some intermediate hidden layers, and these also serve as expressways to route gradient traffic to earlier layers during the back propagation stage, so the network remains trainable even when it is deep. We have seen similar ideas used in ResNet, where "skip connections" were introduced [10]; ResNet can have more than one hundred layers and remain perfectly trainable.
Initialization
When the vanishing gradient problem was identified as a cause of the difficulties in training DNNs, it was realized that the initial values of the weights play an important role in the trainability of a DNN. Taking the sigmoid function as our $\sigma$: if the weights are really small, we are basically using only its middle, linear region, which loses the benefit of non-linearity. However, when the weights are not small, the summed input sent to the sigmoid function will be large (either positively or negatively). Because a neuron in the next layer connects to many input neurons, the weighted sum easily leads to signal saturation and causes the vanishing gradient problem described above. Therefore, the idea is to initialize the weight parameters in such a way that the variances of the input and the output of a neuron layer stay at roughly the same amplitude.
This is clearly explained in a blog post [11], where we aim to keep ${\rm var}(y) = {\rm var}(x)$ for:
$$\begin{equation} y = \sum_i w_i x_i. \end{equation}$$
Given:
$$\begin{equation} \begin{aligned} {\rm var}(y) &= \sum_i {\rm var}(w_i x_i) \\
&= \sum_i {\rm var}(w_i) {\rm var}(x_i) \\
&\approx n {\rm var}(x) {\rm var}(w) \\
&\approx n {\rm var}(y) {\rm var}(w).
\end{aligned} \end{equation}$$
Here, in the second step, we assume ${\rm mean}(w)$ and ${\rm mean}(x)$ are zero; we also assume the variances are the same across $i$ for ${\rm var}(w_i)$ and ${\rm var}(x_i)$, respectively. The last line imposes our requirement ${\rm var}(x) = {\rm var}(y)$, which yields ${\rm var}(w) = \frac{1}{n}$, where $n$ is the number of input neurons.
Currently it is popular to use Xavier initialization (a.k.a. Glorot initialization) or He initialization [10]. For the ReLU function, He initialization suggests initializing the weights with a uniform distribution over $[-r, r]$, where:
$$\begin{equation} r = \sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}} \end{equation}$$
Here $n$ is effectively replaced by the average of the input and output neuron counts, so that the initialization takes into account both forward and backward propagation. Initialization can prevent signal saturation for some time, but the weights are adjusted during training, so this strategy alone may not last. The ReLU solution mentioned above is a more fundamental fix (both should be used).
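A sketch of Equation 8 in numpy, plus the corresponding Keras shortcut; note that Keras's built-in `he_uniform` initializer uses the fan-in convention ($\sqrt{6/n_{inputs}}$), a slight variation of the averaged form above.

```python
import numpy as np

n_inputs, n_outputs = 100, 100   # fan-in and fan-out of one layer

# He initialization with a uniform distribution, following Equation 8
r = np.sqrt(2.0) * np.sqrt(6.0 / (n_inputs + n_outputs))
W = np.random.uniform(-r, r, size=(n_inputs, n_outputs))
print(r, W.std())                # the std of uniform [-r, r] is roughly r / sqrt(3)

# In Keras the same idea is a one-liner, e.g.:
#   keras.layers.Dense(100, activation='relu', kernel_initializer='he_uniform')
```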
Batch Normalization
We cannot rely entirely on the initialization process to prevent signal saturation as optimization iterations proceed. Another strategy, called batch normalization, automatically rescales the output of each layer into the right range empirically. It works as follows:
$$ \begin{equation} \begin{aligned}
\mu_B &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; \mathbf{x}_i \\
{\sigma_B}^2 &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; (\mathbf{x}_i - \mu_{B})^2 \\
\hat{\mathbf{x}}_i &= \frac{\mathbf{x}_i - \mu_B}{\sqrt{ \sigma_B^2 + \epsilon}} \\
\hat{\mathbf{z}}_i &= \gamma \hat{\mathbf{x}}_i + \beta. \end{aligned} \end{equation}$$
For a given mini-batch containing ${m_B}$ data records, it calculates the mean $\mu_B$ and standard deviation $\sigma_B$ and uses them to first scale the input $\mathbf{x}$ into a normalized distribution $\hat{\mathbf{x}}$. However, since this normalized distribution may not be the optimal one for training, it then transforms $\hat{\mathbf{x}}$ further into $\hat{\mathbf{z}}$ through scaling and shifting with $\gamma$ and $\beta$, respectively. $\gamma$ and $\beta$ are treated as model parameters, which are optimized as part of $\theta$. At test time, we do not have a mini-batch but only one test sample; in that case, $\mu$ and $\sigma$ come from estimates made during training, reflecting the average mean and standard deviation of the training data set.
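A minimal numpy sketch of Equation 9 for one mini-batch (in Keras this is simply a `keras.layers.BatchNormalization()` layer, which also maintains the running mean and variance used at test time):

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Equation 9 applied to a mini-batch X of shape (m_B, n_features)."""
    mu = X.mean(axis=0)                        # per-feature batch mean
    var = X.var(axis=0)                        # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)      # normalize to roughly zero mean, unit variance
    Z = gamma * X_hat + beta                   # learned re-scaling and shifting
    return Z, mu, var                          # mu and var also feed the running averages for test time

X = np.random.randn(32, 100) * 3.0 + 5.0       # a mini-batch with the "wrong" scale
Z, mu, var = batch_norm_train(X, gamma=np.ones(100), beta=np.zeros(100))
print(Z.mean(), Z.std())                       # ~0 and ~1
```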
Optimization
Traditional machine learning methods use vanilla gradient descent (GD) to minimize the loss function. GD/SGD rolls a ball downhill; the speed of the ball is proportional to the slope, so the ball rolls slower and slower as it approaches the bottom. GD works well for simple functions, but not for the $J$ of deep learning.
A DNN contains tens of thousands of parameters. For a three-layer DNN, say the input feature vector contains 10 elements, we have 100 neurons each in the first and second hidden layers, and then an output layer of one neuron. The first hidden layer contains $10 \times 100 + 100 = 1100$ parameters, where 1000 are weight parameters associated with the connections between the 10 inputs and the 100 neurons, and 100 are bias parameters, one per neuron. The second hidden layer contains $100 \times 100 + 100 = 10100$ parameters; the last output layer contains $100 + 1 = 101$ parameters. So there are 11301 parameters in total. DNN systems can easily contain millions of parameters. Some new challenges arise when our parameter space is so vast.
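The same count can be verified with a short Keras sketch (assuming the TensorFlow backend):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(100, activation='relu', input_shape=(10,)),  # 10*100 + 100 = 1,100 parameters
    keras.layers.Dense(100, activation='relu'),                     # 100*100 + 100 = 10,100 parameters
    keras.layers.Dense(1),                                          # 100 + 1 = 101 parameters
])
model.summary()                                                     # Total params: 11,301
```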
First, we need to calculate gradients from exact gradient formulas instead of estimating them numerically. To estimate gradients numerically, we would need to change one parameter $\theta_i$ slightly while keeping the remaining 11300 parameters fixed, then carry out a forward propagation with the training batch in order to compute a finite difference for $\frac{\partial J}{\partial \theta_i}$; repeating this for all 11301 parameters would be far too expensive. Instead, we need all 11301 partial derivative formulas available in some form, so that we can plug in values and obtain the 11301 gradient values. We skip the details (read Appendix D: Autodiff in [12] to appreciate how it works), but this requirement implies that all deep learning packages have to represent the DNN as a graph, so that the package knows how all parameters contribute to the loss function and can compute derivatives exactly. Programming with deep learning modules is therefore drastically different from traditional machine learning programming, as the statements are not simply executed sequentially but are used to build a computation graph. (By the way, how do we compute the derivative of the ReLU function, given that it is not differentiable at $x = 0$? Do not let this bother you: any value between the two one-sided derivatives, 0 and 1, works in practice; one convention is their average, 0.5, although libraries commonly just use 0.) If one programs with lower-level libraries such as TensorFlow, the logic can be scattered everywhere and the code is hard to read. Higher-level libraries such as Keras wrap much of that logic, so the code looks a lot more like a traditional procedural programming workflow. Therefore, I recommend coding with Keras.
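As an illustration of exact, graph-based gradients (a sketch assuming TensorFlow 2.x, where the forward pass is recorded with `tf.GradientTape`; older TensorFlow builds the same computation graph explicitly):

```python
import tensorflow as tf

w = tf.Variable(2.0)
x = tf.constant(3.0)

with tf.GradientTape() as tape:
    z = tf.nn.relu(w * x)        # forward pass is recorded as a computation graph
    loss = (z - 1.0) ** 2        # a toy loss J(w, x)

grad = tape.gradient(loss, w)    # exact dJ/dw from the graph, no numerical finite differences
print(grad.numpy())              # 2 * (6 - 1) * 3 = 30.0
```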
Second, there are millions of global minima. All $m$ neurons within a layer are architecturally equivalent, so if we find one global minimum solution, we can permute the order of the neurons in $m!$ ways and still get an equivalent solution. With $n$ layers, we have $(m!)^n$ equivalent solutions. This does not mean we can easily find one of these global minima; in fact, we very likely end up in a local minimum as our final solution, as there are even more local minima, and finding a global minimum among many local minima is a very hard problem. But do not let this bother you too much. Say each of our brains is a giant neural network: our weight parameters are drastically different from each other's, and the weights change every second as new connections are formed, old ones are destroyed, and existing ones are strengthened or weakened. All these different DNN solutions have no trouble guiding us to walk, drive, speak, and listen effectively, and the me of today is still largely the me of yesterday. None of us has the best DNN weights for walking; yes, some might walk slightly better than others, but nearly all of us are perfect walkers for all practical purposes. Even if we cannot be Einstein, it does not mean we cannot be good physicists. Our goal in a DNN application is to create a tool that performs well; the difference between performing well and being the absolute best may not be that big of a deal.
Third, it is actually hard to find even a local minimum. A local minimum requires $\frac{\partial J}{\partial \theta} = 0$ for every parameter $\theta$. However, when a partial derivative is zero, the function along that parameter dimension could be at a maximum, a minimum, or a saddle point. With tens of thousands of parameters, when we first find a point where the gradient vanishes, it is almost always the case that $J$ sits at a saddle point along some dimensions, i.e., we deal with saddle points rather than minima most of the time. The biggest challenge for DNN optimization is to move past these saddle directions. Under the vanilla GD strategy, the ball moves slower and slower as it approaches a saddle point, so vanilla GD does not work well. We need the ball to behave like a ball in the physical world, carrying some momentum so it can roll over a saddle point and continue its downhill journey. This can also help the ball climb over a small bump and potentially land in a much better local minimum. This observation has led to momentum optimization and other strategies; here we only explain a popular one called Adam optimization.
Adam optimization is based on two ideas. The first is that the speed of the ball should not be determined entirely by the current slope; it should largely inherit its previous velocity and modify it (the momentum optimization just described). The second is that vanilla GD uses a constant learning rate $\eta$ for all parameters, which should ideally be parameter-specific: learning should happen faster along low-gradient directions, as they are likely near saddle points, and slower along high-gradient directions (this is typically explained with the elongated bowl example, where the large-gradient direction does not point toward the minimum and mostly produces oscillation [13]); therefore, we want to adjust $\eta$ to reduce the impact of large gradients (this idea originated with AdaGrad optimization and was later improved by RMSProp optimization). The Adam algorithm, combining research results in both areas, is outlined by the following three update rules:
$$ \begin{equation} \begin{aligned}
\mathbf{m}& \leftarrow \beta_1 \mathbf{m} + (1-\beta_1) \nabla_\theta J(\theta), \\
\mathbf{s}& \leftarrow\beta_2 \mathbf{s} + (1-\beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta), \\ \theta & \leftarrow \theta - \eta \mathbf{m} \oslash \sqrt{ \mathbf{s} }. \end{aligned} \end{equation}$$
Here, $\beta_1$ and $\beta_2$ are decay rates controlling how heavily information from past steps affects the current movement, and $\eta$ is a learning rate parameter; a smaller $\beta$ means less influence from history. $\mathbf{m}$ is a history-averaged gradient, where gradients with consistent directions build up and gradients that frequently change direction cancel out (the momentum optimization idea); this part is easy to appreciate. $\mathbf{s}$ is the history-averaged squared gradient, which ignores the gradient's direction. The magnitude of this smoothed-out squared gradient is then used to modulate $\eta$ for each individual parameter direction; it encourages a large step when $\mathbf{s}$ is small. The $\otimes$ and $\oslash$ operators just mean the calculations are element-wise, not matrix/tensor operations.
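A numpy sketch of the update rules in Equation 10 on a toy quadratic loss (a small $\epsilon$ is added inside the denominator for numerical stability; the bias-correction terms of the full Adam algorithm are omitted, as in the equations above):

```python
import numpy as np

def adam_step(theta, grad, m, s, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Equation 10."""
    m = beta1 * m + (1.0 - beta1) * grad            # history-averaged gradient (momentum)
    s = beta2 * s + (1.0 - beta2) * grad * grad     # history-averaged squared gradient
    theta = theta - eta * m / (np.sqrt(s) + eps)    # per-parameter adaptive step
    return theta, m, s

target = np.array([1.0, -2.0, 0.5])
theta, m, s = np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(200):
    grad = 2.0 * (theta - target)                   # gradient of the toy loss ||theta - target||^2
    theta, m, s = adam_step(theta, grad, m, s)
print(theta)                                        # close to [1, -2, 0.5]
```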
I actually have not fully understood the part where $\mathbf{s}$ is used to modulate $\mathbf{m}$. Consider the special case where $\nabla_\theta J(\theta)$ is a constant; $\mathbf{m}$ and $\mathbf{s}$ will then converge to stable values:
$$ \begin{equation} \begin{aligned}
\mathbf{m}& = {\nabla_\theta J(\theta)} \\
\mathbf{s}&= \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
\theta & \leftarrow \theta - \eta \, {\rm sign}(\nabla_\theta J(\theta)). \end{aligned} \end{equation}$$
So the update step retains only the sign of the gradient and none of its magnitude, which does not seem to make sense. Probably this special case does not happen often and can therefore be ignored. An interesting explanation is based on the frequency with which a parameter receives gradient feedback [14]. If many training data points are insensitive to a particular parameter, i.e., its gradient is about zero most of the time, but once in a while it receives some feedback from some interesting training samples, then the history-averaged $\mathbf{s}$ is small; without rescaling we would miss the golden opportunity to adjust the parameter, so we should increase the learning rate in order to seize such a rarely-occurring training opportunity. One could argue that momentum averaging also hurts the infrequent parameters, but maybe that part can be alleviated by adjusting $\beta_1$. Adam performs larger updates for infrequent parameters and smaller updates for frequent parameters. Certainly this is a double-edged sword, as it magnifies noise as well. In practice, Adam is currently the most popular optimization algorithm; it is considered capable of automatically adjusting the learning rate instead of relying on some unknown learning rate scheduling protocol. I guess we might just have to accept that Adam optimization somehow works in practice, even if we cannot fully explain its magic at this point. This also shows the current status of the deep learning field, which is still far from maturing into a science.
Dropout
We cannot run machine learning optimization without regularization; otherwise, we inevitably run into overfitting. The more complex a DNN is, the more likely it is to show unrealistically good performance on a limited training data set. Our previous weapon for regularization was to include extra terms, such as lasso or ridge penalties, in the loss function in order to prevent the network parameters $\theta$ from taking huge values. However, a new strategy called dropout has been discovered for DNNs and has proven to be extremely effective against overfitting.
When dropout is applied to a layer with, for example, a rate of 0.4, the output of each neuron on that layer has a 40% chance of being zeroed out. So at each training step, on average, only 60% of the layer's output is piped into the next layer. Since the dropout pattern is random during the learning process, each neuron receives gradient updates about 60% of the time. The benefits of dropout can be explained in two ways.
First, since each parent neuron can be dropped, no single neuron can be critical to the network. This encourages some feature redundancy. For example, we can recognize a human face based on the eyes or the nose alone, or on both in combination. If the network already contains an eye feature, there is no motivation for it to discover a nose feature, as the eye feature already does a pretty good job. With dropout, the eye feature might be removed, so discovering an additional nose feature becomes advantageous. Dropout therefore encourages more independent features, so that a subset of them can be dropped without affecting the accuracy of face recognition. If the network has already discovered both nose and eye features, dropout can also help tune the activation parameters, so that a face neuron can be correctly fired by the nose or the eye alone, without requiring both to be present in a future test set. The final network, after being tortured by randomly losing an eye neuron or a nose neuron, is more robust when dealing with photos where part of the face is covered or contaminated by noise (Figure 8).
Second, dropout might effectively create an ensemble machine learning solution. The random combination of neurons is roughly equivalent to a linear combination of many trimmed-down DNN models, each containing a random 60% subset of the original DNN (Figure 9). Ensemble models have a theoretical advantage over a single model in that they reduce the variance of predictions, i.e., the dropout-applied model is more generalizable. Bear in mind that deep learning at this stage is more like a technology than a science, so many statements presented here are other people's or my own interpretations to help us accept why certain techniques seem to work better than others; they might not turn out to be exactly correct in the future.
Although dropout and regularization can be applied simultaneously, dropout alone often seems sufficient. One can tell whether further regularization is needed by checking whether the training error and the validation error are close. Dropout does not need to be applied to every layer, and the dropout rate should not be too high, as that might be too conservative and lead to underfitting. Another point to be aware of is that dropout should be disabled at test (prediction) time, as we want all nodes to contribute; otherwise, predictions would not be deterministic. In our example, the expected input when all neurons contribute at test time is about $\frac{1}{0.6} = 1.67$ fold larger, so at test time the outputs from each neuron should be scaled down by the factor 0.6. Here we use a 40% dropout rate as an example; it certainly is a hyperparameter subject to optimization in applications.
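A minimal Keras sketch with hypothetical layer sizes; note that Keras implements "inverted" dropout, scaling the surviving activations up by $1/(1-0.4)$ during training, so no rescaling is needed at prediction time and dropout is automatically disabled in `model.predict()`.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    keras.layers.Dropout(0.4),                    # 40% of this layer's outputs are zeroed, training only
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.4),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```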
Hyperparameter Optimization
There are still many unanswered questions about the DNN architecture, including but not limited to how many layers to use, how many neurons per layer, which non-linear function to apply, which parameter initialization strategy to adopt, which optimization strategy is the most effective, etc. All of these are called hyperparameters. We learned previously that to answer these questions, one would explore all combinations in the hyperparameter space and pick the one that yields the best performance based on the training-testing cross-validation strategy described here. In practice, it is not realistic to explore this huge hyperparameter space due to the computational burden, so many heuristic exploration strategies appear in applications. We often need to rely on gut feeling and assume certain parameters should be fixed, so that we can reduce the number and the range of the hyperparameters to tune. Typically it is considered more advantageous to adopt a random sampling approach than a grid search strategy when exploring the hyperparameter space [15] (Figure 10). This is because, for each individual hyperparameter dimension, each random sample is likely to pick a different value, compared to the large redundancy embedded in the grid search strategy. Conceptually, however, a seemingly beautiful approach is to rely on a Gaussian Process (GP). The GP makes a smooth Bayesian interpolation of the performance surface formed by the hyperparameters, based on the sample runs already available; it then recommends a new sample point in the hyperparameter space that is expected to produce the best gain in performance. GP might be a topic for a separate blog.
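A hedged sketch of random hyperparameter search; `build_and_evaluate` is a hypothetical function that trains a DNN with the given hyperparameters and returns its validation score.

```python
import random

def random_search(build_and_evaluate, n_trials=20):
    """Randomly sample hyperparameter combinations and keep the best-scoring one."""
    best_score, best_params = float('-inf'), None
    for _ in range(n_trials):
        params = {
            'n_layers':      random.choice([1, 2, 3, 4]),
            'n_neurons':     random.choice([32, 64, 128, 256]),
            'learning_rate': 10 ** random.uniform(-4, -2),   # sample on a log scale
            'dropout_rate':  random.uniform(0.0, 0.5),
        }
        score = build_and_evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```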
A Black-Box DNN Solution
Our introduction to DNN here is brief; there are many training materials now available that cover fully-connected DNNs in great detail. The focus of this blog is to explain the theoretical advantages of DNN and the key components of its implementation. We can now mentally put these pieces together to form a general-purpose black-box DNN solution, which will serve as a generic machine for any typical regression/classification problem.
The thought exercise is to design a fully-connected DNN (all the NNs we have described so far are fully connected, i.e., all the neurons in the previous layer are connected to all neurons in the current layer, in contrast to the convolutional NN to be discussed in the future) consisting of $n$ layers, each containing the same number $m$ of neurons. The last layer is either a linear layer without non-linear transformation (if the problem domain is regression) or a softmax layer (if the problem domain is classification). The corresponding loss function is the cross-entropy function for classification problems and the least-squares error function for regression problems. For each given set of hyperparameters (i.e., a certain DNN structure), we split the input training samples into small batches and send them to the DNN batch by batch. For each batch, we calculate the loss function and its gradient with respect to each network parameter $\theta$, and update the parameters based on the gradient information to hopefully minimize the loss function. A complete iteration over all training samples is called an epoch. At the end of each epoch, we evaluate the validation error on a held-out data set. After multiple epochs, the process is stopped when the validation error starts to rise. As resources permit, DNNs for other hyperparameter settings (possibly suggested by a GP process) are trained as well. Then the DNN with the best expected performance is chosen. A very brief outline of the code is shown here.
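Below is a minimal Keras sketch of such a black-box DNN, assuming a binary-classification problem; the layer sizes, batch size, and randomly generated stand-in data are illustrative placeholders only.

```python
import numpy as np
from tensorflow import keras

def build_dnn(n_inputs, n_layers=3, n_neurons=100):
    """A fully-connected DNN: n_layers hidden layers of n_neurons each, sigmoid output."""
    layers = [keras.layers.Dense(n_neurons, activation='relu',
                                 kernel_initializer='he_uniform', input_shape=(n_inputs,))]
    layers += [keras.layers.Dense(n_neurons, activation='relu', kernel_initializer='he_uniform')
               for _ in range(n_layers - 1)]
    layers.append(keras.layers.Dense(1, activation='sigmoid'))   # softmax / linear for multiclass / regression
    model = keras.Sequential(layers)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Stand-in data purely for illustration.
X_train, y_train = np.random.randn(1000, 10), np.random.randint(0, 2, 1000)
X_valid, y_valid = np.random.randn(200, 10), np.random.randint(0, 2, 200)

model = build_dnn(n_inputs=X_train.shape[1])
model.fit(X_train, y_train,
          batch_size=32, epochs=100,                              # mini-batches and epochs as described
          validation_data=(X_valid, y_valid),
          callbacks=[keras.callbacks.EarlyStopping(patience=5,    # stop when validation error stops improving
                                                   restore_best_weights=True)])
```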
Compared to linear regression or logistic regression methods, this DNN solution has the conceptual advantage of first transforming the input features non-linearly into a set of new features, i.e., it can learn new features automatically. In this regard, traditional ML methods are just special cases of DNN that use the input features with an identity transformation. Therefore, we generally expect the performance of DNN to be better than, or at least as good as, traditional machine learning methods, as long as the DNN is appropriately constructed (i.e., it can capture the true transformation function $\phi$). This is confirmed by many published deep learning applications.
Let us look at a bioinformatics application next, where DNN is used to regress gene expression levels (next blog).
References
- Chollet. Deep Learning with Python, page 4.
- https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
- Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50(5): 742-754.
- https://en.wikipedia.org/wiki/Position_weight_matrix
- http://neuralnetworksanddeeplearning.com/chap4.html
- Chollet. Deep Learning with Python, page 8.
- https://en.wikipedia.org/wiki/Go_(game)
- https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
- http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
- https://arxiv.org/pdf/1512.03385v1.pdf
- https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
- Aurelien Geron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. Appendix D, pages 511-518.
- Aurelien Geron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. Pages 298-299.
- Ruder, https://arxiv.org/pdf/1609.04747
- http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf