Wednesday, February 28, 2018

Motif Discovery for DNA- and RNA-binding Proteins by Deep Learning

Bioinformatics Application of Deep Learning Technology


Resource


We will discuss the following paper: Alipanahi et al., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.  The paper is available here; the supplementary methods are available here (subscription required).

All figures shown here should be credited to the paper, unless the source is explicitly provided.

Introduction


Genes are encoded by DNA sequences.  Whether a gene is expressed, and how much, can be regulated by proteins called transcription factors (TF).  A TF protein can bind to certain DNA motifs and then recruit other factors to initiate the transcription of a piece of DNA sequence into an mRNA sequence.  In eukaryotic cells, mRNA consists of multiple exons separated by introns.  Introns are spliced out and exons are concatenated into a mature mRNA, which is then translated into a protein.  RNA-binding proteins can regulate the splicing process and lead to the inclusion or exclusion of certain exons, therefore producing different splicing isoforms and eventually different protein products with different functions.  Mutations in the genomic sequence can perturb the binding of these proteins, leading to undesirable under/over-transcription of genes or undesirable splicing forms, which can induce disease.  Therefore, it is of interest to understand what nucleotide patterns (motifs) a DNA/RNA-binding protein can recognize and how mutations impact such molecular recognition.

Let us describe two popular experimental approaches for identifying sequences enriched for protein binding sites.  First, protein binding microarrays (PBM).  The microarray contains many different short nucleotide sequences, called probes, at different array locations.  For example, an array can contain probes representing many variants of 10-nucleotide patterns (10-mers); even a given 10-mer can be covered by many probes, where the same 10-mer is embedded within different sequence contexts of the longer probe sequence.  A TF protein is hybridized over the array; it sticks to the locations where binding occurs and produces fluorescent readouts.  Therefore, PBM reports probe sequences together with intensity values that indicate the strength of binding to a specific TF.  A deep learning (DL) system aims to take these probe sequences as input $X$ and predict the binding score $y$ (i.e., intensity).  The natural choice of loss function would be least-squares error, as this is a regression task.  Second, ChIP-seq technology.  When a TF is cross-linked to genomic material, the DNA is chopped up and all fragments bound to the TF are extracted through the chromatin immunoprecipitation (ChIP) process; the bound DNA/RNA fragments are then NGS-sequenced to produce the readout (NGS stands for next-generation sequencing).  ChIP-seq data only provide positive sequence fragments, so one needs to add negative sequences (either taken from the genome background, or, as the authors did here, generated by permuting positive sequences); a DL system then aims to classify an input sequence $X$ into a binary $y$ of 1/0, i.e., bind or not bind.  Since this is a classification problem, the cross-entropy loss function is a natural choice.  The paper also studied datasets generated using the HT-SELEX technology, where the data format is similar to that of ChIP-seq.

The motif discovery problem has been a well-studied bioinformatics problem for the past twenty years (I also contributed an algorithm called GEMS).  Positive sequences are collected, and algorithms look for patterns that are over-represented among them compared to a pool of negative background sequences.  Alipanahi et al. used a DL approach, more specifically a CNN + FNN architecture, that achieved better performance than existing tools.  It is a very clean application of CNN to bioinformatics.


Results


Architecture

In the diagram below, the system processes five input sequences in parallel.  This is a technical detail intended to make full use of the GPU; for our purpose, we only need to focus on one sequence.  The sequence is first cast into two sequences, the original one and its reverse complement (e.g., ATGG is also cast into CCAT).  This is because a TF could theoretically recognize either DNA strand, or both in rare cases (Figure 4.e in the paper is such a rare case).  A motif is a sequence pattern represented by a 1-D CNN convolution filter.  More precisely, the input sequence is a 4-channel 1-D vector, where the four channels encode the bases A, C, G, T, and the vector one-hot encodes the input sequence with 1/0 values.  For example, a short sequence "ATGG" can be represented as:


$$S_{ATGG} = \begin{bmatrix} A & C & G & T\\
0.25 & 0.25 & 0.25 & 0.25\\
0.25 & 0.25 & 0.25 & 0.25\\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0.25 & 0.25 & 0.25 & 0.25\\
0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}.$$

The rows containing 0.25 are padding (representing "any" nucleotide) used to extend the original sequence, so that the convolution will not miss motifs partially overlapping with the ends.  If the CNN is designed to capture motifs of length $m$ (it will capture motifs of length $\lt m$ as well), there will be $m-1$ extra 0.25 rows on both ends.  A motif is represented as a convolution filter $M$, known in bioinformatics as a position weight matrix (PWM); it basically specifies the weighting of each nucleotide at each location.  The output of the convolution layer is:

$$\begin{equation} \begin{aligned}
 X_{i,k} &= \sum_{j=1}^{m} \sum_{l=1}^{4} S_{i+j,l}M_{k,j,l}, \\
Y_{i,k} &= \max(0, X_{i,k}-b_k),
\end{aligned} \end{equation}$$

where $k$ is the index of a particular motif (a motif here is called a feature in CNN), $i$ is a location on the input sequence, and $m$ is the length of the motif or the convolution window; the convolution is therefore taken over all positions $j$ and over all four nucleotide channels $l$.  The sum is then passed through a ReLU function, as shown in the second step (weights can be negative, so neurons with a convolution sum less than $b_k$ will not be activated).  The result is a feature map representing where a given motif is detected along the input.  A pooling layer follows.  The authors used a max-pooling layer for DNA-binding proteins and both max- and average-pooling for RNA-binding proteins; in the latter case, $k$ features result in $2k$ pooling maps.  The pooled output then serves as the input for a fully-connected neural network (FNN) to generate target scores, one per input sequence.
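To make this concrete, here is a minimal sketch of a DeepBind-style model in PyTorch (my own illustration, not the authors' code).  The motif length, number of motifs and hidden size are arbitrary placeholder values, and the reverse-complement strand and the 0.25 padding used in the paper are simplified away (ordinary zero padding is used instead).

```python
import torch
import torch.nn as nn

class DeepBindSketch(nn.Module):
    """Minimal DeepBind-style model: 1-D convolution (motif scan) +
    ReLU + global max pooling + small fully connected network."""
    def __init__(self, num_motifs=16, motif_len=16, hidden=32):
        super().__init__()
        # input: (batch, 4 channels = A/C/G/T, sequence length);
        # pad by motif_len - 1 so motifs overlapping the ends are not missed
        self.conv = nn.Conv1d(4, num_motifs, kernel_size=motif_len,
                              padding=motif_len - 1)
        self.fc = nn.Sequential(nn.Linear(num_motifs, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, x):
        y = torch.relu(self.conv(x))   # one feature map per motif
        y, _ = y.max(dim=2)            # max pooling over sequence positions
        return self.fc(y)              # binding score (regression) or logit

def one_hot(seq):
    """One-hot encode a DNA string into a (4, len) tensor, channel order A,C,G,T."""
    idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[idx[base], i] = 1.0
    return x

model = DeepBindSketch()
score = model(one_hot("ATGGCATGCATGCTAGCTAGCTAA").unsqueeze(0))  # add batch dim
print(score.shape)  # torch.Size([1, 1])
```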

This architecture is very simple, as it contains only one CNN convolution layer and one pooling layer.  CNN is used here because the motif can be located at arbitrary positions in the input sequence; just like a dog can appear at any location in an image, a CNN, being translation-invariant, is a natural option for this problem.




Performance



The CNN is then trained with a set of experimental data (either PBM, ChIP-seq, or HT-SELEX) and its performance was shown to be consistently superior to established bioinformatics motif discovery tools.  Without getting into the details of the biological applications, the figure below shows that their CNN solution, named DeepBind, was able to outperform other methods on PBM, ChIP-seq, and SELEX datasets according to various performance metrics (AUC or Pearson correlation scores).



Why Is CNN Better?


Unfortunately, the paper did not discuss why DeepBind performed better.  Although it shows many examples of how the identified motifs make biological sense, that is not unique, as traditional motif discovery tools can produce those results as well (the authors did not say traditional tools failed at this).  It was only mentioned in the beginning of the paper:

"The sequence specificities of a protein are most commonly characterized using position weight matrices (PWMs), which are easy to interpret and can be scanned over a genomic sequence to detect potential binding sites. However, growing evidence indicates that sequence specificities can be more accurately captured by more complex techniques".

This hints that the motivation of their study was to describe motifs with a language more sophisticated than the PWM.  However, motifs in a CNN are described by filters, which are basically the same as PWMs, so what is the strength of this approach?

I think what probably happened is that DeepBind contains $k$ motif sub-elements.  In previous methods, only one PWM is used to explain the biological measurements (for that specific transcription factor or RNA-binding protein).  In DeepBind, however, multiple elements are combined through an FNN for activity prediction.  Although each motif element is equivalent to a PWM, their combination forms a non-linear motif descriptor that cannot be captured by a single PWM.  Hypothetically, imagine there are two motif elements in a real system: AGGGT and TGGGA.  That is, with "GGG" in the middle, when "A" appears on the left, "T" is favored on the right; similarly, a left "T" pairs with a right "A".  However, AGGGA and TGGGT are not valid patterns.  If we only allow this to be represented by one PWM, the result is an "averaged" motif that looks like "A/T-GGG-A/T", which would encode wrong patterns such as AGGGA and TGGGT.  We therefore hypothesize that the CNN system provides a way to capture motifs in a non-linear manner, more powerful than a PWM.
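A tiny toy calculation (my own example, not from the paper) illustrates the point: build a single PWM from the two valid patterns AGGGT and TGGGA, and it scores the invalid AGGGA and TGGGT just as highly.

```python
import numpy as np

BASES = "ACGT"

def pwm_from(seqs):
    """Build a position frequency matrix (rows = positions, cols = A,C,G,T)."""
    counts = np.zeros((len(seqs[0]), 4))
    for s in seqs:
        for i, b in enumerate(s):
            counts[i, BASES.index(b)] += 1
    return counts / len(seqs)

def score(pwm, s):
    """Sum of per-position frequencies (a simple PWM match score)."""
    return sum(pwm[i, BASES.index(b)] for i, b in enumerate(s))

pwm = pwm_from(["AGGGT", "TGGGA"])     # the two real motif elements
for s in ["AGGGT", "TGGGA", "AGGGA", "TGGGT"]:
    print(s, score(pwm, s))
# All four sequences score 4.0: the averaged PWM cannot separate the valid
# patterns from the invalid AGGGA/TGGGT, whereas two separate filters
# combined non-linearly could.
```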


Another possible reason is that an NN allows a motif to consist of widely spaced sub-elements, which are very hard to discover with traditional algorithms due to the vast search space.  Such widely spaced motifs are particularly important for RNA-binding proteins, as they bind to RNA structures in a non-trivial manner.  As shown below, unlike DNA-binding proteins, RNA-binding proteins may not recognize only one binding site, but may actually recognize multiple binding sites on the surface of the RNA.  Unlike double-stranded DNA fragments (the last structure in the figure below), single-stranded RNA can form complex secondary structures; oftentimes the RNA structure contributes to the binding no less than the specific nucleotide bases.  The authors commented: "RNA recognition motif, the most abundant RNA binding domain, is highly flexible, and usually the assembly of multiple domains are needed for proper binding. Third, “indirect readout” in which the structure of nucleotides guide the binding process, is more prominent in RBPs; Gupta and Gribskov analyzed 211 protein-RNA structural complexes and reported that nearly three-quarters of protein-RNA interactions involve the RNA backbone, not just the bases."  An RNA-binding protein could recognize multiple binding sites on the contact surface; this is therefore better modeled by multiple motif features in DeepBind, combined through the FNN to predict the binding affinity.  Traditional tools identify motifs as one continuous binding element, without synergistic combinations of underlying elements.  I suspect this might be the main reason DeepBind performs well, at least for the more difficult RNA-binding proteins.
Protein-binding interface on RNA (DNA, the last example) (source).  Purple: binding sites, yellow: non-binding sites.

Here we see only one convolution layer, instead of multiple repeats of convolution plus pooling layers.  Hierarchical motif patterns may not be necessary for motif discovery problems: although a short motif element, say "ATC", might appear in multiple longer motifs, the frequencies of A, T, and C in the longer motifs might be very different from their frequencies in the short element.  So unlike eyes, noses, and mouths making up faces, short motif PWMs may not be effectively concatenated to form larger PWMs.


How to Visualize CNN?


A PWM generated by traditional bioinformatics tools can easily be visualized as a motif logo, where the height of a character encodes the frequency or information content of a base in the PWM.  The CNN here contains multiple motifs and combines them in a non-linear way; therefore, it is conceptually incorrect to visualize each convolution filter alone.  How to visualize DeepBind is a rather challenging topic, which in my view the authors did not address.  Instead they adopted a rather naive approach -- basically visualizing individual features.  For each sequence in the test set, if the sequence produces a positive output signal on a given motif feature map, the corresponding input sequence fragment (the one giving the maximum activation) is retained.  All such CNN-activating fragments are then aligned and base counts are compiled to construct a PWM, from which a motif logo is generated.  Some example logos are shown in the figure below, where the DeepBind model indeed reproduces many known motifs.  As we mentioned, the downside of this approach is that it does not visualize the non-linearity introduced by the FNN component as we hypothesized; why DeepBind performs better is really not answered by this simplified representation.
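Here is a minimal sketch of this visualization procedure, assuming we already have one trained convolution filter given as an $m \times 4$ weight matrix; the function names and the simple match score are my own placeholders, not the authors' code.

```python
import numpy as np

BASES = "ACGT"

def onehot(seq):
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, BASES.index(b)] = 1.0
    return x

def logo_counts(test_seqs, filt, bias=0.0):
    """For each test sequence, keep the subsequence giving the maximum
    (positive) activation of one convolution filter, then compile base
    counts into a PWM-like matrix suitable for drawing a logo."""
    m = len(filt)
    counts = np.zeros((m, 4))
    for seq in test_seqs:
        x = onehot(seq)
        acts = [np.sum(x[i:i+m] * filt) - bias for i in range(len(seq) - m + 1)]
        best = int(np.argmax(acts))
        if acts[best] > 0:                      # only activating sequences contribute
            counts += onehot(seq[best:best+m])
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

filt = np.array([[2, -1, -1, -1],      # toy filter preferring A-G-T
                 [-1, -1, 2, -1],
                 [-1, -1, -1, 2]], dtype=float)
print(logo_counts(["TTAGTCC", "CAGTAAA", "CCCCCCC"], filt))
```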

Examples of known motifs discovered by DeepBind.

Mutation Map


An alternative visualization that is conceptually more powerful is called a mutation map.  Given an input sequence $s$, DeepBind predicts a binding score $p(s)$.  We can then mutate the nucleotide at each position into a different base, resulting in a new sequence $\hat{s}$, and predict a new binding score $p(\hat{s})$.  The importance of a particular single-base mutation can therefore be quantified by:

$$\begin{equation} \Delta S_{ij} = p(\hat{s})-p(s). \end{equation}$$

The nice thing is that the mutation score is based on the whole CNN plus FNN system and therefore includes all the non-linear effects.  However, I guess the above equation did not work well as it is, since the predicted scores do not necessarily represent changes in units of binding energy; the scores and the binding energy might be related through some non-linear function.  The authors empirically used the following formula instead, where the effect of a mutation is magnified by the binding strength at the mutated position:

$$\begin{equation} \Delta S_{ij} = (p(\hat{s})-p(s)) \cdot \max(0, p(\hat{s}), p(s)). \end{equation}$$
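The mutation map is easy to sketch in code.  Here `predict` is a placeholder standing in for the trained CNN+FNN model; everything else follows the formula above.

```python
import numpy as np

BASES = "ACGT"

def mutation_map(seq, predict):
    """DeepBind-style mutation map for one sequence.  `predict` is any
    function mapping a DNA string to a binding score (here it stands in
    for the trained CNN+FNN model)."""
    p0 = predict(seq)
    delta = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            if b == seq[i]:
                continue
            mutant = seq[:i] + b + seq[i+1:]
            p1 = predict(mutant)
            # score difference magnified by the stronger of the two scores
            delta[i, j] = (p1 - p0) * max(0.0, p1, p0)
    return delta

# toy example with a dummy scorer that just counts G and C bases
print(mutation_map("ATGGC", lambda s: s.count("G") + s.count("C")))
```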

Two examples of mutation maps are shown below.  On the left is the promoter region of the LDL-R gene; LDL-R is the LDL receptor gene, and LDL is known as the "bad cholesterol".  The wild-type genome contains base C in the middle of the promoter, which can be recognized by the SP1 protein; SP1 then drives the expression of LDL-R, which helps recognize excess levels of LDL and regulates its removal from the blood.  In patients with this base mutated to G, the binding score drops sharply (see the red dot at the bottom), as the mutant version is no longer recognized by SP1.  This leads to the deleterious condition known as hypercholesterolemia (high level of cholesterol) due to the low expression of LDL-R.  On the right, a mutation in the globin gene, from A to G, creates a new binding site TGATAA attracting a TF protein called GATA1.  The binding of GATA1 blocks the genomic region originally reserved to serve as the promoter of globin.  As a result, globin is no longer expressed at a sufficient level, which leads to a disease called $\alpha$ thalassemia (an insufficient amount of hemoglobin to carry oxygen in the blood).



How Is DeepBind Optimized?


The DeepBind system contains many hyperparameters, including motif length $m$, number of motifs $k$, learning rate, momentum, batch size, weight decay, dropout rate, etc.  The training process first randomly samples 30 sets of such hyperparameters (see figure below) and uses 3-fold cross validation to evaluate their performance.  The best-performing set of hyperparameters is then used to repeat the DeepBind training process on the whole training dataset three times; this is to avoid statistical fluctuations during the optimization process.  The best-performing DeepBind model is then chosen as the final model for the given protein and used to predict the test dataset and produce performance metrics.  One DeepBind model was trained per protein per experimental data type.  A code sketch of this calibration procedure follows below.
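A rough sketch of the calibration step (random hyperparameter sampling plus 3-fold cross validation); the search space and the `train_and_score` callback are hypothetical placeholders, not the authors' actual settings.

```python
import random
import numpy as np

# Hypothetical search space; the paper's exact ranges are in its supplement.
SPACE = {
    "motif_len":     [16, 24],
    "num_motifs":    [16],
    "learning_rate": lambda: 10 ** random.uniform(-4, -2),
    "dropout":       [0.0, 0.25, 0.5],
}

def sample_params():
    return {k: (v() if callable(v) else random.choice(v)) for k, v in SPACE.items()}

def random_search(train_and_score, data, n_candidates=30, n_folds=3):
    """Sample candidate hyperparameter sets, evaluate each by k-fold cross
    validation, and return the best set.  `train_and_score(params,
    train_idx, valid_idx, data)` is assumed to train a model and return a
    validation score (higher is better)."""
    n = len(data)
    folds = np.array_split(np.random.permutation(n), n_folds)
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        params = sample_params()
        scores = []
        for f in range(n_folds):
            valid_idx = folds[f]
            train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            scores.append(train_and_score(params, train_idx, valid_idx, data))
        if np.mean(scores) > best_score:
            best, best_score = params, np.mean(scores)
    return best
```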




Conclusion


CNN is a DL architecture suited to discovering translation-invariant patterns in its inputs.  The DeepBind system uses a CNN to capture protein-binding motifs on DNA/RNA, where the NN architecture enables it to capture some non-linear effects and/or aggregation effects of multiple underlying motif patterns, as well as to aggregate motifs that are widely spaced.  The study showed that DeepBind surpassed traditional bioinformatics motif discovery tools and was able to explain a number of disease-causing mutations.  Our visual estimate puts the improvement over traditional approaches at about a few percentage points, similar to what we generally see from DL applications in bioinformatics.  However, the architecture of DeepBind is a lot simpler than the many sophisticated motif discovery algorithms developed over the past two decades.  The authors did not offer explanations of why DeepBind performed better.






Friday, February 23, 2018

Convolutional Neural Network (CNN)

Disclaimer: most contents in this blog are my own speculations on how human vision works; I did not research whether they are supported by the literature.  I am enthusiastic about CNN; these speculations have greatly helped boost my own understanding of CNN, therefore I hope they could be useful to you as well, even if they might be nonsense.  Bear in mind that the purpose of this blog series is to examine the "why" aspect of deep learning.  Nowadays, you can certainly find many "how-to" tutorials on CNN elsewhere.


Introduction


Given a cat photo, we can recognize the cat absolutely effortlessly.  How does it work?

First, we do not think the brain relies on an object-model approach, that is, a cat model consisting of an assembly of geometric shapes representing eyes, ears, whiskers, tails, paws, etc.  In fact the object-model approach is intractable, as you would need a model for a sitting cat, for a sleeping cat, for a cat with paws hidden, ...; it just seems too complicated to be feasible [1].  In this approach, the brain would also need to contain models of millions of other objects, then apply all these models to the input image to determine which model is the closest match.  This modelling approach does not really explain how we see cats in the photos below, as they would not match any reasonable cat model well.

 Figure 1. Cat photos that do not match any cat models well.

Second, computer vision experts have spent decades constructing image features that are robust against translation, rotation and scaling [2], as our brains seem to have no problem recognizing an object under such transformations.  Do our brains use similar tricks to represent objects with some mathematically elegant transformations?  It does not seem so; at least, our vision system is not very capable of correcting rotation transformations.  The two faces below are $180^\circ$ rotations of each other.  If our brains transformed them into some rotation-invariant representation, we would expect to see the second face by staring at the first, but we do not.

Figure 2.  We cannot see the other face, if we look at one, as our brains do not transform the image into rotation-invariant representation (source).

Third, we do not seem to see details, but only concepts.  Look at the scenery photo below for two seconds, now cover it up.  I assume you can tell it is about a beach, and you probably also saw some other objects such as the mountain, ocean, bridge, sky.  Now, are you able to tell if there are many people on the beach?  Is there a lagoon on the left?  Are there wild flowers on the hill slope?  Are there people in the ocean?  Are there clouds in the sky?  All those pixels have been sent to our brains as input, yet we cannot answer these questions; we "looked at" these pixels but we did not "see" them.  This means the output of our vision system is probably a few abstract concepts, and lots of details are not really passed to the memory or reasoning regions of the brain.  This is great, as we immediately see a beach, without the need to prioritize roads, people, waves, lagoons, bridges and then follow some complicated decision-making rules to conclude that this is a beach.  We simply see the meaning of each image without reasoning.  Only in movies can an agent name everything in a room after a few seconds; photographic memory does not seem to be how normal people see things, otherwise the brain would be drained by too many objects and have a hard time capturing a key event.  This means that in computer vision, details are often not as important as you might have expected.

Figure 3.  Beautiful Torrey Pines beach at Del Mar, San Diego (source).

Computer vision is the field where deep learning really shines, as it provides a feasible approach to seeing objects in a way no previous machine learning technology was capable of.  Personally, I am convinced this might more or less be how our vision system actually works, as it could help explain many vision phenomena, including the examples mentioned above and some examples to be discussed at the end of this blog.  Here, we will look at how a specialized neural network (NN) called a convolutional neural network (CNN) works.


The Rationale for CNN



Vision recognition for ImageNet is basically a function $f(x)$, where $x$ is the input image and the output is the object id(s) representing the concepts in the image.  We mentioned in the previous blog that a fully-connected NN (FNN) can be used to model any function, such as the FNN in Figure 4 (left).  In this example, not all pixels in the input image are relevant, therefore the NN can be simplified into a localized NN (where weights for non-dog pixel inputs are zeroed out, Figure 4, right) with the dog at the center of its input.

Figure 4.  FNN for image object recognition is a localized NN, as only the pixels belonging to the object contribute to the recognition.

An image can contain multiple instances of dogs; to recognize all the instances, we only need to slide the localized NN over the input image, and wherever there is a dog, the NN produces a response signal.  This operation of sliding a function over an image to obtain a new response image (Figure 5) is known as convolution in computer vision.  For the ImageNet task, we only care about whether the image contains dogs, without needing to know how many there are or where they are.  Therefore, our NN has to be translation invariant, i.e., $f(x - \delta) = f(x)$.  This can easily be implemented by applying the $\max$ operator to the output convolved image.


Figure 5.  Applying the same localized NN over the input image allows it to recognize multiple instances, producing a signal at each matched location.


In traditional computer vision, we construct a template (a dog model) and slide that template over the input image; the output convolved image then shows bright spots where the template matches the input.  However, it is unrealistic to have one dog template (model) that can be used to recognize all dogs: not only do dogs look very different, but maybe only part of the body is exposed, or it is a highly distorted cartoon dog, etc.  A more robust way is to model a dog as a collection of loosely connected body parts.  Suppose we have a function $\mathbf{g}(x)$ with multiple dimensions aiming to recognize dog eyes, nose, ears, spot pattern, tail, respectively.  Each dimension is a convolution function as described above that produces an output image, which we call a feature map.  Each dimension still uses a convolution filter, which is basically a weight matrix describing what pattern this feature is looking for, and the intensity of the output feature map represents the likelihood (not strictly in the probability sense, but monotonically reflecting the likelihood) of the presence of that particular body part at the corresponding location.  We then assign weights to each dimension and produce a final output indicating whether the dog is present, i.e., $f(x) = \mathbf{w}(\delta)\cdot \mathbf{g}(x-\delta)$, where $\delta$ accounts for the relative location shift among the signals from different feature maps.

What is important here is that this weighting (i.e., convolution) operation does not need to be applied to the raw image pixels; it can be applied to previous feature maps, whose pixels encode the presence of smaller body parts.  To understand this, say we aim to construct a feature representing a dog face, which consists of preliminary parts such as eyes, nose, and ears arranged in some loose geometric positions.  This means the feature maps of eyes, nose and ears have already been generated and become the input images for the detection of the dog-face feature; each input feature map (nose or eye) is a separate input image channel (imagine the raw image with its three color channels); they are convolved to produce a new dog-face feature map.  The convolution relies on a weight matrix, which can implement loose logic, such as the eyes being within one fuzzy region and the nose in another fuzzy location.  Such filters sitting on top of lower-level feature maps can therefore encode a very flexible face model, unlike traditional template-based matching (Figure 6).

Figure 6.  One convolution weight matrix (left) can encode very different faces.  For the weight matrix, red and orange indicates the possible locations of eyes, blue for nose and green for mouth.  This matrix can match the three very different faces on the right equally well.  This is not something the traditional template-based method is capable of.


Similarly, the feature maps of dog face, legs, and body will need to be further combined to produce an even higher-level feature map that, say, represents a sitting dog.  In the other direction of the feature hierarchy, the feature map of dog eyes can be decomposed into simpler imaging feature maps of circles, color patches, lines, curves, corners, etc.  Therefore, in the end we are looking at a multi-layer hierarchical convolutional neural network, where each layer relies on a small input image section (either the raw input image or the output feature maps of a previous layer), performs convolution, and produces an output feature map encoding whether the feature it is looking for is present at specific locations.  At the top layer, we form a collection of feature maps, each telling us whether certain high-level body parts are observed with high probability; the weighted collection of this information allows us to decide whether the object of interest is present.  This way, a dog can be detected even if a significant portion of its body parts is missing, or the relative geometry of its parts is highly distorted.

Another practical requirement for object recognition is the ability to scale down an image.  Although there are many differences in the details of the faces in Figure 7, those details are irrelevant for our purpose of recognizing them all as faces.  So if we scale the image down into a low-resolution representation, where the irrelevant details are smoothed out, it will be much easier to match the smaller image with a convolution filter encoding a face.  The process of taking a higher-resolution image and scaling it down into a lower-resolution one is called pixel pooling, i.e., pooling multiple pixels into one pixel.  Pooling offers a few advantages.  First, pooling gets rid of details, so that a face filter (weights) only requires two eyes on top, a nose below, and a mouth further below, but it does not mandate the exact look of the eyes, nose or mouth; there is a lot of flexibility in such filter elements.  Second, pooling brings the underlying part objects closer in space, so that the filter convolving these objects into a higher-level object can be a lot smaller.  Third, it is computationally much more efficient to compute and store, without compromising our goal of object recognition, as fewer pixels are involved.  The most often used pooling mode in CNN is max pooling, where the maximum value of the original pixels is retained as the pooling output.

Figure 7. Scaling down a complex image helps remove irrelevant details for a recognition task.

In the traditional template-matching strategy in image analysis, the recognition of individual parts and their relative geometric locations are simultaneously encoded within the template at the pixel level.  This seriously restricts the variation of the objects a template can represent.  In the CNN approach, each hierarchical feature allows room for variation within individual features (e.g., there can be multiple eye features of different styles) as well as for spatial variations among them, and the pooling provides even more room for such "disfiguring"; therefore, CNN no longer recognizes objects at the pixel level, but starts to incorporate tolerance and to represent objects somewhat at the conceptual level.  I think this might be the main reason behind the superb performance of CNN in real-life object recognition.

CNN for Face Recognition

With convolution and pooling layers as building elements, we are ready to perform what used to be the most difficult computer vision challenge, classify photos according to the object(s) it contains (the ImageNet challenge).  Let us look at how CNN can be applied to recognize a face.
Figure 8. Architecture of a CNN that recognizes a $16 \times 16$ image containing a face.

As shown in Figure 8, our input (A) is a black-and-white face of size $16 \times 16$.  The first hidden layer (B) is actually a group of layers, each of the same size $16 \times 16$, and each is the result of a $2 \times 2$ convolution across the input image (this means each layer is the heatmap resulting from convolving a particular neuron across the original image).  The weight matrix $w$ for one of the features (blue) reads (the matrix is shown at the lower-left corner of the feature maps):

$$w = \begin{bmatrix} 0 & 0 \\ 1 & 1 \end{bmatrix},$$

where 1 stands for black and 0 for white.  It basically detects a top-facing edge (a black object on a white background).  The resultant heatmap is depicted below the layers associated with group B in Figure 8.  The convolution sum is passed through a non-linear function, typically the ReLU function.  ReLU has no effect in our specific example, as we only consider positive weights, but weights can be negative in real applications, and there is a threshold parameter $b$ to suppress signals that are too weak (remember ReLU = $\max(0, signal)$).  Similarly we can build three other features to detect the bottom-facing (orange), right-facing (green) and left-facing (purple) edges; the resultant heatmaps are shown in Figure 8 as well.  There would be other feature maps in this group capturing other shapes; however, let us focus on these four feature maps in this particular example.
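Here is a minimal numpy sketch of this convolution-plus-ReLU step, using the top-edge filter above on a toy image with a black square (my own toy example, not from the text's figure).

```python
import numpy as np

def conv2d_relu(image, w, b=0.0):
    """Slide a small filter over a 2-D image (valid convolution, stride 1),
    then apply ReLU, producing a feature map ('heatmap')."""
    h, k = w.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = max(0.0, np.sum(image[i:i+h, j:j+k] * w) - b)
    return out

# a 6x6 toy image with a black (1) square on a white (0) background
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0
top_edge = np.array([[0, 0],
                     [1, 1]])   # responds to black pixels with white above
print(conv2d_relu(img, top_edge))
```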

The image is big, so let us downsample it.  The next step, C, is a max-pooling step.  The example here shrinks each $2 \times 2$ area into one pixel, where the value of the output pixel is the maximum of the four original pixels.  This produces four new feature layers in the second hidden-layer group C; each feature map is now only one fourth of the size of those in the previous layer B, accounting for a 75% saving in computational resources.  There is no cross-talk among the feature maps in this pooling step; all we do is cut the resolution in half, feature by feature, which also brings the response signals 50% closer to each other.
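And the corresponding $2 \times 2$ max-pooling step, sketched on a small toy feature map:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Shrink each non-overlapping 2x2 block into one pixel by keeping
    its maximum value (assumes even height and width)."""
    H, W = feature_map.shape
    return feature_map.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.array([[0, 2, 0, 0],
               [0, 0, 0, 1],
               [3, 0, 0, 0],
               [0, 0, 0, 4]], dtype=float)
print(max_pool_2x2(fm))
# [[2. 1.]
#  [3. 4.]]
```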

Then another convolution layer, D, is used.  The difference here is that the convolution is a 3-D convolution, i.e., the weights cover not only the width and height of a sub-region in the previous layer C, but also the depth: it mixes signals from all the feature maps in the previous layer under that 2-D sub-region.  A convolution filter corresponding to the $2 \times 2 \times 4$ weight pattern shown in Figure 8 produces a new output feature map in the D group representing an eye feature.  It looks for an activated pattern consisting of a 2-pixel top-, a 2-pixel bottom-, a 2-pixel left- and a 2-pixel right-facing pattern in the previous layer C.  This filter means the neuron is detecting a black square; it produces a strong activation signal on its output heatmap wherever an eye (represented by a black square) is present.  Notice the color for the right eye is a bit lighter, as it is not a perfect match (but still a good match) to the eye filter, and therefore the activation is less strong.

Figure 9. Weights (filter) associated with the eye feature.

Similarly, layer D also contains a feature map representing where noses are and another representing where mouths are.  We then do another max-pooling, which also pulls the detected eyes, nose, and mouth closer to each other; now they all fall within a $3 \times 3$ neighborhood.  Layer F contains a new feature representing a face, whose $3 \times 3 \times 3$ filter convolves the signals from the eye, nose and mouth feature maps.  It generates an activated pixel in the output layer G, where we know a face is detected.  Here we see an important role of pooling: it brings lower-level features closer together, so that a higher-level feature can be constructed with a much smaller convolution window.  If our CNN did not contain any pooling layers, a face filter would require a convolution matrix of size $16 \times 16$, which would use too many parameters and contain unnecessary details.

The face filter is a 3D weight matrix, which can be written as:

$$w_{face} = \begin{bmatrix} w_e \\ w_n \\ w_m \end{bmatrix}, \\
w_e = \begin{bmatrix} 1 & 0 & 1  \\ 0 & 0 & 0 \\ 0&0&0 \end{bmatrix}, \\
w_n = \begin{bmatrix} 0 & 0 & 0  \\ 0 & 1 & 0 \\ 0&0&0 \end{bmatrix}, \\
w_m = \begin{bmatrix} 0 & 0 & 0  \\ 0 & 0 & 0 \\ 0&1&0 \end{bmatrix}.
$$

So we have now used convolution and max-pooling to build a CNN that is capable of producing an activated output when we show it a face.  If the input image were twice as large and contained a tiling of four faces, our output layer would contain four activated pixels on the face feature map, each signaling a face.
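Here is a small numpy sketch of this depth ("3-D") convolution, using the face filter written out above on hand-made eye/nose/mouth feature maps (toy values of my own choosing):

```python
import numpy as np

def depth_conv(feature_maps, w, b=0.0):
    """'3-D' convolution: the filter covers width, height AND all input
    feature maps (channels), producing one output feature map."""
    C, H, W = feature_maps.shape           # channels x height x width
    _, h, k = w.shape
    out = np.zeros((H - h + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = max(0.0, np.sum(feature_maps[:, i:i+h, j:j+k] * w) - b)
    return out

# Face filter from the text: eyes (channel 0), nose (1), mouth (2).
w_face = np.zeros((3, 3, 3))
w_face[0] = [[1, 0, 1], [0, 0, 0], [0, 0, 0]]   # two eyes on top
w_face[1] = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # nose in the middle
w_face[2] = [[0, 0, 0], [0, 0, 0], [0, 1, 0]]   # mouth at the bottom

maps = np.zeros((3, 4, 4))                       # pooled eye/nose/mouth maps
maps[0, 0, 0] = maps[0, 0, 2] = 1.0              # two eyes detected here
maps[1, 1, 1] = 1.0                              # nose
maps[2, 2, 1] = 1.0                              # mouth
print(depth_conv(maps, w_face, b=3.0))           # activation only where a face is
```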

Now imagine there are many other feature maps within each hidden layer, capturing all kinds of interesting features: simple features such as edges and diagonal lines, medium features such as rectangles and circles, more sophisticated features such as wheels or the front of a car, and then rather complex features representing faces, cars, bikes, dogs, etc.  We can see how simple features in the earlier hidden layers can be combined (depth-convolved) to produce medium-complexity features and then further combined to create more and more complex features.  The last output layer of the CNN can contain thousands of high-level features and their activation status.  Those features then serve as the input for an FNN, which typically ends with a softmax layer to produce probabilities for each object class.  The top-scoring classes represent the objects we see in the image (we use softmax, as this is a multi-class classification problem; see previous blog).

In practice, we certainly do not hand-construct the feature maps ourselves.  The parameters of the CNN filters are initialized with random values; we then simply feed the CNN with lots of input images and their true labels.  The loss function is a cross-entropy term quantifying the prediction accuracy.  Training and parameter optimization enable the CNN (including the FNN layers) to determine the optimal filters, i.e., features, for each layer.  As the CNN is trained on many faces (long, round, man, woman, etc.), there will be many neurons representing different faces; therefore, we do not rely on one face feature to recognize all faces.

Although the features found by a CNN are rarely as clean as in our face-detection example, they generally resemble such concepts.  An extremely nice video is available here; it should convince you that some CNN neurons (each representing a feature) indeed represent rather meaningful objects, e.g., an edge, a face, or text.


VGGNet for ImageNet


The huge success of CNN is reflected in its ability to classify images provided by ImageNet into 1000 classes; this is the well-known ImageNet competition.  The current state-of-the-art CNN classification has already surpassed the human error rate of 5.1% [3], therefore this task is considered a solved problem.  VGGNet (Figure 10) was a top performer in the 2014 ImageNet competition [4]; we discuss it because its architecture is very similar to our face-recognition example, i.e., very easy to understand.




Figure 10. VGG16 architecture. (source)

It takes an input image of size $224 \times 224 \times 3$, does two $3 \times 3$ convolutions followed by a max pooling, then repeats this a few times.  It gradually doubles the number of features, from 64 to 128, then to 256 and finally to 512.  Let us walk through the statistics in Figure 10: the input image has a size of 224 pixels, the size is preserved after convolution, then halved by POOL2 into 112; the cycle of convolution and pooling continues until we get $7 \times 7 \times 512$ at the last pool layer, that is, 512 features, each corresponding to a likelihood heatmap of size $7 \times 7$.  All these neurons become the input features for a 3-layer FNN that finally generates 1000 probability values for the 1000 object classes.  Why would one use two $3 \times 3$ convolution layers sequentially instead of just one convolution layer?  Two such $3 \times 3$ filters basically play the role of a single $5 \times 5$ filter; however, the former has only 20 parameters while the latter has 26 (counting biases), plus two layers can model more non-linearity than a single layer.  The saving is more significant if we replace a $7 \times 7$ filter with three $3 \times 3$ filters.  So CNNs tend to use multiple small filters rather than one large filter.  Strictly speaking, we are actually doing 3-D convolution, as the first filter is $3 \times 3 \times 3$, given a three-channel color input image; the second filter is $3 \times 3 \times 64$, sitting after the first hidden layer.
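A quick sanity check of this parameter arithmetic (single input/output feature, one bias per filter, as in the counting above):

```python
def conv_params(k, in_ch, out_ch, layers=1):
    """Weights plus one bias per output channel for `layers` stacked k x k convolutions."""
    total, ch = 0, in_ch
    for _ in range(layers):
        total += (k * k * ch + 1) * out_ch
        ch = out_ch
    return total

# single-feature toy comparison as in the text
print(conv_params(3, 1, 1, layers=2))  # two stacked 3x3 filters -> 20 parameters
print(conv_params(5, 1, 1, layers=1))  # one 5x5 filter          -> 26 parameters
```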

In addition, we see there are a total of 138 million parameters in VGGNet!  Among them, 102 million are used by the first fully connected layer.  So CNN layers are very efficient in terms of the number of parameters they consume, as they use small filters; the fully connected layers, on the other hand, are expensive to optimize.  Microsoft's ResNet [5] does a better job of extracting CNN features and requires only a single fully connected layer at the end for classification, therefore it is far more efficient.  ResNet was the champion of the 2015 competition, with an error rate of 3.57%!

Models such as VGGNet take a long time to train.  In our own projects, we often cannot afford such long training times and the CPU/GPU resources; instead we take advantage of a technique called transfer learning.  E.g., suppose we are tasked to classify a subset of ImageNet photos into either dog or cat, instead of the original 1000 classes.  VGGNet was trained to classify 1000 classes and is therefore not optimized for our simpler problem.  Why?  Say bone is one of the 1000 classes; VGGNet cannot use bone features to help dog recognition, as it was also tasked to distinguish bone from dog.  Now with only two classes, each can take advantage of features representing other classes to boost its confidence.  But to train VGGNet from scratch would be too costly.  We know almost all the key features required to distinguish a dog from a cat are already captured in the CNN portion of VGGNet, therefore we can freeze that part and only replace the last three fully connected layers with one fully connected layer producing a single output, representing the probability of the dog class.  The new network can be optimized much more quickly.
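A minimal transfer-learning sketch with torchvision (assuming a reasonably recent PyTorch/torchvision install; the learning rate and the single-logit head are placeholder choices of mine, and the training loop is omitted):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a VGG16 pre-trained on ImageNet (newer torchvision prefers the
# `weights=` argument; `pretrained=True` still works but is deprecated).
vgg = models.vgg16(pretrained=True)

# Freeze the convolutional feature extractor so its weights stay fixed.
for p in vgg.features.parameters():
    p.requires_grad = False

# Replace the three fully connected layers with a single linear layer
# producing one logit (probability of "dog" after a sigmoid).
vgg.classifier = nn.Linear(512 * 7 * 7, 1)

# Only the new head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(vgg.classifier.parameters(), lr=1e-3, momentum=0.9)
```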


Miscellaneous Topics


Here we discuss some non-technical topics of interest and see how CNN can help us better understand some recognition activities.

Objects of Different Scales


The CNN described above can deal with object translation, but it does not deal with scaling.  If our CNN was only trained with faces of one scale and we then feed it faces of other scales during application, it is likely to fail.  If the eyes are much bigger than what it saw during training, the CNN is still looking at a pupil at its last layer, so there will be no activation on the eye feature map in the earlier layer; by the same argument, the face neuron will not be activated either.  When the input face is too small, it already converges into a tiny $2 \times 2$ eye-, nose-, mouth-like feature at a layer earlier than where these features are expected, so it will be missed as well.  To overcome this, some people use a multi-scale CNN [6] consisting of multiple independent CNNs, each trained with input images at a different scale.  Their outputs are combined for the final classification.  You can picture this approach as taking the input image, shrinking and magnifying it into multiple copies, and then sending them to different people specialized in looking at large or small faces, respectively; one of them will probably be able to recognize the object.

Instead of using multiple CNNs, we could probably train one CNN using an image augmentation technique.  That is, we turn one training image into multiple images of varying resolutions, then use all of them to train one single CNN.  With the CNN being trained on objects of different scales, it will automatically construct features of varying scales and be able to deal with the scaling challenge.  This certainly implies we need a lot more feature maps in this CNN: a large eye feature and a small eye feature, ...  The human brain contains 86 billion neurons, comparable to the 300 billion stars in the Milky Way [7].  Therefore, our brain can afford lots of neurons to form a rather gigantic network to cope with the scaling issue.  Considering that the pair of eyes of a three-year-old kid has already consumed hundreds of millions of images, the human CNN does not lack training data [1], so this could work for real.
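For example, a hypothetical torchvision augmentation pipeline that presents the same photo at many apparent scales during training might look like this (the crop range and flip are placeholder choices):

```python
from torchvision import transforms

# Each training photo is randomly cropped and rescaled (and flipped), so one
# image effectively becomes many images in which the object appears at
# different scales.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),  # vary apparent object size
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# `augment(pil_image)` would then be applied on the fly during training.
```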

Why could a CNN trained this way recognize both a tiny face and a large face?  A tiny face might be recognized by a small-face feature map in one of the early layers of our augmentation-trained CNN; how would this activation survive the multiple convolution and pooling layers afterward?  This is at least mathematically possible.  First, max pooling is different from average pooling: an activated signal on a heatmap does not disappear after pooling.  Second, an activation can survive future convolutions if the convolution filter happens to be a trivial filter, for instance:

$$w = \begin{bmatrix} 0 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$

is a $5 \times 5$ filter that preserves the signals of a feature map in its previous layer.  Thus a CNN trained on small faces can construct a neuron capturing a tiny face at one of the earlier layers, then build an identity neuron in each of the following layers to simply propagate the activated face signal all the way to the output layer.  The human brain likely consists of CNNs of mixed layering structure, so the tiny-face neuron could be wired directly to the output layer.  ResNet introduces a shortcut wiring structure, which probably also enables it to efficiently relay early-detected features toward the back.


Illusion - Empirical Activation


In the CNN model, the last feature layer contains thousands of features, each giving an activity representing how strongly the feature is present in the input.  Our brain only sees some concepts, which might be just an unordered or loosely ordered collection of such activated features.  E.g., the concept of a beach scene is a list of features such as beach, sea, sky.  In the left side of Figure 11, we easily see a smiling president.  This might be because we have a "smiling" feature neuron activated, due to its underlying features such as smiling mouth and smiling cheeks being activated.  A "face" neuron is also activated.  Although our CNN does not rotate the image $180^\circ$, we have seen enough upside-down faces that we have an upside-down face neuron for such cases.  With both neurons activated, we see the concept of "smiling"+"face", which equals "smiling face".  Nothing is alarming to our brain.  On the contrary, the right-side photo looks weird, although we all know it is simply the left photo rotated.  Here, our "upright face" neuron is activated.  However, none of the typical facial-expression neurons is strongly activated, at least not those normally associated with the president.  So we see an "eerie face".  We recognize the president, as there is a sufficient number of president-specific features, but we have reservations, as some features that are expected to be activated stay silent.

Figure 11. Smiles and face are recognized independently.

Figure 12 shows a study where researchers found that the time it takes to recognize a person is shortest with the normal photo, slower with one eye moved up, and slowest with two eyes moved up [8].  This seems rather straightforward to explain with a CNN.  With the normal photo, many feature neurons fire up and collectively let us see Elvis or JFK easily.  When one eye is moved, the half face with the moved eye does not activate some feature neurons, so our concept of Elvis/JFK does not contain all the necessary feature members, and we are unsure whom we see.  Then our eyes explore the photo and focus on the normal half of the face; it immediately activates all the necessary feature neurons and the Elvis/JFK concept is seen.  With both eyes moved, our eyes wander around, and the Elvis/JFK face neuron is still not firing.  Then we focus on the top portion and ignore the bottom, then focus on the bottom half and ignore the top; the Elvis/JFK concept is probably the closest concept to match either output, therefore we go along with that decision.  I can feel I am doing a decision-making (reasoning) step here, unlike with the normal photo, where no decision making is needed.
Figure 12. Research by Cooper and Wojan (source)
In Figure 13, they sure look like two roads extending in two different directions, but in fact these are identical photos.  What might have happened is that we have seen many scenes in our life similar to the Flatiron building in New York City (Figure 14), where roads branch out in a "V" shape.  If there is a "V"-shape feature neuron in our CNN, that same neuron fires strongly when the two photos in Figure 13 are viewed at once.  Our brain has no way to suppress/ignore this "V" firing, and this feature contributes strongly to the concept that we are looking at two different roads.


Figure 13. Two identical photos look very different.

Figure 14. The Flatiron building in New York City.

CNN is also the hero behind many of the most well-known AI systems, such as AlphaGo/AlphaGo Zero.  For bioinformatics applications, CNN can also be applied to recognize one-dimensional patterns, such as binding motifs in DNA sequences.  For medical applications, it can be applied to identify 3-D tumor patterns in MRI or CT scans.

Let us look at a bioinformatics application next, where CNN is used to identify motifs for DNA/RNA-binding proteins (next blog).


Reference


1. https://www.ted.com/talks/fei_fei_li_how_we_re_teaching_computers_to_understand_pictures#t-370742
2. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
3. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
4. http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
5. https://arxiv.org/pdf/1512.03385v1.pdf
6. https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx069
7. https://www.nature.com/scitable/blog/brain-metrics/are_there_really_as_many
8. http://www.public.iastate.edu/~ecooper/Facepaper.html

Monday, February 19, 2018

Gene Expression Inference with Deep Learning

Bioinformatics Application of Deep Learning Technology

Resource


We will discuss the following paper: Chen et al., Gene expression inference with deep learning. Bioinformatics. 2016 Jun 15;32(12):1832-9.  The paper is available here.

All figures shown here should be credited to the paper.

Introduction


The main goal of deep learning (DL) (or a deep neural network - DNN) is feature learning, i.e., it finds a non-linear transformation function $\mathbf{z} = \phi(\mathbf{x})$ to project the raw input feature $\mathbf{x}$ into a new feature $\mathbf{z}$, hopefully so that the feature vectors in the Z space behave more linearly and linear regression or logistic regression performs better.  As a DNN can theoretically model any function (see the DL Introduction blog), we can train a DNN; the parameters that give better predictions arguably have found a better approximation of $\phi$.  Chen et al. used a classical fully-connected DNN (basically the black-box DNN model we imagined previously) for feature transformation, coupled with a traditional linear regression output layer.  I like this study because its design is very clean, the paper is very readable, and the conclusions are convincing.

The human genome contains about 20,000 genes.  Genes are expressed at different levels within cells.  Different cell types, different life-cycle stages, and different treatments all lead to changes in gene expression; the expression of genes controls which proteins are expressed and at what level, which subsequently determines all the biological activities in and among cells.  So the expression levels of the 20k genes basically form a feature vector fingerprinting the cell status.  For example, to understand the effects of 1000 compounds, we would need to treat cells with each compound and then measure 1000 gene expression profiles, one per compound.  As each expression profile costs about \$500, this is quite expensive.  In previous work by NIH LINCS, it was proposed that gene expression data are highly redundant; therefore, one only needs to profile the expression values of a subset of 1000 landmark genes, called L1000, and the expression values of the remaining genes can be computationally predicted.  Previously, for a given target gene $t$ to be predicted, linear regression was used and its expression was predicted based on:

$$\begin{equation} y_{i(t)} = \sum_{j=1}^{1000} w_j x_j + b_i, \end{equation}$$

where the $x_j$ are the measured expression levels of the L1000 genes.  In reality, however, gene expression is regulated by a non-linear system through complicated interactions of molecular circuits.  Therefore the dependency of $y_{i(t)}$ on the L1000 expressions must be a complicated non-linear function; this is where a DNN shines compared to traditional machine learning methods.  The authors did exactly this and showcased the power of DNN for feature transformation.

Results

D-GEX


A fully connected DNN, called D-GEX (which stands for Deep Gene Expression), is used as illustrated in the figure below.  A few details.  First, there are 943 (not 1000) landmark genes, because the study uses three different data sets and some landmark and target genes are excluded, as they are not contained in all three sets.  This also explains why there are 9520 (not ~20,000) target genes in the figure, as the others do not have test data available for error/accuracy evaluation.  Second, there appear to be two DNNs; this is because the memory requirement for predicting all 9520 genes is too much for a single GPU, so two GPUs had to be used, each responsible for predicting 4760 (randomly allocated) genes.  For our purpose, we can just consider this one fully connected DNN with 943 inputs and 9520 outputs.

The authors tested many different DNN hyperparameters, including the number of hidden layers [1, 2, 3] and the number of neurons within each layer [3000, 6000, 9000].  They used dropout regularization with rates [0%, 10%, 25%], momentum optimization, and Glorot weight initialization (all these concepts are explained in the Introduction blog).  These are the key ingredients, and the authors performed a full grid search of the hyperparameter space formed by the combinations of layer count, layer size and dropout rate.


About 100k gene expression samples were retrieved from the Gene Expression Omnibus (GEO) database and split into 80-10-10 training-validation-test datasets.  So the training matrix is $943 \times 88807$, the validation matrix is $943 \times 11101$ and the test matrix is $943 \times 11101$.  The loss function is defined as the least-squares error, as this is a regression problem:

$$\begin{equation} L= \sum_{t=1}^{T} \left[ \frac{1}{N} \sum_{i=1}^{N}{\left( y_{i(t)} - \hat{y}_{i(t)} \right)}^2 \right], \end{equation}$$

where the errors between the true values $y$ and the predicted values $\hat{y}$ are calculated and summed over all $T = 9520$ target genes and over all $N$ training samples within a batch.  Typically $L$ is used to optimize the network parameters $\theta$, while a separate accuracy function can be defined to measure the network performance; the latter is more interpretable and is used to calculate validation and test errors (for the purpose of cross validation or early termination).  The accuracy function typically is not suitable for parameter optimization, where the loss function has better gradient properties.  Here the error is defined intuitively as the mean absolute error:

$$\begin{equation} MAE_{(t)}= \frac{1}{N^\prime} \sum_{i=1}^{N^\prime} \left\vert \left( y_{i(t)} - \hat{y}_{i(t)} \right)\right\vert. \end{equation}$$
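A minimal sketch of a D-GEX-style network in PyTorch (my own illustration, not the authors' code); the tanh activation and the smaller hidden size used when instantiating the demo are placeholder choices.

```python
import torch
import torch.nn as nn

class DGEX(nn.Module):
    """943 landmark genes in, a few wide hidden layers with dropout,
    9520 target genes out; the final layer is plain linear regression."""
    def __init__(self, n_in=943, n_hidden=9000, n_layers=3, n_out=9520, p_drop=0.10):
        super().__init__()
        layers, d = [], n_in
        for _ in range(n_layers):
            layers += [nn.Linear(d, n_hidden), nn.Tanh(), nn.Dropout(p_drop)]
            d = n_hidden
        layers.append(nn.Linear(d, n_out))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = DGEX(n_hidden=3000)      # smaller hidden size to keep the demo light
loss_fn = nn.MSELoss()           # least-squares training loss, as above
mae_fn = nn.L1Loss()             # mean absolute error, the reported metric

x = torch.randn(8, 943)          # a fake batch of 8 samples
y = torch.randn(8, 9520)
pred = model(x)
print(loss_fn(pred, y).item(), mae_fn(pred, y).item())
```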

Comparison


D-GEX was optimized and its performance under various hyperparameters is shown in the figure below.  The best performance (the smallest overall error as determined by Equation 3) was achieved with three 9000-neuron layers and a 10% dropout rate, i.e., the most complicated DNN structure (D-GEX-10%-9000X3).  The error of 0.32 is significantly lower (a 15.33% improvement) than the traditional linear regression (LR) method (the horizontal dashed line).  (The figure is a bit misleading; its Y axis should have started from 0 instead of 0.3.)


More performance data are shown in the table below.  Even a single-layer NN performs better than LR.  Linear regression achieves an error of 0.378, and regularization (either L1 or L2) did not really boost its performance.  This is understandable: we have 88807 training samples and only 944 parameters to fit for a given target gene, so overfitting should not be a concern.  The k-nearest-neighbor (KNN) method, a non-parametric method, performed the worst.



In the figure below, each point represents a gene, where the X axis is its prediction error based on D-GEX and the Y axis is its prediction error based on either LR (left) or KNN (right).  We can see that for almost all target genes, D-GEX provides a more accurate expression prediction.  Based on the data in Table 1, the improvement of D-GEX over LR does not seem statistically significant, a 0.04 improvement against a 0.08 standard deviation.  However, I believe the authors should have used the standard error of the mean instead, or a paired t-test based on the per-gene reduction of error, as the improvement shown in the figure below is rather obviously statistically significant.


Why D-GEX Works Better?


We would hypothesize that this is because it transforms the input features non-linearly into a set of new features corresponding to the last hidden layer (3000, 6000, or 9000 new features), and those new features have much better linear behavior, which leads to better LR performance.  Bear in mind that the last layer of D-GEX is simply an LR method; all the magic happens before that.

To verify that the above hypothesis is indeed the reason, let us consider the two linear regressions below, one based on the raw input $x$ and another based on the D-GEX-transformed features $z$ (using the 10%-9000X3 model here):

$$\begin{equation} y_{i(t)} = \sum_{j=1}^{943} w_j x_j + b_i \\
y_{i(t)} = \sum_{j=1}^{9000} w_j z_j + b_i
\end{equation}$$

The authors compared the two groups of goodness-of-fit $R^2$ (after correcting the degrees of freedom) and obtained the figure below:



It shows that for all genes, the transformed features indeed are more linearly correlated with the target expression, achieving a higher adjusted $R^2$.  This clearly demonstrates the feature learning capability of DNN.
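As a sanity check of this idea on synthetic data (not the paper's data), the sketch below fits ordinary linear regression once on a raw feature and once on a crude stand-in for a learned hidden layer (50 random tanh units), then compares the adjusted $R^2$ values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    """R^2 corrected for the number of fitted coefficients."""
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-3, 3, size=(n, 1))                     # raw input feature
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=n)      # target, non-linear in x

# A crude stand-in for a trained hidden layer: 50 random tanh units.
W = rng.normal(scale=2.0, size=(1, 50))
b = rng.uniform(-3, 3, size=50)
z = np.tanh(x @ W + b)

print("adjusted R^2, raw feature:         ", adjusted_r2(LinearRegression().fit(x, y), x, y))
print("adjusted R^2, transformed features:", adjusted_r2(LinearRegression().fit(z, y), z, y))
```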

Inside D-GEX


DNN is generally a black box; it works magically better in most cases, but it is extremely hard to understand how it works.  Our brain works similarly: we often do not reason by rules.  A GO player has his own intuition about why a certain game layout seems better, based on gut feeling instead of rules of reasoning.  Nevertheless, the authors made a decent attempt to look inside the network.

The figure below visualizes connections in D-GEX-10%-3000X1, where only connections with weights 4 or 5 standard deviations away from the mean were retained, colored by sign (red for positive and blue for negative).



The observation is that nearly all input features are useful for prediction (each input neuron has retained connections); however, only a few neurons in the hidden layer are important (call them hubs).  These hub neurons also tend to be the ones useful for predicting the outputs.  Each input gene tends to be connected to very few hubs, and the same can be said of each output gene.  Biologically, this might be explained by each input gene being involved in one gene expression regulation module; the expression levels of module members are combined to determine whether the module is activated or deactivated.  The activation pattern of the key regulation modules can then control whether and how much each target gene is expressed.  The paper stopped here; however, I think it would have been interesting to take the genes associated with a certain large hub and run gene enrichment analysis, to see if we can indeed associate known gene regulation functions with these hubs.  Furthermore, we can run the same enrichment analysis for target genes associated with the same hub, and even for the combination of landmark and target genes associated with the same hub.  It appears the activities of these regulation hubs tend to be more decoupled from each other, therefore they can be linearly combined for more accurate target prediction without worrying as much about non-linear interactions as before.

D-GEX achieves a best error of 0.32; is this sufficient to say that L1000+D-GEX can replace experimental whole-genome expression profiling?  Since the expression of each gene was normalized to a standard deviation of 1.0, 0.32 roughly translates into a 32% relative error in expression prediction.  Biologists mostly care about the fold change $FC_{ij} = \frac{y_i}{y_j}$ of gene expression under two different perturbations, $i$ and $j$, which implies:

$$ \begin{equation} \frac{\delta FC_{ij}}{FC_{ij}} = \sqrt{{\left(\frac{\delta y_{i}}{y_{i}}\right)}^2 + {\left(\frac{\delta y_{j}}{y_{j}}\right)}^2 } = 45\%. \end{equation}$$

This accuracy is not very high, but it could be acceptable for many biological studies.
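As a quick arithmetic check of the propagation above (pure illustration):

```python
# Relative error of a fold change when each expression value carries ~32% relative error.
err = 0.32
fc_err = (err ** 2 + err ** 2) ** 0.5
print(round(fc_err, 3))   # ~0.453, i.e., roughly 45%
```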

Discussions

More Realistic Applications


The real L1000 prediction problem is actually more challenging than what is described above.  The reason is that the largest training data set comes from microarray data in the GEO database, where gene expression for both landmark and target genes is available from published studies.  The real application, however, is to predict target gene expression on a non-microarray platform, where only the landmark genes are measured, via a different technology, Next Generation Sequencing (NGS) (microarray is a much less used expression profiling platform nowadays).  Although microarray and NGS are supposed to measure the same underlying expression, the measurements from the two technologies are correlated yet not always consistent.  Therefore, the prediction performance of a D-GEX trained on the GEO microarray dataset but applied to landmark-gene input acquired with NGS would be much lower.

The authors performed another study using the GEO dataset for training, but used 2921 samples from an NGS dataset, the GTEx expression data, as the validation and test set, where data are available for both landmark and target genes (the paper does not mention whether GTEx was also used for validation; let us just assume so).  Indeed, the performance got much worse, and the best DNN configuration is D-GEX-25%-9000X2, as shown in the table below.  This implies roughly a 62% relative error in fold change prediction, so it becomes questionable whether the in silico predictions for the real L1000 application are sufficiently accurate.

The figure below shows how an increased dropout rate reduces the error by regularizing the training more aggressively.






Two observations from the table above: (1) The optimal dropout rate increases from 10% to 25%.  The explanation is that the expression distributions of the training set and the test set are now noticeably different; therefore, the influence of the training data should be attenuated, which requires more aggressive regularization to prevent overfitting, hence the higher dropout rate. (2) At the individual gene level, D-GEX still outperforms LR and KNN; however, there are certain target genes where KNN does a much better job (see the scatter plot below, right).



This is because KNN is a similarity-based method: it uses the NGS data itself as the training data to predict NGS targets.  D-GEX has to use microarray data as the training data to predict NGS targets, therefore it tackles a much harder problem.  Nevertheless, it is still very impressive to see that D-GEX performs better than LR for over 81% of genes, and better than KNN for over 96% of genes.


Comparison with SVM Regression

This study compared D-GEX to LR, a linear method.  It would have been much more interesting if the authors had also compared it to traditional non-linear regression methods, such as SVM regression, to see if the advantage of D-GEX still holds.  Compared with SVM, which relies on a particular kernel, i.e., a fixed form of non-linear transformation, DNN can theoretically identify better non-linear transformations.  I admit it would be hard to conduct SVM regression on this large dataset in practice, as there are 88807 training data points and kernel methods are computationally expensive at that scale.  Also, in the kernel method each landmark feature is equally weighted, which could be problematic.  One can imagine DNN still has a conceptual advantage over SVM, even though the two are not directly compared in this study.

Conclusion


Fully connected DNN can be applied to effectively transform features non-linearly, so that the new features are more linear-regression friendly.  Its application to gene expression datasets shows DNN consistently outperforms linear regression, which lacks this feature learning component.



Friday, February 2, 2018

Introduction to Deep Learning

Resource


In the past year or so, several high-quality deep learning materials have appeared.  If you start learning deep learning now, you are super lucky.

Recently my favorite deep learning book has become Francois Chollet's "Deep Learning with Python".  Chollet is also the author of the Keras package.  The example code in the book is significantly simpler to read, thanks to the Keras library.  Other books use code written in TensorFlow, the underlying library of Keras, and can be much harder to understand.  Chollet is not shy about providing his own insights on the field, especially the overview (Chapter 1) and the forward-looking opinions (Chapter 9), which I found extremely helpful.

My second favorite is Aurelien Geron's Hands-On Machine Learning with Scikit-Learn and TensorFlow.  I also read Learning TensorFlow: A Guide to Building Deep Learning Systems by Tom Hope, Yehezkel S. Resheff, and Itay Lieder, which I liked.  I also like the book Deep Learning with Keras by Antonio Gulli and Sujit Pal, especially its first chapter, where the authors carefully tested and compared how different hyperparameters affect the performance of the MNIST classification task.  Last, I cannot fail to mention the deep learning bible: Ian Goodfellow, Yoshua Bengio, and Aaron Courville's Deep Learning.  However, it is too heavy to be a good introduction; save it as an advanced-level book after you have gone through other introductory material.  If you can only read one book, definitely go with Deep Learning with Python.

I recommend the excellent video lectures: Stanford's cs231n (up to lecture 10)
https://www.youtube.com/watch?v=NfnWJUyUJYU&t=2s (2016)
Some newer topics can be found in http://cs231n.stanford.edu/syllabus.html (2017)

For Chinese readers, an excellent video course is offered by Hung-yi Lee.  Distinct from other lectures, Dr. Lee offers many of his own insights; this is extremely helpful, as a systematic theoretical framework is still unavailable for this field.
https://www.youtube.com/watch?v=CXgbekl66jc&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49

If you happen to have a subscription to https://www.safaribooksonline.com (not free), I recommend an excellent video course by Jon Krohn, named "Deep Learning with TensorFlow: Applications of Deep Learning to Machine Learning Tasks".

Feature Learning


First, a quick review.  Artificial Intelligence (AI) is "the effort to automate intellectual tasks normally performed by humans" [1].  So strictly speaking, Excel is an AI program, although we don't normally think of it that way.  The concept of AI is very broad: in many cases we come up with rules based on our human intelligence, then code these rules into a program.  The program can then convert data into output, as if our brain were behind it.  We are the intelligent piece here.  Machine learning (ML) is a type of AI where the machine writes the rules.  As demonstrated in Figure 1, in classical programming human intelligence supplies the rules, which is the hardest part of the job.  In machine learning, we only provide example data and answers; ML then writes the rules for us, so the machine provides the intelligence (i.e., artificial intelligence), and the output of ML is a set of rules.  These rules can then be used in the classical programming paradigm to answer future data.  This blog series is dedicated to the machine learning portion of AI.

Figure 1.  Taken from Figure 1.2 in Chollet "Deep Learning with Python", page 5.


We discussed some core machine learning theories in our previous blogs.  Machine learning actually consists of two parts: "feature learning" and "model optimization".  Previously, we mostly applied linear analysis tools such as linear regression, logistic regression, or SVM onto raw input feature vectors $\mathbf{x}$.  We have been assuming the input features are already well designed, e.g., they may be Scale-Invariant Feature Transform (SIFT) features for images [2], ECFP fingerprints or physiochemical properties for compounds [3], or sequence motifs (position weight profiles) for genes or proteins [4], etc.  Typically, the performance of our machines depends heavily on the quality of such features; good features such as SIFT required decades of hard work by domain experts.  If we ignore the feature learning portion, the model optimization portion is actually just a numerical global parameter optimization problem; therefore, people sometimes jokingly call machine learning "machine optimization", as there is normally no real "learning" component.  Where is the "learning" in machine learning?  The answer is "feature learning", i.e., projecting raw data into meaningful features that make the task easy, which has been the most difficult part of machine learning and requires serious human intelligence.  Previously, smart features such as SIFT or ECFP were learned predominantly by human beings instead of by machines.  The traditional machine learning field has focused on the optimization part of the equation, while doing very little on the feature learning part.  This situation has now been fundamentally changed by deep learning.  As we will learn later, deep learning devises a strategy for feature learning, which is new, while it mostly inherits the model optimization portion, which we have already learned.

Feature learning basically means we take the raw input feature vector $\mathbf{x}$, apply some non-linear transform, and turn it into a new feature vector $\mathbf{z}$; the goal is to find the optimal transformation function so that the best overall machine learning performance can be achieved.  The non-linear kernels in SVM took a nice step toward addressing the feature learning challenge, where the kernel effectively introduces a transformation function $\mathbf{z} = \phi(\mathbf{x}, \gamma)$, projecting $\mathbf{x}$ into $\mathbf{z}$.  $\phi$ is determined by the chosen kernel and its exact form can be tuned through its hyperparameter $\gamma$ (see previous blog).  There are still two limitations of the SVM kernel approach.  First, we only explore the family of transformation functions given by a specified kernel, such as a Gaussian kernel (e.g., Equation 22 in that blog).  That is, this feature learning strategy is not very flexible: if the optimal transformation function does not look like what the Gaussian kernel implies, feature learning will not perform well.  In reality, we can imagine $\phi$ may take many forms, so ideally we need a learning mechanism that is capable of exploring all possible functional forms.  Second, the kernel strategy requires us to calculate similarities among all pairs of training data points, as well as similarities between a future input and all training data points; therefore, kernel-based machines are too computationally intensive, in both memory and CPU, to be applicable to large data sets.

To summarize, a real machine learning process should include a feature learning capability, where we can explore the function space $\left\{ \phi \right\}$ to find an arbitrary optimal transformation by tuning model parameters.  That is, we want a model that can represent a Gaussian function, a polynomial function, a sine function, or basically any functional form!  The neural network (NN) is an answer to this requirement.  Figure 2 shows a single-layer neural network, where the input is our raw feature $\mathbf{x}$, the output is the learned feature $\mathbf{z}$, and we assume:

$$\begin{equation} z_i= \sigma(\sum_{j} w_{ij}x_j+b_i) \end{equation}.$$


Figure 2. One layer of a neural network, where each node is a linear combination of the previous layer's outputs, then subjected to a non-linear transformation to produce an output activation.
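In code, one such layer is just a matrix multiplication followed by an element-wise non-linearity; a minimal numpy sketch (the sizes and the choice of $\sigma$ here are arbitrary illustrative choices):

```python
import numpy as np

def dense_layer(x, W, b, sigma=np.tanh):
    """One fully-connected layer: z_i = sigma(sum_j W[i, j] * x[j] + b[i])."""
    return sigma(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # raw input features
W = rng.normal(size=(3, 4))     # weights mapping 4 raw features to 3 new features
b = np.zeros(3)                 # offsets
print(dense_layer(x, W, b))     # the learned (transformed) features z
```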

For simplicity, we will assume the offset term $b_i$ is already included in the linear transformation $\mathbf{w}_i\mathbf{x}$, so $\mathbf{w}_i\mathbf{x}$ implies $\mathbf{w}_i\mathbf{x}+b_i$.  Let us assume for now that the non-linear function $\sigma$ is the $\text{sign}$ function (i.e., a step function); we can then argue that a sufficiently large single-layer neural network can approximate any arbitrary transformation function $\phi$.  Consider the special case where the input contains only one feature -- a scalar $x$.  We have two neurons $z_1$ and $z_2$ defined as (Figure 3):

$$\begin{equation} \begin{aligned} z_1 &= \text{sign}(x-a) \\ z_2 &= \text{sign}(a+\delta-x) \\ f(x) &= \frac{h}{2}z_1 + \frac{h}{2}z_2 \end{aligned} \end{equation},$$

The equation above implies our NN parameters are: $w_1 = 1$, $b_1 = -a$, $w_2=-1$, and $b_2=a+\delta$.  The output $f(x)$ is a weighted sum of $z_1$ and $z_2$, which corresponds to a function element starting at $a$ with width $\delta$ and height $h$ (Figure 3).

Figure 3. A pair of $z$ nodes combined with a non-linear sign function can produce a function element of width $\delta$ and height $h$, sitting at location $a$.

Now, with many such $z$-pairs, we can construct multiple function elements at various locations with various heights, which can then be combined into a "bar graph" to approximate any arbitrary function $f(x)$ in the output node (Figure 4).  Notice that for each given $x$ value, only one pair of $z$ neurons is activated, which gives the output $f(x)$.

Figure 4. Using many of the structural elements shown in Figure 3, we can approximate any $f(x)$ with a single NN layer.
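A minimal numpy sketch of this construction (my own toy example): each pair of sign neurons produces one bar, and summing many bars approximates a target function, here $\sin(x)$:

```python
import numpy as np

def bump(x, a, delta, h):
    """One z-pair: w1 = 1, b1 = -a; w2 = -1, b2 = a + delta.
    Returns ~h on [a, a + delta) and ~0 elsewhere."""
    z1 = np.sign(x - a)
    z2 = np.sign(a + delta - x)
    return 0.5 * h * z1 + 0.5 * h * z2

xs = np.linspace(0.0, 2 * np.pi, 1000)
delta = 0.1
approx = sum(bump(xs, a, delta, np.sin(a + delta / 2))
             for a in np.arange(0.0, 2 * np.pi, delta))
print(np.max(np.abs(approx - np.sin(xs))))   # small; shrinks as delta decreases
```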

A more careful proof can be found in [5], and there are more rigorous proofs allowing $\sigma$ to take other non-linear forms.  Based on this simple example, let us take it for granted that a NN can approximate any transformation function, and that the exact shape of the function is represented by the network parameters $\mathbf{w}$ and $\mathbf{b}$.  Certainly the ability of a NN to mimic a function depends on how many neurons it uses, but at least NN provides a framework to mimic any function.  Once the weights in a NN are fixed, $\mathbf{x}$ is transformed into $\mathbf{z}$ accordingly, which solves the feature learning problem; then a traditional linear regression or logistic regression engine discussed previously can be hooked to the $\mathbf{z}$ output to complete the machine optimization part of the puzzle.  The combination of NN feature transformation (or feature learning) and the traditional regression/classification engine produces a mighty machine that can consume raw feature input directly, as it is now theoretically capable of learning any $\phi$ without the need for human feature engineering.  The optimal parameters of the machine can be determined by minimizing the loss function appropriate for the task at hand; popular loss functions are described in the multiclass classifier chapter.  From now on, we can approximately equate a neural network with an arbitrary unknown function $f$, i.e., $NN = f(x)$.  I cannot overemphasize how much this view has helped me in understanding deep learning: as long as we can describe a problem as a function, we are able to apply deep learning to solve it by approximating that function.  We will use the GO game as an example later to further illustrate this philosophy.

The general NN architecture is outlined in Figure 5, where we use a NN layer for the $X \to Z$ transformation (from the input layer to a hidden layer), then use a classical ML layer (which may not contain $\sigma$) to regress or classify (from the hidden layer to the final output layer).  The $X$ layer is called the input layer; the last ML part generating the output is called the output layer; all the network structure between the input and the output layer (which can be multiple layers, although only one is shown in Figure 5) is called the hidden layer(s).  In practice, since the regression/classification components can also be represented as NN-style connections (Figure 5), all layers, from the input to the final output, are collectively called a neural network.  Compared to the traditional ML approach, NN goes one step further in its capability of exploring the feature transformation space.  Therefore, people sometimes rightfully call deep learning feature learning.  Chollet even defined traditional machine learning as "searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal" [6].  Deep learning pushes the envelope of that "predefined" space (such as one defined by an SVM kernel) into something far less bounded; the "feedback signal" refers to the loss function, which represents the machine optimization aspect.


Figure 5. An example NN binary logistic regression classifier.  The raw input features are first transformed into a new set of features, represented in the hidden layer; the new features are then linearly combined and softmax-scaled into a two-element probability vector, as explained in the multiclass classifier chapter.  Although the main contribution of the NN is feature learning, we call the complete network, including the logistic regression components, one NN.  There can be multiple hidden layers stacked one after another, which provides depth and therefore leads to the names deep neural network (DNN) and deep learning.
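A minimal Keras sketch of this architecture (assuming a TensorFlow/Keras installation; the input size and the number of hidden neurons are arbitrary illustrative choices):

```python
from tensorflow import keras

model = keras.Sequential([
    # Hidden layer: the feature-learning transformation x -> z.
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    # Output layer: the logistic-regression-style softmax over two classes.
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```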

GO Game as a Function


Alpha GO (and its second generation, Alpha Zero) is a well-known example of what deep learning can do.  The GO board is a $19 \times 19$ grid, where each location can be in one of three states: unoccupied, or occupied by a black or a white stone.  The number of possible GO layouts is unimaginably huge, $3^{361} \approx 10^{172}$; even after excluding non-sensible layouts, there are still about $10^{170}$ possibilities [7].  Given the current board layout $L$ as the input, GO players identify some regions of interest as key battlegrounds or strategic territories, within which they consider the next move.  Then they evaluate those move candidates for their contributions to the final victory, oftentimes carrying out mental simulations for a number of virtual steps, and pick the move that feels optimal to them.  So making a GO move is equivalent to solving the following problem:

$$\begin{equation} m^* = \arg \max_{m} J(L^\prime(L, m)), \end{equation} $$
where $m$ is a move candidate, which changes the current layout $L$ into a new layout $L^\prime$, and $J$ is the scoring function for a layout, measuring how likely it is to lead to a win at the end.  $m^*$ is the move that generates the best possible new layout.  $J$ obviously is an extremely complicated function: it has to integrate over all possible future moves by both players until the end of the game and weight each move by some probability function based on the opponent.  Each GO player has his/her own $J$; when they look at a board layout, they have a "gut feeling" of how promising the current layout $L$ is (taking all its future evolving trajectories into account as well), therefore they are able to compare several moves, repeatedly apply that gut feeling, and pick the best move at the end.
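Schematically, the move selection amounts to the few lines below; everything in the sketch (the layout representation, `apply_move`, and the toy scoring function standing in for a learned $J$) is hypothetical and only illustrates the arg-max structure:

```python
def best_move(layout, candidate_moves, apply_move, J):
    """Pick the move whose resulting layout scores highest under J."""
    return max(candidate_moves, key=lambda m: J(apply_move(layout, m)))

# Toy stand-ins: a layout is a dict of occupied points, a move is a grid point.
toy_layout = {}
toy_moves = [(3, 3), (15, 15), (9, 9)]
toy_apply = lambda L, m: {**L, m: "black"}
toy_J = lambda L: -sum((x - 9) ** 2 + (y - 9) ** 2 for (x, y) in L)  # prefers the center
print(best_move(toy_layout, toy_moves, toy_apply, toy_J))            # (9, 9)
```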

For comparatively simpler games such as chess, $J$ can rely on heavy simulations and knowledge derived from past games and be calculated reasonably well, given sufficient computing power like IBM's Deep Blue [8].  The scoring function $J$ in GO has to take a feature vector of 361 elements, each of which can take one of 3 values.  This function is surely intractable to construct explicitly.  Each GO player's years of training help him/her model this function as well as possible, but are still limited by the number of games they managed to play and with whom they played.  As there is no viable way to construct $J$ directly, what Alpha GO does conceptually is use a deep neural network to approximate this function; with the millions of training examples it receives and millions of artificial neurons, its approximation to $J$ is more accurate than any human player can achieve using biological neural networks.  Alpha GO certainly uses lots of other ingenious solutions, e.g., ways to reduce the search space of ${m}$ so that it only evaluates those moves that are promising (where lots of previous research had been focused), but the core component is its use of a NN to model the scoring function $J$.  In fact, theoretically we could even use a NN to model the $\arg \max$ operation as well, as outputting the best move $m^*$ given a layout input is also a function!

Real life is complex, and we have learned to develop intuitions or experience for many complex tasks.  It is very possible that these intuitions are basically a neural-network approximation, in our brain, of the optimal true decision function in reality.  We continually reevaluate and improve the model based on the lessons life teaches us; as a result, some people have developed better models than others, but all are approximations to the ultimate truth.  The important concept here is to appreciate that a NN essentially does function approximation.

For traditional machine learning problems, the form of $J$ is a loss function, which can be precisely defined as previously described here.  However, $J$ in GO is largely unknown, beyond the fact that it is a function.  The training signal there comes from the final outcome of a game, win or lose, which propagates back as a decaying reward for all moves in the game.  Therefore, the shape of $J$, i.e., the parameters of the NN, is optimized within a framework that maximizes the winning outcome over millions of games of playing experience; this type of learning problem is different from either supervised or unsupervised learning and is called a reinforcement learning problem (I do not have experience in this area yet).


Deep Learning Implementation

"Deep"


The single-layer NN described above is not used in practice, as it is inefficient at function approximation: a decent approximation would require too many $z$ neurons in that one layer.  Function modeling is much more efficient if we allow multiple layers of neurons.

In the example below, our target function is a red curve consisting of three similar peaks.  Say each peak can be approximated reasonably well by ten blue neurons; it would then require 30 neurons to approximate this red function using a single-layer neural network.  On the right of Figure 6, a two-layer architecture is used instead, where the ten neurons in the first layer construct the structural element of one peak; in the second layer, only three neurons are sufficient to place the peak at three different locations and modulate their heights accordingly.  By reusing the structural element (peak) constructed in the earlier layer to model the repetitive components of the target function, we make much more efficient use of the network parameters.  There are often many repetitive patterns within real-life functions; therefore, a multi-layer neural network is more cost-effective and easier to train.

Figure 6. The red target function is modeled by a neural network.  On the left, a single layer of neurons is used; it requires ten neurons to model each peak, so a total of 30 neurons are needed.  On the right, the first layer uses ten neurons to model one peak, and the second layer uses only three neurons.  Each neuron in the second layer just reuses the feature (peak) already made available by the first layer, stretching/shrinking it and placing it at one of three locations.  This way, the two-layer architecture uses far fewer parameters.


Another reason to prefer DNN is hierarchical feature transformation: layer by layer, it allows simpler features to be constructed at earlier layers, and those simpler features are then combined and reused to construct more complex features at deeper layers.  It would require many neurons if one had to construct the complex features in a single step.  Figure 7 shows that many icons can be built at the third hidden layer by allowing only a horizontal and a vertical line-segment feature to be constructed in the first layer.  Typically, the number of possible features grows exponentially as the depth increases.  Another analogy: instead of drawing every leaf and branch to show a forest, we draw a leaf first, copy and paste the leaf into a branch, copy and paste the branch into a tree, and copy and paste the tree into a forest.  Therefore, many non-linear functions require a DNN to be modeled efficiently, i.e., a DNN can model more complex functions by allocating neurons into different layers rather than placing them all in one layer.


Figure 7. By using multiple NN layers, the complexity of the features the network can model grows exponentially.  Such complex features would require many nodes if only a single layer were used.

Non-linear Transformation


In the early days of deep learning, $\sigma$ took the sigmoid form, because it was believed to mimic biological neurons.  However, parameter optimization of such a DNN is hard.  To find the optimal weight parameters in a DNN, we typically use the stochastic gradient descent (SGD) method to descend the surface of the loss function $J$ in the NN parameter space $\theta$.  The gradient $\frac{\partial J}{\partial \theta}$ must be calculated using a back-propagation process.  We will skip the details of back propagation, as it is well described in the literature and is encapsulated as a built-in feature of the TensorFlow and Keras libraries.  Just think of it in a hugely simplified form:

$$\begin{equation} J(w, x) = J(\sigma(wx)), \end{equation}$$

where $w$ modulates $x$, the result is transformed nonlinearly by $\sigma$, and $\sigma$ is then used to eventually calculate $J$.  Therefore, we have:

$$\begin{equation} \begin{aligned} \frac{\partial J(w, x)}{\partial w} &= \frac{\partial J}{\partial \sigma}\frac{\partial \sigma(wx)}{\partial w} \\
&= \frac{\partial J}{\partial \sigma}\sigma(wx)(1-\sigma(wx))x. \end{aligned} \end{equation}$$

When an intermediate neuron carries a large value $wx$, either very positive or very negative, $\sigma(wx)$ is close to 1 or 0, so $\sigma(wx)(1-\sigma(wx))$ is close to zero and the gradient $w$ receives is very small (Equation 5).  How often does a neuron saturate?  Quite often.  As explained later (weight initialization section), a neuron receives inputs from many parent neurons, so its input tends to be much larger than the output of a single parent neuron; when the weights are not very small, the summed signal tends to be large enough to fall into a plateau region of the sigmoid function.  This is called the vanishing gradient problem, and it explains why a deeper NN is hard to train: the weights $w$ in earlier layers do not effectively receive feedback (gradient) from the loss function, so they cannot be tuned to improve NN performance [9].  People developed strategies such as training a DNN layer by layer, one layer at a time: the parameters of the first layer are optimized and then fixed, the parameters of the second layer are then optimized and fixed, etc.  However, this strategy is no longer used, thanks to the developments described below.
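A toy numpy illustration of the effect (not a full back-propagation): each sigmoid layer contributes a factor $\sigma(wx)(1-\sigma(wx)) \le 0.25$, so the accumulated product shrinks rapidly with depth, especially when pre-activations are large:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(10):
    wx = rng.normal(scale=3.0)      # a moderately large pre-activation
    s = sigmoid(wx)
    grad *= s * (1.0 - s)           # local derivative of the sigmoid
    print(f"after layer {layer + 1}: accumulated gradient factor = {grad:.2e}")
```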

The currently most popular practice is to replace the sigmoid function with the much simpler ReLU function, of the form $\max(wx, 0)$.  This naive non-linear transformation basically keeps a positive output and zeroes out negatives.  The gradient therefore no longer decays for positive $wx$, no matter how large its value is.  For negative $wx$, the gradient is zero and the corresponding weight receives no update from this particular training sample.  But when other training samples are used, $wx$ may become positive and the weight gets a chance to be updated.  Even if the gradient is zero within a training batch, so a $w$ skips an update iteration (the neuron is "deleted" temporarily), the neurons in the earlier layer do not depend on this short-circuited neuron alone.  They depend on all the neurons in the current layer; therefore, neurons in the earlier layer are still likely to receive non-zero gradient information, i.e., the short-circuiting does not propagate.  In practice, ReLU is one of the two main factors that make DNN trainable (the other being careful weight initialization, discussed below).  The layer-by-layer training strategy may still help to initialize a DNN with a decent starting point, but it seems no longer necessary when there is a reasonable amount of training data.  Most DNN can now be trained in one shot, as gradient information can be propagated many layers back thanks to the ReLU function.  With a better understanding of gradient propagation, very deep DNN, such as Microsoft's 152-layer ResNet, can be trained, as long as there is wiring in the DNN that helps pass the gradient to early layers [10].

We may still wonder why biological systems use sigmoid-like response functions and perform well.  My thought is that the issue with the sigmoid probably originates from our simplification of the NN into a layered structure.  A biological NN probably does not contain clear layer separations, i.e., there are neurons directly connecting the input layer to the output layer, others connecting them through one hidden layer, two hidden layers, etc.  So a real biological NN is a mixture of NNs with varying numbers of layers; it is a general network instead of nicely layered sandwiches.  This implies there exist shortcuts that pass signals by skipping some intermediate hidden layers, and they also serve as expressways to route gradient traffic to earlier layers during the back-propagation stage, so the network remains trainable even when it is deep.  We have seen similar ideas used in ResNet, where "skip connections" were introduced [10]; ResNet can have more than one hundred layers and remain perfectly trainable.

Initialization


When the vanishing gradient problem was identified as a cause of the difficulties in training DNN, it was realized that the initial values of the weights play an important role in the trainability of a DNN.  Taking the sigmoid function as our $\sigma$: if the weights are really small, we are basically using its middle, nearly linear region, which loses the benefit of non-linearity.  However, when the weights are not small, the summed input sent to the sigmoid function will be large (either positively or negatively).  Because a neuron in the next layer is connected to many input neurons, the weighted sum easily leads to signal saturation and causes the vanishing gradient problem described above.  Therefore, the idea is to initialize the weight parameters in such a way that the variances of the input and the output of a neuron layer roughly maintain the same amplitude.

This is clearly explained in a blog here [11], where we aim to keep the ${\rm var}(y) = {\rm var}(x)$ for:
$$\begin{equation} y = \sum_i w_i x_i. \end{equation}$$
Given:
$$\begin{equation} \begin{aligned} {\rm var}(y) &= \sum_i {\rm var}(w_i x_i) \\
&= \sum_i {\rm var}(w_i) {\rm var}(x_i) \\
&\approx n {\rm var}(x) {\rm var}(w) \\
&\approx n {\rm var}(y) {\rm var}(w).
 \end{aligned} \end{equation}$$
Here we assume ${\rm mean}(w)$ and ${\rm mean}(x)$ to be zero in the second step; we also assume the variances are the same across $i$ for ${\rm var}(w_i)$ and ${\rm var}(x_i)$, respectively.  By requiring ${\rm var}(y) = {\rm var}(x)$, we obtain ${\rm var}(w) = \frac{1}{n}$, where $n$ is the number of input neurons.

Currently it is popular to use Xavier initialization (a.k.a. Glorot initialization) or He initialization [10].  For the ReLU function, according to He initialization, we can initialize weights with a uniform distribution over $[-r, r]$, where:

$$\begin{equation} r = \sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}} \end{equation}$$

Here $n$ effectively becomes the average of the input and output neuron counts, so that the initialization takes into account both forward and backward propagation.  Initialization can prevent signal saturation for a while, but the weights will be adjusted during training, so this strategy alone may not last.  The ReLU solution mentioned above is the more fundamental fix (both should be used).
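In code this is a one-line choice per layer; a minimal sketch (assuming Keras is installed) plus the explicit numpy version of the bound $r$ above, with made-up layer sizes:

```python
import numpy as np
from tensorflow import keras

# Keras initializers: Xavier/Glorot for sigmoid/tanh layers, He for ReLU layers.
layer_sigmoid = keras.layers.Dense(100, activation="sigmoid",
                                   kernel_initializer="glorot_uniform")
layer_relu = keras.layers.Dense(100, activation="relu",
                                kernel_initializer="he_uniform")

# Explicit uniform bound from the equation above (layer sizes are hypothetical).
n_in, n_out = 943, 9000
r = np.sqrt(2) * np.sqrt(6.0 / (n_in + n_out))
W = np.random.uniform(-r, r, size=(n_in, n_out))
print(r, W.std())
```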


Batch Normalization


To keep preventing signal saturation as optimization iterations proceed, we cannot rely totally on the initialization process.  Another strategy, called batch normalization, automatically scales the output of each layer into the right range empirically.  It works as follows:

$$ \begin{equation} \begin{aligned}
\mu_B &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; \mathbf{x}_i \\
{\sigma_B}^2 &= \frac{1}{m_{B}} \sum_{i = 1}^{m_B} \; (\mathbf{x}_i - \mu_{B})^2 \\
\hat{\mathbf{x}}_i &= \frac{\mathbf{x}_i - \mu_B}{\sqrt{ \sigma_B^2 + \epsilon}} \\
\hat{\mathbf{z}}_i &= \gamma \hat{\mathbf{x}}_i + \beta. \end{aligned} \end{equation}$$

For a given mini-batch containing ${m_B}$ data records, it calculates the mean $\mu_B$ and standard deviation $\sigma_B$ and uses them to first standardize the input $\mathbf{x}$ into $\hat{\mathbf{x}}$ with zero mean and unit variance.  However, since this standardized distribution may not be optimal for training, it then transforms $\hat{\mathbf{x}}$ further into $\mathbf{z}$ through scaling and shifting with $\gamma$ and $\beta$, respectively.  $\gamma$ and $\beta$ are treated as model parameters, which will be optimized as part of $\theta$.  At test time, we do not have a mini-batch, but only one test sample.  In that case, $\mu$ and $\sigma$ come from estimates made during training, reflecting the average mean and standard deviation of the training data set.
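The training-time computation maps directly onto a few numpy lines; a minimal sketch of the equations above (in Keras one would simply insert a `BatchNormalization` layer instead):

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Batch normalization of one mini-batch X of shape (m_B, n_features)."""
    mu = X.mean(axis=0)                       # mu_B
    var = X.var(axis=0)                       # sigma_B^2
    X_hat = (X - mu) / np.sqrt(var + eps)     # standardized activations
    return gamma * X_hat + beta               # learned re-scaling and shift

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(32, 4))     # activations far from zero
Z = batch_norm_train(X, gamma=np.ones(4), beta=np.zeros(4))
print(Z.mean(axis=0).round(3), Z.std(axis=0).round(3))   # ~0 and ~1 per feature
```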


Optimization

Traditional machine learning methods use vanilla gradient descent (GD) to minimize the loss function.  GD/SGD rolls a ball downhill; the speed of the ball is proportional to the slope, so the ball rolls slower and slower as it approaches the bottom.  GD works well for simple functions, but not for $J$ in deep learning.

A DNN contains tens of thousands of parameters.  For a three-layer DNN, say the input feature vector contains 10 elements and we have 100 neurons each in the first and second hidden layers, then an output layer of one neuron.  The first hidden layer contains $10 \times 100 + 100 = 1100$ parameters, where 1000 are weight parameters associated with the connections between 10 inputs and 100 neurons, and 100 are bias parameters, one per neuron.  The second hidden layer contains $100 \times 100 + 100 = 10100$ parameters; the output layer contains $101$ parameters.  So there are 11301 parameters in total.  DNN systems can easily contain millions of parameters.  Some new challenges arise when the parameter space is this vast.
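The same count can be verified with Keras (assumed installed); the `summary()` call reports the per-layer and total parameter counts:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(100, activation="relu", input_shape=(10,)),  # 10*100 + 100 = 1100
    keras.layers.Dense(100, activation="relu"),                     # 100*100 + 100 = 10100
    keras.layers.Dense(1),                                          # 100*1 + 1 = 101
])
model.summary()   # total: 11,301 trainable parameters
```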

First, we need to calculate gradients from exact gradient formulas, instead of estimating gradients numerically.  To estimate gradients numerically, we would need to change one parameter $\theta_i$ slightly with the remaining 11300 parameters fixed, then carry out a forward propagation with the training batch in order to calculate a finite difference for $\frac{\partial J}{\partial \theta_i}$.  Repeating this for all 11301 parameters would be far too expensive.  We need all 11301 partial derivative formulas to be known somehow, so we can plug in values to get the 11301 gradient values.  We skip the details (read Appendix D: Autodiff in [12] to appreciate how it works), but this requirement implies all deep learning packages have to represent the DNN as a computation graph, so that they know how all parameters contribute to the loss function and can compute derivatives exactly.  Programming with deep learning modules is therefore drastically different from traditional machine learning programming, as the statements are not executed sequentially but rather assembled into a graph.  (BTW, how do we compute the derivative of the ReLU function, as it is not differentiable at $x = 0$?  Do not let this bother you; any value between the two one-sided derivatives works, and in practice it is simply set to a fixed value such as 0 or 0.5.)  If one programs with lower-level libraries, such as TensorFlow, the logic can be scattered everywhere and the code is hard to read.  Higher-level libraries, such as Keras, wrap a lot of that logic so that the code looks much more like a traditional procedural programming workflow.  Therefore, I would recommend coding with Keras.

Second, there are millions of global minima.  All $m$ neurons within a layer are basically equivalent in the architecture, so if we find a global minimum solution, we can permute the order of the neurons in $m!$ ways and still get an equivalent solution.  With $n$ layers, we have $(m!)^n$ equivalent solutions.  This does not mean we can easily find one of the global minima; in fact, we very likely end up in a local minimum as our final solution, as there are even more local minima and finding a global minimum among many local minima is a very hard problem.  But do not let this bother you too much.  Say each of our brains is a giant neural network; our weight parameters are drastically different from each other's, and the weights change every second as new connections are formed, old ones are destroyed, and existing ones are strengthened or weakened.  All these different DNN solutions have no trouble guiding us to walk, drive, speak, and listen effectively.  The me today is still largely the me yesterday.  None of us has the best DNN weights for walking; yes, some might walk slightly better than others, but we are nearly all perfect walkers for all practical needs.  Even if we cannot be Einstein, it does not mean we cannot be good physicists.  Our goal in a DNN application is to create a tool that performs well; the difference between well and the best may not be that big of a deal.

Third, it is actually hard to find even a local minimum.  A local minimum requires $\frac{\partial J}{\partial \theta} = 0$ for every parameter $\theta$.  However, when a partial derivative is zero, the function along that parameter dimension could be at a maximum, a minimum, or a saddle point.  With tens of thousands of parameters, when we first find a stationary point, it is almost always the case that $J$ is at a saddle point along certain dimensions, i.e., we deal with saddle points, not minima, most of the time.  The biggest challenge for DNN optimization is to overcome saddle directions.  Under the vanilla GD strategy the ball moves slower and slower as it approaches a saddle point, therefore it does not work well.  We need the ball to behave like a ball in the physical world, carrying some momentum so it can roll over a saddle point and continue its downhill journey.  This can also help the ball climb over a small bump and potentially land in a much better local minimum.  This observation has led to momentum optimization and other strategies.  Here we only explain a popular one called Adam optimization.

Adam optimization is based on two ideas.  The first is that the speed of the ball should not be totally determined by the current slope; it should largely inherit its previous velocity and modify it (the momentum optimization just described above).  The second is that vanilla GD uses one constant learning rate $\eta$ for all parameters, whereas the rate should ideally be parameter-specific.  Learning should happen faster in low-gradient directions, as they are likely saddle directions, and slower in high-gradient directions (this is typically explained with the elongated-bowl example, where the largest-gradient direction is not the direction toward the minimum [13]); therefore, we want to adjust $\eta$ to reduce the impact of large gradients (this idea originated from AdaGrad optimization and was later improved in RMSProp optimization).  The Adam algorithm, combining research results from both areas, is outlined by the following three update rules:

$$ \begin{equation} \begin{aligned}
\mathbf{m}& \leftarrow \beta_1 \mathbf{m} + (1-\beta_1) \nabla_\theta J(\theta), \\
\mathbf{s}& \leftarrow\beta_2 \mathbf{s} + (1-\beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta), \\ \theta & \leftarrow \theta - \eta \mathbf{m} \oslash \sqrt{ \mathbf{s} }. \end{aligned} \end{equation}$$

Here, $\beta_1$ and $\beta_2$ are decay rates controlling how heavily information from past steps affects the current movement, and $\eta$ is the learning rate parameter.  A smaller $\beta$ means less influence from history.  $\mathbf{m}$ is a history-averaged gradient, where gradients with consistent directions build up and gradients that frequently change direction cancel out (the momentum optimization idea).  This part is easy to appreciate.  $\mathbf{s}$ is the history-averaged squared amplitude of the gradient, ignoring its direction.  The magnitude of this smoothed-out gradient is then used to modulate $\eta$ for each individual parameter direction; it encourages a large step if $\mathbf{s}$ is small.  The $\otimes$ and $\oslash$ operators just mean the calculations are element-wise, not matrix/tensor operations.
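A minimal numpy sketch of one Adam step following these rules (the published algorithm also bias-corrects $\mathbf{m}$ and $\mathbf{s}$ during early iterations and adds a small $\epsilon$ for numerical stability; only the latter is shown here):

```python
import numpy as np

def adam_step(theta, grad, m, s, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following the three rules above."""
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad * grad        # element-wise square
    theta = theta - eta * m / (np.sqrt(s) + eps)     # element-wise division
    return theta, m, s

theta = np.array([1.0, -2.0])
m, s = np.zeros_like(theta), np.zeros_like(theta)
grad = np.array([0.1, -0.5])                         # hypothetical gradient
theta, m, s = adam_step(theta, grad, m, s)
print(theta)
```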

I actually have not fully understood the part where $\mathbf{s}$ is used to modulate $\mathbf{m}$.  For the special example where $\nabla_\theta J(\theta)$ is a constant, $\mathbf{m}$ and $\mathbf{s}$ will converge to stable values:

$$ \begin{equation} \begin{aligned}
\mathbf{m}& = {\nabla_\theta J(\theta)} \\
\mathbf{s}&= \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta) \\
\theta & \leftarrow \theta - \eta \, \text{sign}\left(\nabla_\theta J(\theta)\right).  \end{aligned} \end{equation}$$

So the step would not contain any gradient magnitude information, only its sign, which does not make sense to me.  Probably this special case does not happen often and can therefore be ignored.  An interesting explanation is based on how frequently a parameter receives gradient feedback [14].  If many training data are insensitive to a particular parameter, i.e., its gradient is about zero most of the time but once in a while it receives some feedback from interesting training samples, then the history-averaged $\mathbf{s}$ is small; without the $\oslash \sqrt{\mathbf{s}}$ term, we would miss the golden opportunity to adjust the parameter.  So the learning rate should be increased to seize such a rarely occurring training opportunity.  One can argue that the momentum averaging also hurts infrequently updated parameters, but maybe that part can be alleviated by adjusting $\beta_1$.  In effect, Adam performs larger updates for infrequent parameters and smaller updates for frequent parameters.  Certainly this is a double-edged sword, as it magnifies noise as well.  In practice, Adam is currently one of the most popular optimization algorithms; it is considered capable of automatically adjusting the learning rate instead of relying on some hand-crafted learning rate scheduling protocol.  I guess we might just have to accept that Adam optimization somehow works in practice, even if we cannot fully explain its magic at this point.  This also reflects the current status of the deep learning field, which is still far from maturing into a science.


Dropout


We cannot run machine learning optimization without regularization; otherwise, we inevitably run into overfitting.  The more complex a DNN is, the more likely it will demonstrate unrealistically good performance on a limited training data set.  Our previous weapon for regularization was to include extra terms, such as lasso or ridge penalties, in the loss function to prevent network parameters $\theta$ from taking huge values.  However, a newer strategy called dropout has been discovered for DNN and has proven to be extremely effective against overfitting.

When dropout is applied to a layer with, for example, a rate of 0.4, the output of each neuron in the layer has a 40% chance of being zeroed out.  So at each training step, on average, only 60% of the layer's output is piped into the next layer.  Since dropout is random during the learning process, each neuron receives gradient updates about 60% of the time.  The benefits of dropout can be explained in two ways.

First, since each parent neuron could be dropped, no single neuron can be critical to the network.  This encourages some feature redundancy.  For example, we can recognize a human face based on the eyes or the nose alone, or both in combination.  If the network already contains an eye feature, there is no motivation for it to discover a nose feature, as the eye feature already does a pretty good job.  With dropout, the eye feature might be removed; therefore, discovering an additional nose feature becomes advantageous.  Thus dropout encourages more independent features, so that a subset of them can be dropped without affecting the accuracy of face recognition.  If the network has already discovered both nose and eye features, dropout can also help tune the activation parameters, so that a face neuron can be correctly fired by the nose or the eyes alone, without requiring both to be present in a future test set.  The final network, after being tortured by randomly losing an eye neuron or a nose neuron, is more robust for dealing with photos where part of the face is covered or contaminated by noise (Figure 8).

Figure 8.  Dropout encourages a network already containing an eye neuron to discover an additional nose neuron.  As a result, the presence of both neurons enables the network to recognize faces showing only the eyes or only the nose.


Second, dropout might effectively create an ensemble machine learning solution.  The random combination of neurons might be equivalent to a linear combination of many trimmed-down DNN models, each containing a random 60% subset of the original DNN (Figure 9).  An ensemble model has a theoretical advantage over a single model in that it reduces prediction variance, i.e., the dropout-applied model is more generalizable.  Bear in mind that deep learning at this stage is more a technology than a science, so many statements presented here are other people's or my own interpretations to help us accept why certain techniques seem to work better than others.  They might not turn out to be exactly correct in the future.

Figure 9. Dropout effectively makes the neural network an ensemble of many smaller networks.  With a 50% dropout rate, an M-neuron network could produce up to $2^M$ possible dropout architectures.  An ensemble classifier typically improves performance by reducing variance.


Although dropout and weight-penalty regularization can be applied simultaneously, dropout alone often seems sufficient.  One can tell whether further regularization is needed by checking whether the training error and validation error are close.  Dropout does not need to be applied to every layer, and the dropout rate should not be too high, as that might be too conservative and lead to underfitting.  Another point to be aware of is that dropout should be disabled at test (prediction) time, as we want all nodes to contribute; otherwise, predictions would not be deterministic.  In our example, the expected input when all neurons contribute at test time is about $\frac{1}{0.6} = 1.67$ fold larger, so at test time the output of each neuron should be scaled down by a factor of 0.6.  Here we use a 40% dropout rate as an example; it certainly is a hyperparameter subject to optimization in applications.
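A minimal numpy sketch of the mechanics just described (my own illustration; note that Keras' `Dropout` layer uses the equivalent "inverted" convention of scaling up by $1/(1-\text{rate})$ at training time instead of scaling down at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, rate=0.4, training=True):
    """Training: zero each activation with probability `rate`.
    Test: keep all activations but scale them by (1 - rate)."""
    if training:
        mask = rng.random(a.shape) >= rate        # keep ~60% of the activations
        return a * mask
    return a * (1.0 - rate)

a = np.ones(10)
print(dropout_forward(a, training=True))     # randomly zeroed activations
print(dropout_forward(a, training=False))    # deterministic, rescaled activations
```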


Hyperparameter Optimization


There are still many unanswered questions about the DNN architecture, including but not limited to how many layers to use, how many neurons per layer, what non-linear function to apply, what parameter initialization strategy to adopt, what optimization strategy is most effective, etc.  All of these are called hyperparameters.  We learned previously that to answer such questions, one would explore all combinations in the hyperparameter space and then find the one that yields the best performance, based on the training-testing cross validation strategy described here.  In practice, it is not realistic to explore this huge hyperparameter space due to the computational burden, therefore many heuristic exploration strategies are seen in applications.  We often need to rely on gut feeling and fix certain parameters, so that we can reduce the number and range of the hyperparameters to tune.  Typically it is considered more advantageous to adopt a random sampling approach than a grid search strategy when exploring the hyperparameter space [15] (Figure 10).  This is because, for each individual hyperparameter dimension, each random sample is likely to pick a different value, in contrast to the large redundancy embedded in the grid search strategy; a minimal sketch of such a random search is shown after Figure 10.  Conceptually, a seemingly beautiful alternative is to rely on a Gaussian Process (GP).  A GP makes a smooth Bayesian interpolation of the loss as a function of the hyperparameters, based on the sample runs already available.  It then recommends a new sample point in the hyperparameter space that is expected to produce the best gain in performance.  GP might be a topic for a separate blog.

Figure 10.  Not all hyperparameters are equally important.  In this example, the $X$ parameter is important and $Y$ is unimportant.  With the grid search strategy on the left, 1/3 of the searches are wasted, as exploring different $Y$ values with $X$ fixed is a waste of time.  On the right, a random search strategy allows us to explore more $X$ values, leading to an $X$ much closer to its optimum.
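A minimal sketch of random hyperparameter search (the parameter names and ranges below are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    """Draw each hyperparameter independently at random."""
    return {
        "n_layers": int(rng.integers(1, 4)),                 # 1 to 3 hidden layers
        "n_neurons": int(rng.choice([512, 1024, 2048])),
        "dropout": float(rng.choice([0.0, 0.1, 0.25, 0.5])),
        "learning_rate": float(10 ** rng.uniform(-4, -2)),   # log-uniform
    }

# Each trial would be trained and scored on the validation set; keep the best one.
trials = [sample_hyperparameters() for _ in range(20)]
print(trials[0])
```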

A Black-Box DNN Solution


Our introduction to DNN here is brief; there are many training materials now available that cover fully-connected DNN in great detail.  The focus of this blog is to explain the theoretical advantage of DNN and the key components of its implementation.  We can now mentally put these pieces together to form a general-purpose black-box DNN solution, which will serve as a generic machine for any typical regression/classification problem.

The thought exercise is to design a fully-connected DNN (all the NNs we have described so far are fully connected, i.e., all neurons in the previous layer are connected to all neurons in the current layer, in contrast to the convolutional NN to be discussed in the future) consisting of $n$ layers, each containing the same number $m$ of neurons.  The last layer is either a linear layer without non-linear transformation (if the problem is regression) or a softmax layer (if the problem is classification).  The corresponding loss function is cross-entropy for classification problems and least-square error for regression problems.  For each given set of hyperparameters (i.e., a certain DNN structure), we split the training samples into small batches and send them to the DNN batch by batch.  For each batch, we calculate the loss function and its gradient with respect to each network parameter $\theta$, then update the parameters based on the gradient information to hopefully minimize the loss.  A complete iteration over all training samples is called an epoch.  At the end of each epoch, we evaluate the validation error on a held-out data set.  After multiple epochs, the process is stopped when the validation error starts to rise.  As resources permit, DNNs for other hyperparameter settings (possibly suggested by a GP process) are trained as well.  Then the DNN with the best expected performance is chosen.  A very brief outline of the code is shown here.
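A minimal Keras sketch of this black-box recipe for the regression case (not the author's linked code; the toy data, layer sizes, and dropout rate are placeholder choices), with early stopping on the validation error:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=1000)    # toy regression target

n_layers, n_neurons = 2, 64
model = keras.Sequential()
for i in range(n_layers):
    kwargs = {"input_shape": (20,)} if i == 0 else {}
    model.add(keras.layers.Dense(n_neurons, activation="relu",
                                 kernel_initializer="he_uniform", **kwargs))
    model.add(keras.layers.Dropout(0.1))
model.add(keras.layers.Dense(1))                      # linear output for regression
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
model.fit(X, y, validation_split=0.2, batch_size=32, epochs=100,
          callbacks=[early_stop], verbose=0)
print(model.evaluate(X, y, verbose=0))
```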

Compared to linear regression or logistic regression methods, this DNN solution has the conceptual advantage of first transforming the input features nonlinearly into a set of new features, i.e., it can learn new features automatically.  In this regard, the traditional ML methods are just special cases of DNN that use the input features under an identity transformation.  Therefore, we generally expect the performance of DNN to be better than, or at least as good as, traditional machine learning methods, as long as the DNN is appropriately constructed (i.e., can capture the true transformation function $\phi$).  This is confirmed by many published deep learning applications.

Let us look at a bioinformatics application next, where DNN is used to regress gene expression levels (next blog).


Reference


  1. Chollet. Deep Learning with Python, page 4.
  2. https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
  3. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50(5): 742-754.
  4. https://en.wikipedia.org/wiki/Position_weight_matrix
  5. http://neuralnetworksanddeeplearning.com/chap4.html
  6. Chollet. Deep Learning with Python, page 8.
  7. https://en.wikipedia.org/wiki/Go_(game)
  8. https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)
  9. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
  10. https://arxiv.org/pdf/1512.03385v1.pdf
  11. https://prateekvjoshi.com/2016/03/29/understanding-xavier-initialization-in-deep-neural-networks/
  12. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. Appendix D, page511-518.
  13. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. Page298-299.
  14. Ruder, https://arxiv.org/pdf/1609.04747
  15. http://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf