Monday, August 13, 2018

LSTM Networks for Predicting Subcellular Localization of Proteins

Bioinformatics Application of Deep Learning Technology

Resource


We will discuss the following paper: Sonderby et al., Convolutional LSTM Networks for Subcellular Localization of Proteins.  The first draft is available here, the published version of the paper with a much more sophisticated architecture and more detailed technical descriptions is available here.  As the two versions are rather different, we discuss both.

All figures shown here should be credited to the two papers, unless the source is explicitly provided.

Introduction


Proteins are products of a factory called ribosome, each protein has its own blueprint called message RNA (rooted from its gene structure).  They are bikes, cars, ships and airplanes in biological systems, therefore they need to be delivered into different biological compartments in order to carry out their functions.  A ship cannot sail on land, it needs to be moved onto the sea, and cells have a complex UPS delivery system, where it is able to read the "shipping barcode" carried on the protein sequences and ship the proteins to their destinations.  E.g., some proteins will stay within the cell, but in different organelles, some may sit on cell membrane, others may be shipped outside into extracellular space and serve as carriers or messengers.  For a nice introduction on how proteins are delivered to different destinations, please read this.  To predict the subcellular location of a protein is a very interesting topic in bioinformatics, as it can provide clues to the function of a protein.

Existing bioinformatics prediction tools often are not ab initio, which means they tend to use information other than the sequence of the protein.  E.g., the MultiLoc program has a MotifSearch component, which basically used knowledge of experimentally confirmed signal motifs; many "cheat" by using homology information [1] (it is not a true prediction, but rather a mix of prediction and annotation).  These tools will perform poorly, when the protein to be predicted has no close homologues in the training data set.  In addition, existing bioinformatic tools relying on traditional machine learning techniques, such as SVM, require a fixed length input, therefore, they are conceptually not the most suitable choice for the variable protein lengths and are very difficult to be further improved.  Sonderby et al. was able to use Deep Learning tools to build a DeepLoc system, that improved the prediction accuracy from 0.767 (MultiLoc) to 0.902 (Table 1 in [2]) in their draft version, when only sequence data was used [2].  In the later publication, they were able to further improve their solution to surpass other tools on a more realistic data set, 0.7797 for DeepLoc versus 0.5592 for MultiLoc2, when homology factor is largely eliminated [3].  Let us see how they did it.


Architecture


Encoding by CNN


We first need to represent an input protein sequence.  There are 20 types of amino acids, therefore, a protein sequence of $n$ amino acids is naturally a 20-element one-hot-encoded vector $20 \times n$.  Amino acids have different physicochemical properties, some are charge neural, some are positively or negatively charged, some are polar or hydrophobic, etc.  The one-hot encoding does not take advantage of this knowledge.  A popular approach is to use the BLOSUM matrix element as the scores for each amino acid.  In the first draft, the authors use a combination of four different representations: one-hot, BLOSUM80 matrix, HSDM matrix, and sequence profiles (each protein is blasted against protein database to fish out its homologs; a profile is then constructed based on the multi-sequence alignment.).  If you are not familiar with bioinformatics, just consider BLOSUM matrix only as a similarity matrix, where chemically-similar residues have a higher positive score and chemically-opposite residues have a negative score.  In the final publication, the authors actually only chose one of the four representation; let us pretend the BLOSUM matrix was the one chosen in Figure 1.

First, the BLOSUM representation is transformed to be 1000 in length (only 10% of proteins are longer than this size).  If a protein is shorter, it is padded by NULL amino acids in the middle, otherwise, the extra residues in the middle are truncated.  This is because it is known that the signal carrying the location information is often located at the two ends (N-terminal and C-terminal, i.e., left and right ends) of a protein.  This makes biological sense, as ends of proteins tend to be loose and are available to be in contact with other proteins to allow the signal to be utilized, while the middle part can be wrapped and embedded within a 3D structure and inaccessible.

Figure 1.  A CNN for encoding an input protein sequence. [3]
The $1000 \times 20$ input matrix is then fed to six independent CNN modules (Figure 1).  Take the left most CNN module as an example, it has 20 features, each uses a convolution filter spanning 21 positions (as the input has 20 channels, one per an amino acid channel), the feature map dimension is $21 \times 20 \times 20$.  The right most module has no convolution, simply passing the original input.  Therefore, this CNN system looks for motifs of length 1, 3, 5, 9, 15 and 21.  Why do we want to encode the input using motifs?  We think the delivery system relies on some general tags (motifs).  All proteins to be shipped extracelluar may share a few motif patterns, so when CNN is trained, such motif patterns are over-represented, as they appear more frequently in the protein sequence.  Then CNN is able to represent a protein sequence by saying there is a motif A here, and there is a motif B there; such representation certainly makes downstream prediction way easier.  All the $20 \times 6 = 120$ feature map outputs are combined into a $1000 \times 120$ motif representation for the input sequence.   Another layer of convolution is introduce to carry out one more covolution using filter size $3 \times 120$ and obtain 128 feature maps of size 1000 each.  So the end feature matrix is of dimension $1000 \times 128$.


Bidirectional RNN


The RNN is made of a LSTM cell of 256 hidden elements.  If we pipe the $1000 \times 128$ sequence through the LSTM unit from N-to-C direction, the N-terminal signal may still be hard to retain, when the scanning reaches the C-terminal end.  Despite LSTM has longer memory, at least it is not so easy to optimize during the training process.  In fact, most of the location barcodes for a protein are near either the N-terminal (left end) or the C-terminal (right end).  Therefore, another popular technique to retain more signals from both terminals is to use two LSTM cells and scan the protein in both directions: N-to-C and C-to-N, respectively.  When the LSTM reads in one input base, it produces an output $\mathbf{h}$, therefore, the two directional LSTM have two outputs at each time step, both are combined into a final output vector, i.e., $\mathbf{h}_{N2C,t}$ is concatenated with $\mathbf{h}_{C2N, t}$, and we have hidden state vector $\mathbf{h}_t$ of size 512 elements (shown as the blue and red bars in Figure 2).  As there are 1000 time steps, the feature now is $1000 \times 512$.

Figure 2. Bidirectional LSTM network is used to generate 1000 vectors of 512 hidden states. [2]

Attention Mechanism


Not all of the 1000 hidden states are equally informative.  Thanks to the dual scan, the hidden state for each time step captures the information for the complete sequence, therefore, we can certainly use $\mathbf{h}_1$ and $\mathbf{h}_{1000}$ as the new input and hook up a FNN to classify proteins into different classes, representing different subcellular locations.  This should work, however, may not be optimal (as shown in Figure 6 later).  The footprint of the destination motif is the strongest, when it is just seen, therefore if the location barcode of a protein locates at the 100th amino acid (N-term, left end), the hidden states around that time step would be the best for prediction.  So the idea is somehow we aim to come up a 1000-element weighting vector, where each of the 1000 states are weighted differently.  Ideally, the states closer to the useful tags have larger weights, we can then weighted average all 1000 hidden states into a final state vector, which should then lead to better predictions.  To figure out which time steps the latter classification system should pay attention to, the authors used an attention mechanism.  The attention mechanism described in their final publication [3] is rather complicated and hard to understand, so I choose here to discuss the mechanism presented in their draft version [2].

The attention component has a weight matrix $\mathbf{W}_a$ and an attention output vector $\mathbf{v}_a$.  We compute an attention scalar for time point $t$ by:

$$\begin{equation}a_t = \rm{tanh}(\mathbf{h}_t \mathbf{W}_a)\cdot \mathbf{v}_a^\intercal. \end{equation}$$

The way I interpret Equation 1 is there are some motifs that are very informative for location prediction, they are encoded by some elements of the hidden states.  Say we have $k$ important motifs, each is associated with a few elements in the hidden states, therefore $\mathbf{W}_a$ is a matrix of $k \times 512$ that extracts the presence data of each of the $k$ motifs from within the hidden state.  The output of $\rm{tanh}(\cdot)$ thus is a "probability vector" indicating whether $\mathbf{h}_t$ contains those $k$ motifs.  In the second step, considering not all such motifs are of equal importance, therefore, $\mathbf{a}_v$ is a vector coding the importance of each of the $k$ motifs.  The resultant scalar  $a_t$ is therefore the logits measuring the importance of each hidden state.  The weight for each hidden state can then be obtained by softmax:

$$\begin{equation} a_t = \frac{exp(a_t)}{\sum_{t^\prime=1}^T \; exp(a_{t^\prime})}. \end{equation}$$

The final context vector representing the protein is a weighted sum:
$$\begin{equation} \mathbf{c} = \sum_{t^=1}^T \; \mathbf{h}_t a_t. \end{equation}$$

This then serves as the input to an FNN to classify proteins into 11 classes.  The authors used a single layer FNN in their draft version; they also introduce a  hierarchical tree Bayesian style predictor we will describe later in the discussion.


Results


Each protein is encoded by its context vector $\mathbf{c}$, Figure 3 is the t-SNE 2-dimensional visualization of all proteins predicted using two separate data sets.  It is clear that the encoding (basically the transformed feature space) is very good at separating proteins of different classes, therefore, the above system is quite capable of extracting the location-rich features, and the downstream classification system does not have to be too sophisticated.  $\mathbf{c}$ is very interesting, they encode the variable length proteins into a fixed length vector.  Unfortunately, they are trained based on subcelluar location data, they cannot be used as general-purpose protein vectors.

Figure 3. t-SNE representation of the context vector, colored by different protein classes [3].  For the data set on the right, there is quite some degree of homology between training and test sets, therefore, data points are better separated.  For the data set the authors constructed on the left, there is less homology between training and test sets, so it is a more realistic reflection of the prediction quality on new proteins, which therefore is expected to be less well separated on t-SNE map.

Figure 4 plots the attention vector, where we expect the locations catching the attention tend to be where location tags resides.  It is clear that extracellular proteins, plastid proteins, mitochondrion proteins have their signals at the very end of the N-terminal.  Transmembrane preteins have their signals scatter across the whole range, where there are dark fragments probably indicating transmembrane regions (where those are hydrophobic amino acides, which are probably extracted by the CNN).  These transmembrane tags in the middle part of the protein are not necessary our shipping barcodes, they are probably just signature motifs for transmembrane protein, therefore, highly correlated with the true shipping barcode.  Such patterns are useful for prediction, but are not necessarily the reason for cells to sort proteins.  For most proteins, signals are mostly distributed around the N-terminal, and sometimes towards the C-terminal.  The attention data makes sense.

Figure 4. Attention vectors for proteins of different classes [3].

Figure 5 shows the overall performance data of the describe system - DeepLoc, compared with other published methods using the DeepLoc data set.  The DeepLoc approach is noticeable better.

Figure 5. Performance comparison of DeepLoc (this study) to other methods. [3]

Discussion

Comparing the first draft (unpublished version) [2] and the final published version [3], we can see the architecture of the network were changed quite a lot, it became way more sophisticated.   The authors initially easily obtained better performance than MultiLoc using the MultiLoc dataset - Hoglund dataset [1].   However, the Hoglund dataset over estimates classification accuracy, because its test set contains many proteins homologous to the proteins in the training set, therefore, makes the  prediction easier.  The authors later decided to create a new dataset, where they first clustered proteins, and then divided proteins into different cross-validation partitions cluster by cluster. This way, the tools cannot "cheat" by homology information and the true prediction performance is measured.   I suspect this probably led to some poor performance of their initial architecture, so they started adding lots of bells and whistles.  However, based on the performance data, I am not fully convinced the complexities introduced into the final version is all that essential.  It feels to me more like the authors wanted to describe all these add-ons, since they had already tried those  ideas and did not want to waste them by not publishing.  I will explain why I feel this way.

Input Representation


The original draft represents a protein by a $1000 \times 80$ matrix, where they combined four different amino acid representations.  That sounds intuitive.  In the final version, they first identified the optimal hyperparameter set by scanning one hyperparameter at a time, then fixed the optimal hyperparameter and tested each of the four representations alone.  The profile representation was found to be the best and was chosen for their final performance comparison.  However, the performance data published in Figure 6 was not based on profiles, but based on BLOSUM instead (interesting, why do that?).

Based on this table, the bidirectional LSTM system indeed significantly outperformed FFN (Feed Forward Network) (0.69 versus 0.52).  This was further improved by introducing the attention mechanism (0.73, a 5% improvement).  Considering attention mechanism offers additional benefit of visualizing where motifs are, this is a worthwhile addition.  However, adding convolution components did not make any difference.  One could simply use the A-BLSTM system, which would be conceptually a much cleaner design I would personally favor.   I suspect the similar table for profile-based performance might show a marginal improvement of CONV A-BLSTM over A-BLSTM, however, the difference should not be too much.  This is one reason I feel the architecture in the final publication seems overdone and runs the risk of overfitting.  However, since the authors did cross-validation correctly (they emphasized they did not peek into test set during hyperparameter and parameter optimization), we cannot point finger at what might be wrong with their bells and whistles.  The take home message here is the A-BLSTM system is the core piece of DeepLoc, we do not have to pay much attention to the CNN part.

Figure 6. FFN: feedforward neural network, BLSTM: bidirectional LSTM, A-BLSTM: BLSTM with attention mechanism, CONV A-BLSTM: A-BLSTM with convolution component added. [3]

The attention mechanism in their final version is rather complicated, using two RNN systems and a 10-depth iterations to obtain the final weight vector for all time points.  I have a hard time understand the rational behind (not really discussed in the paper) after quite a few reads, therefore, I choose to only discuss the attention mechanism they used in their original draft.


Classification


In the draft version, each protein is classified into one of 11 classes.  Although it was not described, but we can comfortably assume softmax classification is used.  In the final version, two classes lysosome and vacuole were combined into one class, probably because the individual accuracies associated with them were too low.  In that version, besides softmax, the authors also introduced a Bayesian model to mimic how a protein package is sorted by cell (see link).  As shown in Figure 7, in order for a protein to be predicted as an extracelluar protein, it first need to be predicted as a secretory protein at the tree root, then extracellular, finally an extracellular protein.  Each white node is associated with a logit output of the classification network, however, these nine white nodes are not softmax'ed as a whole, as their probability should not summed to one.  Instead, each of the white node represents a conditional probability, the logit output of each is directly cast into a probability by a sigmoid function, $\sigma = \frac{e^x}{1+e^x}$.  This is because the probability of the two branches given the parent node should sum to one, for example, secretory protein and non-secretory protein sum to one.  The probability of intracellular and extracellular protein, given the protein is a secretory one, again sum to one.  To calculate the probability associated with each class, i.e., the probability associated with the black nodes in Figure 7, we use the Bayesian graph model.  For instance:

$$\begin{equation}p(Golgi|X) = p(Golgi|ER/Golgi)p(ER/Golgi|Intracellular) \\ p(Intracelluar|Secretory Pathway)p(Secretory Pathway|X).\end{equation}$$

Notice that the secretory pathway does not mean the protein will be extracellular.  Proteins eventually bound to endomembrane systems will also be first sorted into secretory pathway (entering rough ER) while the protein is being translated, therefore, for secretory-pathway proteins, there must be an N-terminal signal.  However, this tree-based classification model did not produce better results compared to the simple softmax model.  Then why did the authors describe them in the paper, I can only say they did not waste any material they used during their study, when they wrote the paper.  So it is important for us to remain focus on the core architecture of DeepLoc, not be distracted by the many details.

Figure 7. The prediction tree. [3]

Datasets


The paper pointed out an important mistake that is often made by machine learning applications, that is whether the random partitioning the data into train-test sets realistically reflect the final goal of the machine learning task.  If our goal is to predict proteins, where they are often homologous to our training proteins, the traditional random partition of the dataset works fine.  However, if our goal is more ambitious, i.e., we aim to predict locations of novel proteins, for those that are not homologous to our training set, we would need to make sure there is homology between our training set and test set.

In Figure 8, DeepLoc is a dataset authors compiled without much homology between training and test sets (as proteins were clustered and then divided by cluster), Hoglund is a data set used by MultiLoc that share homologs, as random partition is used.  Therefore, using Hoglund as the training set, we see a high accuracy of 0.9138 in the Hoglund test set (4th row), however, this does not reflect the true performance on novel proteins.  The true performance is 0.6426 (2nd row), much lower.  On the contrary, if it was trained based on DeepLoc training set, its performance is 0.7511 on DeepLoc test set (first row), which is a more realistic measure, as it actually performs 0.8301 (3rd row), generalization is better than expectation.

Figure 8. Performance of using different training and test data sets. [3]



Conclusion

The authors started from amino acid sequence information to predict subcellular locations of proteins without experimental meta data.  They focus on using RNN to construct a good fixed-length feature representation for input sequences.  The amino acid sequences were first analyzed through a CNN system, where sequences were represented in the motif space.  Then these new inputs were processed by an LSTM system from both N-term to C-term direction and backwards to avoid signal lose.  The authors also introduced an attention mechanism to further enhance the possible tag signals by weighted averaging the 1000 hidden states into a final fixed-length context feature vector.  They were able to show the context feature vector was very efficient in encoding protein location information (Figure 3) and the attention vector indeed seem to highlight where the location tags were (Figure 4).  The take-home message is the LSTM (RNN) architecture can be used to effective encode variable length inputs.


Reference


1. MultiLoc: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btl002
2. https://arxiv.org/abs/1503.01919 (Table 1)
3. DeepLoc: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx431 (Table 7)

Tuesday, May 22, 2018

A Deep Learning Black Box Code

I mentioned in a previous blog that for the basic machine learning tasks such as classification or regression, we could replace the traditional machine learning tools, such as SVM or random forest, by a fully-connected deep learning solution.  If we do such tasks often, we should be able to write a black-box solution based on a few simplifications: (1) all layers have the same number of neurons; (2) dropout is applied only to the last layer; (3) We use ReLu.  I have not really work on this, but below is an outlier of the Keras code, which automatically explores the optimal NN solution for the MNIST dataset.


from keras.datasets import mnist
from keras import models, layers
from keras.utils import to_categorical
import numpy as np

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape(-1, 28*28).astype('float32')/255
validate_images = train_images[-10000:]
train_images = train_images[:-10000]
test_images = test_images.reshape(-1, 28*28).astype('float32')/255
train_labels = to_categorical(train_labels)
validate_labels = train_labels[-10000:]
train_labels = train_labels[:-10000]
test_labels = to_categorical(test_labels)

def model(input_shape, n_classes, n_layer=3, nodes_in_layer=64, dropout=0.5):
    net = models.Sequential()
    net.add(layers.Dense(nodes_in_layer, input_shape=input_shape, activation='relu'))
    for i in range(n_layer-1):
        net.add(layers.Dense(nodes_in_layer, activation='relu'))
    if dropout>0:
            net.add(layers.Dropout(dropout))
    net.add(layers.Dense(n_classes, activation='softmax'))
    return net

input_shape=(28*28,)

n_classes=10
n_layers=[1, 2, 3, 4]
nodes=[16, 32, 64, 128]
dropouts=[0, 0.2, 0.5]

n_search=10
best_acc=0.0
for i in range(n_search):
    n_layer = n_layers[np.random.randint(4)]
    nodes_in_layer = nodes[np.random.randint(4)]
    dropout = dropouts[np.random.randint(3)]
    print("INFO> Test new model: n_layer=%d, nodes=%d, dropout=%.2f" % (n_layer, nodes_in_layer, dropout))
    net = model(input_shape, n_classes, n_layer, nodes_in_layer, dropout)
    net.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    history = net.fit(train_images, train_labels, epochs=5, batch_size=64, validation_data=(validate_images, validate_labels))
    loss,acc = net.evaluate(test_images, test_labels)
    if acc > best_acc:
        print('INFO> Better model found, accuracy=', acc)
        best_acc=acc
        net.save('my_best_model.h5')
    else:
        print('INFO> Keep existing model, new model has worse accuracy=', acc)

An example run of the above script gives the following output, where the optimal model is highlighted:


INFO> Test new model: n_layer=1, nodes=16, dropout=0.50
INFO> Better model found, accuracy= 0.9193
INFO> Test new model: n_layer=3, nodes=32, dropout=0.50
INFO> Better model found, accuracy= 0.9577
INFO> Test new model: n_layer=4, nodes=64, dropout=0.50
INFO> Better model found, accuracy= 0.9707
INFO> Test new model: n_layer=2, nodes=64, dropout=0.50
INFO> Keep existing model, new model has worse accuracy= 0.9674
INFO> Test new model: n_layer=3, nodes=64, dropout=0.50
INFO> Better model found, accuracy= 0.9729
INFO> Test new model: n_layer=1, nodes=64, dropout=0.00
INFO> Keep existing model, new model has worse accuracy= 0.9674
INFO> Test new model: n_layer=2, nodes=32, dropout=0.00
INFO> Keep existing model, new model has worse accuracy= 0.9597
INFO> Test new model: n_layer=2, nodes=16, dropout=0.20
INFO> Keep existing model, new model has worse accuracy= 0.9383
INFO> Test new model: n_layer=4, nodes=64, dropout=0.20
INFO> Keep existing model, new model has worse accuracy= 0.9672
INFO> Test new model: n_layer=2, nodes=32, dropout=0.00
INFO> Keep existing model, new model has worse accuracy= 0.9583

Wednesday, March 7, 2018

Recurrent Neural Network

Introduction


For normal FNN, the elements in the input vector $\mathbf{x}$ has no intrinsic order.  Therefore, if one uses FNN to solve a computer vision problems, it will not be able to take advantage of the texture patterns formed by neighboring pixels, as the "neighboring" relationship is lost.  CNN instead takes advantage of the spatial relationship of a neighborhood and use that to form features, therefore, it is superior for computer vision problems.  Similarly, when the input is a sequence data, which could be a temporal time series or positional data such as a text stream, we do not want to scramble the order of the data points, instead should try to recognize sequence patterns, which requires a different architecture than FNN.  You may argue why not use CNN to recognize the sequence pattern.  CNN has two limitations, first, CNN can only look at patterns of a fixed size (determined by the size of its filters), but sometimes we are interested in long-range patterns.  For instance, to predict weather, no longer we have to look for patterns within the past few days, but few weeks, few months, even few years, such long-range pattern cannot be captured well by local CNN filters.  Second, CNN patterns are translational invariant, which means they are not context sensitive.  Patterns embedded in sequence data are often interpreted differently, depending on its context.  For example, the word "Chinese" within a document could mean either Chinese people or Chinese language, its exact meaning depends on the context.  CNN does not capture context.

We so far have been dealing with inputs of fixed size.  This is true for all the traditional machine learning methods we discussed previously (such as SVM, random forest), as well as for CNN.  Therefore to classify ImageNet pictures, we need to make sure they are all of the same dimensions.  In reality, many inputs are of variable length. For example, to transcribe voice, the voice wave form lasts differently depending on what the contents are and the rate of speech.  When XBox Kinect recognizes a kick action, the whole kick video sequence comes at varying number of frames, as there are quick kicks and slow kicks.  To read an email and then classify it into spam or ham, the length of the emails vary.  Traditionally for the XBox Kinect problem, we apply Bayesian graph theory to construct generative models to explain the video sequence, which is very complicated and heavily rely on oversimplified assumptions.  Therefore, we need a new NN architecture to handle variable input length, as well as mechanisms to discover patterns of varying time spans.

When input elements are fed into our machine one piece at a time, the machine must have memory mechanism.  If the machine makes a decision simply based on the current input token, word "Chinese" will always be associated with either a language or people, preventing the true meaning from being assigned based on its context.  In translation, the output sequence should also be variable.  The new architecture invented to overcome the challenge of variable-length input/output sequence is called recurrent neural network (RNN).  In this blog, we try to use simple examples to convince you RNN theoretically is capable of capturing patterns in a sequence of arbitrary length.  RNN has applications in language translation, video classification, photo captioning, trend prediction (arguably stock price prediction, not sure if that can work in theory.  See Table 4 and conclusion in [1]), etc.


Architecture


The basic unit of RNN is a cell as depicted in Figure 1.  A cell has an internal state, which is a vector $\mathbf{h}$, so a cell can store multiple internal elements.  At time $t$, it reads in the input $\mathbf{x}_{t}$, mixed with its current hidden status $\mathbf{h}_{t-1}$ to transition into a new state $\mathbf{h}_{t}$; $\mathbf{h}_{t}$ can then be used to produce an output vector $\mathbf{y}_{t}$.  Imagine $\mathbf{h}_{0}$ is initialized as a zero vector, the cell is able to read in input $\mathbf{x}$, one at a time, and produces a sequence of hidden states $[\mathbf{h}_1,\mathbf{h}_2, \ldots, \mathbf{h}_n]$, for our purpose, we can assume $\mathbf{y} = \mathbf{h}$ for now.

Figure 1. An RNN cell rolled out in time sequence [2].
The update math can be described as the following:

$$\begin{equation}
\mathbf{h}_{t} = \phi(\mathbf{h}_{t-1}\mathbf{W}_{h}+ \mathbf{x}_{t}\mathbf{W}_x + \mathbf{b}),
\end{equation}$$

where $\mathbf{W}_x$, $\mathbf{W}_h$ and $\mathbf{b}$ are cell parameters to be determined by optimization.  $\mathbf{h}_{t}$ is the hidden state of the cell, i.e., it is the impression of all the data the RNN cell has viewed so far: $[\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{t}]$.  After we watched a movie, the snapshot of our brain is $\mathbf{h}_t$, which contains all the effects of the whole movie.  This vector is then used as the input to classify the movie into a good one or a bad one.

Now let us try to understand why this simple recurrent architecture works.  I have not read about any "proofs" as elegant as what we did for FNN in a previous blog, where it is rather convincing to see why FNN can model any arbitrary function.  For RNN, the best I can do is to present some special cases to convince you RNN is indeed quite capable.

Memory


RNN cells can memorize data.  Imagine the input is the time series of daily stock price, and output $y$ is the most-recent three day average, or any prediction of tomorrow's price based on the past three days.  In the prediction at time $t$, we only receive a one-day input $\mathbf{x}_{t}$, thus there needs to be a way for CNN to retain $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t-2}$, i.e., it should have short-term memory capability in order to use three pieces of data to generate the output $y_t$.

Let us look at the special case, where stock price $x_t$ is a scalar:
$$\begin{equation} \mathbf{h}_0 = {[ 0, \; 0, \; 0]}^\intercal, \\ \mathbf{h}_t = \phi(\mathbf{h}_{t-1}\mathbf{W}_{h}+x_{t}\mathbf{W}_x +\mathbf{b}), \\ \phi(x) = x, \\ \mathbf{W}_h = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \\ \mathbf{W}_x = {[ 0, \; 0, \; 1 ]}^\intercal, \\ \mathbf{b} = {[ 0, \; 0, \; 0 ]}^\intercal. \\ \end{equation}$$
Our hidden state contains three elements, reserved to store the stock prices for the past three days.  We follow the time sequence, the following $\mathbf{h}$ sequence can be obtained based on Equation 2:

$$\begin{equation} \mathbf{h} = \left\{ \mathbf{h}_0=\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \mathbf{h}_1=\begin{bmatrix} 0 \\ 0 \\ x_1 \end{bmatrix}, \mathbf{h}_2=\begin{bmatrix} 0 \\ x_1 \\ x_2 \end{bmatrix}, \mathbf{h}_3=\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \\ \mathbf{h}_4=\begin{bmatrix} x_2 \\ x_3 \\ x_4 \end{bmatrix}, \ldots, \mathbf{h}_t=\begin{bmatrix} x_{t-2} \\ x_{t-1} \\ x_t \end{bmatrix}, \ldots \right\}. \end{equation}$$
This way RNN can retain a short-term memory.  Although at each time point, we could use all stock prices for the past three days as the feature vector input in real applications, we here choose a scalar input form to illustrate the memory mechanism.

Running Average and Derivative


Let us consider another special case, where our hidden state has three elements, reserved to store the running average $h_{\Sigma}$, running derivative $h_{\nabla}$, and memorize the previous input $h_{mem}$, respectively:

$$\begin{equation} \mathbf{h}= {[ h_{\Sigma}, \; h_{\nabla}, \; h_{mem} ]}^\intercal, \\
\mathbf{h}_0 = {[ 0, \; 0, \; 0]}^\intercal, \\ \mathbf{h}_t = \phi(\mathbf{h}_{t-1}\mathbf{W}_{h}+x_{t}\mathbf{W}_x +\mathbf{b}), \\ \phi(x) = x, \\ \mathbf{W}_h = \begin{bmatrix} \beta_1 & 0 & 0 \\ 0 & \beta_2 & -(1-\beta_2) \\ 0 & 0 & 0 \end{bmatrix}, \\ \mathbf{W}_x = {[ 1-\beta_1, \; 1-\beta_2, \; 1 ]}^\intercal, \\ \mathbf{b} = {[ 0, \; 0, \; 0 ]}^\intercal. \\ \end{equation}$$
The recurrent equation can be simplified into:
$$\begin{equation}
\begin{bmatrix} h_{\Sigma,t} \\ h_{\nabla,t} \\ h_{mem,t} \end{bmatrix} =
\begin{bmatrix} \beta_1 h_{\Sigma,t-1} + (1-\beta_1)x_t \\
\beta_2 h_{\nabla,t-1} + (1-\beta_2) (x_t - x_{t-1}) \\ x_t \end{bmatrix}.
\end{equation}$$

We follow the time sequence, we obtain the following $\mathbf{h}$ sequence:

$$\begin{equation} \mathbf{h} = \left\{ \mathbf{h}_0=\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \mathbf{h}_1=\begin{bmatrix} (1-\beta_1)x_1 \\ (1-\beta_2)x_1 \\ x_1 \end{bmatrix}, \\
\mathbf{h}_2=\begin{bmatrix} (1-\beta_1)\left(\beta_1  x_1+  x_2 \right) \\ (1-\beta_2)\left( \beta_2(x_1-0)+ (x_2-x_1)\right) \\ x_2 \end{bmatrix}, \ldots, \\
\mathbf{h}_t=\begin{bmatrix} (1-\beta_1)\left(\beta_1^{t-1}x_1+\beta_1^{t-2}x_2+\cdots + x_t)\right) \\  (1-\beta_2)\left(\beta_2^{t-1}(x_1-0)+\beta_2^{t-2}(x_2-x_1)+ \cdots + (x_t-x_{t-1})\right)  \\ x_t \end{bmatrix}\right\}. \end{equation}$$
Now we can see how the three elements in our hidden state work.  $h_{mem}$ simply memorizes the current input, so that $x_{t-1}$ is available to be used with $x_t$ at time point $t$.  $h_{\Sigma}$ is for calculating a running average, which is similar to the momentum optimization.  The current average is first decayed by a factor $\beta_1$, the new input is weighted by $(1-\beta_1)$ and used to update the running average.  $h_{\nabla}$ is for calculating a running first-order derivative.  The current derivative is first decayed by a factor $\beta_2$, the new derivative estimated based on $(x_{t}-x_{t-1})$ is weighted by $(1-\beta_2)$ and used to update the derivate memory.   We here assume $\beta_1$ and $\beta_2$ are decay rate within $[0, 1]$.  When $\beta$ approaches one, we see data far into the past, the new input has little weight to change the average or derivate too much.  When $\beta$ approaches one, we only see the effect of very recent inputs.  Imagine we create many RNN cells with different $beta$ values, we can have a range of cells that capture the temporal average or derivative of varying time windows.  If you stack such memory cells, say one $h_{\nabla}^{(2)}$ on top of the output of another $h_{\nabla}^{(1)}$, we can get higher-order derivatives with a multi-layer RNN architecture (Deep RNN).

Above thought experiments are far from any proof, they nevertheless offer us some confidence that with enough cells and enough layers, an RNN is able to capture many temporal patterns.  In fact, it was believed that RNN can be used to simulate arbitrary computer program [3]:

Thus RNNs can be thought of as essentially describing computer programs.  In fact, it has been shown that RNNs are turing complete (for more information refer to the article: On the Computational Power of Neural Nets, by H. T. Siegelmann and E. D. Sontag, proceeding of the fifth annual workshop on computational learning theory, ACM, 1992.) in the sense that given the proper weights, they can simulate arbitrary programs.

When the activation of its hidden state can encode different temporal patterns (just as feature neuron in CNN), they can certainly be used as powerful features to for downstream predictions.  If our thought experiment is representative, the memory exercise tells us we would need an RNN cell to contain at least $m$ hidden elements, if we want to predict the future based on the past $m$ time points.  Therefore, RNN cells should have reasonable complexity to be useful.  The second exercise tells us activities captured by a cell contains an exponential term on $\beta^t$.  During optimization, our $\beta$ may become greater than one, which will result in an explosion and make optimization impossible, as early steps make infinite contribution.  When $\beta$ becomes too small, the early steps have virtually no contribution, this is called vanishing gradient problem, which prevents RNN from looking far into the past.  Therefore this RNN architecture is actually very hard to train.   Also, it is inefficient if we would need large number of memory elements, when there is a need to use stock price in the past few years to predict tomorrow's price.  In practice this classical RNN architecture is now called SimpleRNN, the architecture we discussed above has been replaced by more efficient alternatives, such as LSTM (long short term memory) to be introduced later.


RNN Applications


RNN has a few variants that make it suitable to handle a number of very interesting use cases.

Case A, each input produces an output.  Say stock price prediction, each new input $x_t$ of today's price will enable us to combine it with the data it has seen so far (coded in $\mathbf{h}_{t-1}$) to predict tomorrow's price.  We can conveniently ignore the first few outputs as they were based on very few inputs, and start using the prediction only after it has seen enough historical data points.

Case B, all outputs are ignored, only the last output is used.  An RNN reads a movie review from beginning to the end, its last hidden state $\mathbf{h}_{last}$ encodes content for the whole review, therefore, this can be used as the feature representation of the review and be passed onto a FNN to classify the review into positive/negative or to be regressed into movie ratings.

Case C, only one input and the RNN generates a sequence of output.  Give the RNN a photo $\mathbf{x}_0$, RNN represents the photo into a hidden state, then this hidden state can recurrently output a sequence of words to be used as the capture.  This sounds magic, but it actually has been shown to work reasonably well, please watch the nice TED talk by Feifei Li [4], if you have not.

Case D, RNN takes an input sequence, then it outputs a new sequence.  This is what behind language translocation, where an English sentence is sent to an RNN, its hidden state at the end encodes the meaning of the sentence, it then spits out a new sentence in Chinese.  Case D is actually the combination of Case B and Case C.

Figure 2. Four different RNN use cases [5].


LSTM


We mentioned above that the SimpleRNN architecture has many practical issues, especially it cannot accommodate too many time steps.  When there are many time steps, the network suffers from either vanishing or exploding gradient problem, so RNN cannot have longer term memory.  LSTM is a technique that addresses these challenges by enabling longer short-term memories, so it is called Long Short-Term Memory cells.

Figure 3. Anatomy of an LSTM cell [6].
Figure 4. The update rules of LSTM [7].
LSTM cell is still quite a magic to me, so my confidence is not very high for what I write next.

The example I am using is inspired by a tool called LSTMVis [8].  Let us consider an LSTM cell to parse the following string, where numeric tokens are associated with different depth defined by parentheses (highlighted in three colors, one per depth), and the ultimate goal of the LSTM is to sum up all numbers associated with depth two and beyond (blue and green numbers).

( 4.5 ( 2.5 ( 3.5 5.5 ) 3.2 ( 1.0 ) ) 2.86.8 9.0 ) )

The encounter of each left parenthesis increases the depth by one and a right parenthesis decreases the depth by one.  Our cell vector $\mathbf{c}$ naturally contains two elements, one for counting the depth and another for storing the sum, i.e., ${[ c_{depth}, c_{sum} ]}^\intercal$.  Our input vector $\mathbf{x}$ has four elements: ${[ x_l, x_r, x_n, x_v ]}^\intercal$, where $x_1$, $x_r$ and $x_n$ are one-hot encoder indicating the input is a left parenthesis, a right parenthesis, or a number, respectively.  When it is a number, $x_v$ stores the numerical input value.  Therefore string "(2.5 4.2)" is represented as four inputs:

$$\begin{equation} \begin{bmatrix}1\\0\\0\\0\end{bmatrix},  \begin{bmatrix}0\\0\\1\\2.5\end{bmatrix}, \begin{bmatrix}0\\0\\1\\4.2\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}. \end{equation}$$

Here we only look at the special case where output gate is always open, i.e., $\mathbf{o} = {[1, 1]}^\intercal$, and output state $\mathbf{h} = \mathbf{c}$.

Let us first consider the case where there is no feedback loop, i.e., $\mathbf{h}_{t-1}$ is not linked to the input at time $t$.  The input gate is determined by:
$$\begin{equation} \mathbf{i} = \rm{sign} \left(\begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\cdot \mathbf{x}\right). \end{equation}$$
Apply Equation 8 to Equation 7, we obtain a sequence of input gates:

$$\begin{equation} \begin{bmatrix}1\\0\end{bmatrix},  \begin{bmatrix}0\\1\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}, \begin{bmatrix}1\\0\end{bmatrix}. \end{equation}$$

The input gate is ${[1, 0]}^\intercal$, when it sees either "(" or ")", therefore, the input gate is open for $c_{depth}$ in order for it to be updated and the input gate for $c_{sum}$ is closed as no update on that element is needed.  We here replace $\rm{tahn}(\cdot)$ by a simple $\rm{sign}(\cdot)$ function :
$$\begin{equation} \rm{sign}(x) = \begin{cases} 1, \; \rm{if} \; x \gt 0 \\ 0,  \; \rm{if} \;x \le 0 \end{cases} \end{equation}$$
On the other hand, when it seems a number, the input gate will be ${[0, 1]}^\intercal$, which prepares $c_{sum}$ for an update and protects $c_{depth}$ from being modified.

The forget gate can be kept open all the time, i.e., ${[1, 1]}^\intercal$ for our simply example.

The new data is calculated by:
$$\begin{equation} \mathbf{g} =\rm{sign} \left( \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\cdot \mathbf{x} \right). \end{equation}$$
Apply Equation 11 to Equation 7, we get the sequence:
$$\begin{equation} \begin{bmatrix}1\\0\end{bmatrix},  \begin{bmatrix}0\\2.5\end{bmatrix}, \begin{bmatrix}0\\4.2\end{bmatrix}, \begin{bmatrix}-1\\0\end{bmatrix}. \end{equation}$$

This means $\mathbf{g}$ is ${[1, 0]}^\intercal$ for "(", ${[-1, 0]}^\intercal$ for ")", and ${[0, x_v]}^\intercal$ for number inputs.

Therefore, the final update rules for $\mathbf{c}$ is the following (recall that forget gate simply just pass $\mathbf{c}_{t-1}$ through according to Figure 4:

$$\begin{equation} \begin{aligned}
c_{depth,t} &= \begin{cases} c_{depth,t-1} + 1, \rm{if \; input \; is \; (} \\
c_{depth,t-1} - 1, \rm{if \; input \; is \; )} \\
c_{depth,t-1}, \rm{if \; input \; is \; numeric}\end{cases}, \\[2ex]
c_{sum, t} &= \begin{cases} c_{sum, t-1}, \rm{if \; input \; is \; ( \; or )} \\
c_{sum, t-1}+x_v, \rm{if \; input \; is \; numeric} \end{cases}. \end{aligned} \end{equation}$$

So far, we are summing up all numbers read from the input, regardless of their depths, but our ultimate goal is to only sum the numbers when its depth is greater than one.  Since now the input gate for $c_{sum}$ should only be opened, when $c_{depth} \gt 1$, we link our output $\mathbf{h}$ back to the next time step to influence our input gate calculation.  Our new input gate calculation formula is updated from Equation 8 into the following:

$$\begin{equation} \mathbf{i} = \rm{sign} \left(\begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\cdot \mathbf{x} + \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}\begin{bmatrix}c_{depth} \\ c_{sum}\end{bmatrix} - \begin{bmatrix} 0 \\ 2\end{bmatrix} \right). \end{equation}$$

The input gate for $c_{depth}$ behaves the same as before, as the two added terms have no effect on the top output element.  When the input is a number, the input gate for $c_{sum}$ is controlled by the value of $c_{depth}-1$, i.e., it only opens for update when $c_{depth} \gt 1$, exactly what we want.  The linkage of output to the next input brings the context (memory), so that the same input $\mathbf{x}_t$ produces different gate values and our cell is updated in a context-sensitive manner.

With the capability of longer-term memory, LSTM network can be used to deal with very long sequences.  Why LSTM can avoid the vanishing/exploding gradient problem that troubles the SimpleRNN?  A good explanation is offered by a blog here [9].  The main nested product involved in back-propagation is $\frac{\nabla c_{t}}{\nabla c_{t-1}}$.  Based on Figure 4, the addition term containing $\mathbf{i}$ and $\mathbf{g}$ has no contribution, the derivative is basically $\mathbf{f}$ - the forget gate.  $\mathbf{f}$ value is never greater than one, therefore, there is no exploding problem.  When memory needs to be retained, $\mathbf{f}$ values should be close to one, so the vanishing gradient problem is also diminished.  The gradient is terminated when old memory needs to be wiped out ($\mathbf{f}$ is set to zero).

Through the few virtual experiments discussed here, hopefully we now have more faith in the recurrent network.  As LSTM is so successful, when people say they use RNN, they actually refer to LSTM by default.  The traditional RNN is called SimpleRNN instead.  Another RNN variant call Gated Recurrent Unit (GRU) has shown to be simpler than LSTM and perform equally well, but we do not dig into that for now.  For readers speaking Chinese, a nice video on RNN is available here [10].

I am very excited by the RNN architecture, because it seems to explain how we dreams.  In the video here (20:00 - 31:00), it was shown that we can feed an RNN with Shakespeare, with C-code, or with mathematical articles and the RNN seems to be able to spit out shakespeare, C-code, and math articles.  The RNN architecture is outline in Figure 5.  Where one input character triggers the output of the next character, then the next character, etc.  Starting with a character, the RNN generates a sequence of characters.  If we feed the input with Shakespeare, the output generated by the RNN is then compared to the desirable output and the cross-entropy loss function can be used to train the weights in the RNN.  With enough iterations, the network starts to generate text that share the many syntax features on the training sequence.

Figure 5. The input is one character at a time. The character is one-hot encoded.  The hidden state is updated accordingly and then used to generate logits for the output characters (softmaxed into probabilities).  One output character spits out as the result.  This is a sequence-to-sequence application depicted in Figure 2.A. (credit: [11])
Figure 6 & 7 shows some sample outputs where RNN generate Shakespeare and Linux C code.
Figure 6.  With sufficient training samples, the RNN can generate something syntactically looks like Shakespeare. 
Figure 7.  With sufficient training samples, the RNN can generate something syntactically looks like C programming code. (Credit [11])

The generated output seems sensible at their local structures, but non-sensible at the longer range.  This conveniently explains how we dream, fragments of dreams seem quite logical and movie-like, however, these fragments are loosely connected, they do not serve the purpose of one theme.  When the fragments accidentally make some sense, you are extremely impressed by the power of dreaming.  It is like how people chat, moving from one topic to the next, logical within each transition, but accomplishing nothing as the whole.

When the elements within the hidden states were examined, some elements appear to represent interesting sequence pattern, similar to the depth and summation elements in our previous example.  Figure 8 shows a few examples of such neurons, this reminds us that RNN is able to train meaningful neurons just like CNN neurons can represent complex patterns, such as faces or texts.

Figure 8. Three example elements in the hidden state, where each captures a distinct sequence pattern.  A. A neuron that is activated when it encounters a left quote, then deactivates when right quote is encountered, very similar to our previous $C_depth$ neuron. B. A neuron appears to count the number of characters it reads and gets reset when line break is encountered.  C. A neuron that counts the indention depth of the C code.  (Credit: [11])


Reference



  1. http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8911069&fileOId=8911070
  2. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. p373.
  3. Antonio Gulli, Deep Learning with Keras: Implementing deep learning models and neural networks with the power of Python, page 214.
  4. https://www.ted.com/talks/fei_fei_li_how_we_re_teaching_computers_to_understand_pictures
  5. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. p385.
  6. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. p403.
  7. Aurelien Geron, Hands-on machine learning with scikit-learn & tensorflow. p405.
  8. http://lstm.seas.harvard.edu/
  9. http://harinisuresh.com/2016/10/09/lstms/
  10. https://www.youtube.com/watch?v=xCGidAeyS4M
  11. http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf

Wednesday, February 28, 2018

Motif Discovery for DNA- and RNA-binding Proteins by Deep Learning

Bioinformatics Application of Deep Learning Technology


Resource


We will discuss the following paper: Alipanahi et al., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.  The paper is available here, the supplementary method is available here (subscription required).

All figures shown here should be credited to the paper, unless the source is explicitly provided.

Introduction


Genes are encoded by DNA sequences.  Whether a gene is expressed and how much it is expressed can be regulated by proteins called transcription factors (TF).  A TF protein can bind to certain DNA motifs, then recruit other factors to initiate the transcription of a piece of DNA sequence into an mRNA sequence.  In eukaryotic cells, mRNA consists of multiple exons separated by introns.  Introns are spliced out and exons are concatenated into a matured mRNA, which will then be translated into a protein.  RNA-binding proteins can regulate the splicing process and lead to the inclusion or exclusion of certain exons, therefore, producing different splicing isoforms, eventually different protein products of different functions.  Mutations in the genomic sequence can perturb the binding of proteins, then lead to undesirable under/over transcription of genes, or undesirable splicing forms, which can induce disease.  Therefore, it is of interest to understand what nucleotide pattern (motifs) a DNA/RNA-binding protein can recognize and how mutations impact such molecular recognition.

Let us describe two popular experimental approaches for identifying sequences enriched for protein binding sites.  First, protein binding microarrays (PBM).  The microarray contains many different short nucleotide sequences called probes at different array locations.  For example, an array can contains probes representing many variants of 10-nucleotide patterns (10-mers).  Even for a given 10-mer, it can include many probes where the same 10-mer is embedded within different sequence context of the longer probe sequence.  A TF protein is hybridized over the array and it will be stuck to the places where binding occurs and produce fluorescent readouts.  Therefore, PBM gives readout of probe sequences and their intensity readout indicates the strength of the binding to a specific TF.  A deep learning (DL) system aims to take these probe sequences as input $X$ and predict the binding score $y$ (i.e., intensity).  The natural choice of the loss function would be least-square error, as this is a regression task.  Second, ChIP-seq technology.  When a TF is cross-linked to genomics materials, DNA were chopped and all DNA fragments bound to the TF will be extracted through chromatin immunoprecipitation (ChIP) process, bound DNA/RNA fragments are NGS-sequenced to produce readout (NGS stands for next-generation sequencing).  ChIP-seq data only provides positive sequence fragments, one needs to added negative sequences (either taken from genome background, or the authors here generated negative sequence by permuting positive sequences), a DL system aims to classify an input sequence $X$ into binary $y$ of 1/0, i.e., bind or not bind.  Since this is a classification problem, cross-entropy loss function is a natural choice.  The paper also studied datasets generated using a HT-SELEX technology, where the data format is similar to that of ChIP-seq.

The motif discovery problem is a well-studied bioinformatics problem for the past twenty years (I also contributed an algorithm called GEMS).  The positive sequences are collected and algorithms look for patterns that are over-represented among them, compared to a pool of negative background sequences.  Alipanahi et al. used a DL approach, more specifically, a CNN + FNN architecture that archived better performance than existing tools.  It is a very clean application of CNN to bioinformatics.


Results


Architecture

In the diagram below, the system processes five input sequences in parallel.  This is rather a technical detail in order to make full use of the GPU resource.  For our purpose, we only need to focus on one sequence.  The sequence is first cast into two sequences, the original one and its reverse complement (e.g., ATGG is cast into CCAT as well).  This is because a TF could theoretically recognize either one of the DNA strand, or both in rare cases (Figure 4.e in the paper is such a rare case).  A motif is a sequence pattern represented by a 1-D CNN convolution filter.  More precisely, the input sequence is actually a 4-channel 1-D vector, where 4 channels encode the base composition A, T, C, G and the vector is a 1/0 binary vector one-hot encoding the input sequence.  For example, a short sequence "ATGG" can be represented as:


$$S_{ATGG} = \begin{bmatrix} A & C & G & T\\
. 25  & . 25 & . 25 & . 25\\
. 25 & . 25 & . 25 & . 25 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
. 25 & . 25 & . 25 & . 25 \\
. 25 & . 25 & . 25 & . 25 \end{bmatrix}. $$

Those rows containing 0.25 are paddings (represent "any" nucleotide) to extend the original sequence, so that convolution will not miss motifs partially overlapping with the ends.  If the CNN is designed to capture motifs of length $m$ (it will capture motifs length $\lt m$ as well), there will be $m-1$ extra 0.25- rows on both ends.  A motif is represented as a convolution filter $M$ that is known as position-weighted matrix (PWM) in bioinformatics, it basically specifies the weighting of each nucleotide at each location.  The output of the convolution layer is:

$$\begin{equation} \begin{aligned}
 X_{i,k} &= \sum_{j=1}^{m} \sum_{l=1}^{4} S_{i+j,l}M_{k,j,l}, \\
Y_{i,k} &= \max(0, X_{i,k}-b_k),
\end{aligned} \end{equation}$$

where $k$ is the index of a particular motif (a motif here is called a feature in CNN), $i$ is a location on the input sequence, $m$ is the length of the motif or the convolution window, the convolution therefore is taken over all positions $j$ and over all four nucleotide channels $l$.  The sum is then passed through a ReLU function as shown in the second step (weights can be negative, so neurons with convolution sum less than $b_k$ will not be activated).  The result is a feature map representing where a given motif is detected along the input.  A pooling layer is followed.  The authors used a max-pooling layer for DNA-binding proteins and both max- and average-pooling for RNA-binding proteins.  In the latter case, $k$ features results in $2k$ pooling maps.  The pooled output will then serve as the input for a fully-connected neural network (FNN) to generate target scores, one per input sequence.

This architecture is very simple, as it only contains one CNN convolution layer and a pooling layer.  The usage of CNN here is because the motif can be located at arbitrary positions in the input sequence, just like a dog can appear in any location of an image, CNN, beings translation-invariant, is a natural option for this problem.




Performance



The CNN is then trained with a set of experimental data (either PBM, ChIP-seq, or HT-SELEX) and its performance was shown to be consistently superior than established bioinformatics motif discovery tools.  Without getting into the details of biology applications, the figure below shows their CNN solution, named DeepBind,was able to out-perform other methods on PBM, ChiP-seq, SELEX datasets according to various performance metrics (AUC or Pearson correlation scores).



Why CNN is Better?


Unfortunately, the paper did not discuss why DeepBind performed better.  Although it shows many examples of how the motifs identified makes biological sense, but that is not something unique, as traditional motif discovery tools can produce those results as well (the author did not say traditional tools failed in doing that).  It was only mentioned in the beginning of the paper:

"The sequence specificities of a protein are most commonly characterized using position weight matrices (PWMs), which are easy to interpret and can be scanned over a genomic sequence to detect potential binding sites. However, growing evidence indicates that sequence specificities can be more accurately captured by more complex techniques".

This hints the motivation of their study is to describe motifs with a language more sophisticated than PWM.  However, motifs in CNN are described by filters, which is basically the same as PWM, what is the strength of this approach then?

I think what happened probably is DeepBind contains $k$ motif sub-elements.  In previous methods, only one PWM is used to explain biological measurements (for that specific transcription factor or RNA-binding protein).  However, in DeepBind, multiple elements are combined through a FNN for activity prediction.  Although each motif element is equivalent to a PWM, their combination could non-linear motif descriptor cannot be captured by PWM.  Hypothetically, imagine there are two motif elements in a real system: AGGGT and TGGGA.  That is "GGG" in the middle, when "A" appears on the left, "T" is favored on the right, similarity a left "T" pairs with a right "A".  However, AGGGA and TGGGT are not valid patterns.  However, if we only allow this to be represented using one PWM, the result is an "averaged" motif looks like "A/T-GGG-A/T", which would encode wron patterns such as AGGGA and TGGGT.  We therefore hypothesize the CNN system provides a way to capture motifs in a non-linear way, more powerful than PWM.


Another possible reason is NN allows a motif consisting of wide-spacing sub elements, which are very hard to be discovered by traditional algorithms due to the vast search space.  But wide-spacing motifs are important particularly for RNA-binding proteins, as they bind to RNA structures in non-trivial manner.  As shown below, unlike DNA-binding, RNA-binding proteins may not recognize only one binding site, but may actually recognize multiple binding sites on the surface of RNA.  Unlike double-strand DNA fragments (the last structure in the figure below), single-strand RNA can form complex secondary structures, oftentimes, the RNA structure contributes significantly to the binding no less than the specific oligonucleotide bases.  The authors commented: "RNA recognition motif, the most abundant RNA binding domain, is highly flexible, and usually the assembly of multiple domains are needed for proper binding. Third, “indirect readout” in which the structure of nucleotides guide the binding process, is more prominent in RBPs; Gupta and Gribskov analyzed 211 protein-RNA structural complexes and reported that nearly three-quarters of protein-RNA interactions involve the RNA backbone, not just the bases."  A RNA-binding protein could recognize multiple binding sites on the surface in contact, therefore, this is better modeled by multiple motif features in DeepBind, where they are combined through FNN to predict the binding affinity.  Traditional tools identify motifs as one continuous binding element, without synergistic combinations of underlying elements.  I suspect this might be the main reason DeepBind performs well, at least for the more difficult RNA-binding proteins.
Protein-binding interface on RNA (DNA, the last example) (source).  Purple: binding sites, yellow: non-binding sites.

We here only see one convolution layer, instead of multiple repeats of convolution plus pooling layers.  Hierarchical motif patterns may not be necessary for motif discovery problems, because although shorted motif elements, say "ATC" might appear in multiple longer motifs, the frequency of A, T, and C in the longer motif might be very different compared to their frequencies in the short elements.  So unlikely eyes, noses and mouths made up of faces, short motif PWMs may not be very effective in concatenated to form larger PWMs.


How to Visualize CNN?


A PWM generated by traditional bioinformatics tools can be easily visualized as a motif logo, where the height of a character encodes the frequency or information content of a base in its PWM.  CNN here contains multiple motifs and combines them in a non-linear way, therefore, it is conceptually incorrect to visualize each convolution filter alone.  How to visualize DeepBind is a rather challenging topics, the authors did not address this in my view.  Instead they adopted a much naive approach -- basically visualizing individual feature.  For each sequence in the test set, if the sequence contains a positive output signal for a given motif feature map, the corresponding input sequence fragment (gives the maximum activation) is retained.  Then all such CNN activating fragments are aligned and base counts are compiled to construct a PWM and subsequently generating a motif logo.  Some example logos are shown in the figure below, where it shows DeepBind model indeed reproduced many known motifs.  As we mentioned the downside of this approach is it really does not visualize the non-linearity introduced in the FNN component as we hypothesized, why DeepBind performs better is really not answered in this simplified representation.

Examples of known motifs discovered by DeepBind.

Mutation Map


An alternative visualization that is conceptually more powerful is called mutation map.  Given an input sequence $s$, DeepBind predicts a binding score $p(s)$.  We can then mutation the nucleotide at each position into a different bases, i.e., resulting in a new sequence $\hat{s}$ and predict a new binding score $p(\hat{s})$.  The importance of a particular single-base mutation, therefore, can be quantified by:

$$\begin{equation} \Delta S_{ij} = p(\hat{s})-p(s). \end{equation}$$

The nice thing is the mutation score is based on the whole CNN plus FNN system, therefore, include all the non-linear effects.  However, I guess the above equation did not work well as it is.  As the predicted scores do not necessarily represents the change in the unit of binding energy; The scores and binding energy might be related through some non-linear function.  The authors empirically used the following formula instead, where the effect of a mutation is magnified by the binding strength of the mutation position:

$$\begin{equation} \Delta S_{ij} = (p(\hat{s})-p(s)) \cdot \max(0, p(\hat{s}), p(s)). \end{equation}$$

Two examples of mutation maps are shown below.  On the left, it shows the promoter region of LDL-R gene; LDL-R is the LDL receptor gene and LDL is known as the "bad cholesterol".  The wild type genome contains based C in the middle of promoter, which can be recognized by SP1 protein, which then drives the expression of the LDL-R, and that helps recognize excess levels of LDL and regulates its removal from the blood.  Patients with this base mutated to G seriously decrease the binding score (see the red dot at the bottom), as the mutant version is no longer recognized by SP1.  This leads to the deleterious condition known as hypercholesterolemia (high level of cholesterol) due to the low expression of LDL-R.  On the right, a mutation in the globin gene, from A to G, creates a new binding site TGATAA attracting a TF protein called GATA1.   The binding of GATA1 blocked the genome region originally reserved to serve as the promoter of globin.  As the result, globin is no longer expressed at sufficient level, which leads to a disease called $\alpha$ thalassemia (insufficient amount of hemoglobin to carry oxygen in blood).



How DeepBind is Optimized?


The DeepBind system contains many hyperparameters, including motif length $m$, number of motifs $k$, learning rate, momentum, batch size, weight decay, dropout rate, etc.  The training process first randomly sample 30 sets of such hyperparameters (see Figure below), using 3-fold cross validation to evaluate their performance.  The best performing set of hyperparameter is then used to repeat the DeepBind training process using the whole training dataset three times.  This is to avoid statistical fluctuations during the optimization process, the best performing set of DeepBind model is then chosen as the final model for the given protein and used to predict the test dataset and produce performance metrics.  One DeepBind model was trained per protein per experimental data type.




Conclusion


CNN is a DL architecture suitable to discover translation-invariant patterns in inputs.  DeepBind system uses CNN to capture protein-binding motifs on DNA/RNA, where NN architecture enables it to capture some non-linearity effects and/or aggregation effects of multiple underlying motif patterns, as well as aggregate motifs that are widely spaced.  The study showed DeepBind surpassed traditional bioinformatics motif discovery tools and was able to explain a number of disease causing mutations.  Our visual estimation shows the improvement, compared to traditional approach, is at about a few percentage, similar to what we generally see from DL applications in bioinformatics.  However, the architecture of DeepMind is a lot simpler compared to the many sophisticated motif discovery algorithms developed over the passed two decades.  The authors did not offer explanations on why DeepBind performed better.