Bioinformatics Application of Deep Learning Technology
Resource
We will discuss the following paper: Sonderby et al., Convolutional LSTM Networks for Subcellular Localization of Proteins. The first draft is available
here, the published version of the paper with a much more sophisticated architecture and more detailed technical descriptions is available
here. As the two versions are rather different, we discuss both.
All figures shown here should be credited to the two papers, unless the source is explicitly provided.
Introduction
Proteins are products of a factory called ribosome, each protein has its own blueprint called message RNA (rooted from its gene structure). They are bikes, cars, ships and airplanes in biological systems, therefore they need to be delivered into different biological compartments in order to carry out their functions. A ship cannot sail on land, it needs to be moved onto the sea, and cells have a complex UPS delivery system, where it is able to read the "shipping barcode" carried on the protein sequences and ship the proteins to their destinations. E.g., some proteins will stay within the cell, but in different organelles, some may sit on cell membrane, others may be shipped outside into extracellular space and serve as carriers or messengers. For a nice introduction on how proteins are delivered to different destinations, please read
this. To predict the subcellular location of a protein is a very interesting topic in bioinformatics, as it can provide clues to the function of a protein.
Existing bioinformatics prediction tools often are not ab initio, which means they tend to use information other than the sequence of the protein. E.g., the MultiLoc program has a MotifSearch component, which basically used knowledge of experimentally confirmed signal motifs; many "cheat" by using homology information [1] (it is not a true prediction, but rather a mix of prediction and annotation). These tools will perform poorly, when the protein to be predicted has no close homologues in the training data set. In addition, existing bioinformatic tools relying on traditional machine learning techniques, such as SVM, require a fixed length input, therefore, they are conceptually not the most suitable choice for the variable protein lengths and are very difficult to be further improved. Sonderby et al. was able to use Deep Learning tools to build a DeepLoc system, that improved the prediction accuracy from 0.767 (MultiLoc) to 0.902 (Table 1 in [2]) in their draft version, when only sequence data was used [2]. In the later publication, they were able to further improve their solution to surpass other tools on a more realistic data set, 0.7797 for DeepLoc versus 0.5592 for MultiLoc2, when homology factor is largely eliminated [3]. Let us see how they did it.
Architecture
Encoding by CNN
We first need to represent an input protein sequence. There are 20 types of amino acids, therefore, a protein sequence of $n$ amino acids is naturally a 20-element one-hot-encoded vector $20 \times n$. Amino acids have different physicochemical properties, some are charge neural, some are positively or negatively charged, some are polar or hydrophobic, etc. The one-hot encoding does not take advantage of this knowledge. A popular approach is to use the BLOSUM matrix element as the scores for each amino acid. In the first draft, the authors use a combination of four different representations: one-hot, BLOSUM80 matrix, HSDM matrix, and sequence profiles (each protein is blasted against protein database to fish out its homologs; a profile is then constructed based on the multi-sequence alignment.). If you are not familiar with bioinformatics, just consider BLOSUM matrix only as a similarity matrix, where chemically-similar residues have a higher positive score and chemically-opposite residues have a negative score. In the final publication, the authors actually only chose one of the four representation; let us pretend the BLOSUM matrix was the one chosen in Figure 1.
First, the BLOSUM representation is transformed to be 1000 in length (only 10% of proteins are longer than this size). If a protein is shorter, it is padded by NULL amino acids in the middle, otherwise, the extra residues in the middle are truncated. This is because it is known that the signal carrying the location information is often located at the two ends (N-terminal and C-terminal, i.e., left and right ends) of a protein. This makes biological sense, as ends of proteins tend to be loose and are available to be in contact with other proteins to allow the signal to be utilized, while the middle part can be wrapped and embedded within a 3D structure and inaccessible.
|
Figure 1. A CNN for encoding an input protein sequence. [3] |
The $1000 \times 20$ input matrix is then fed to six independent CNN modules (Figure 1). Take the left most CNN module as an example, it has 20 features, each uses a convolution filter spanning 21 positions (as the input has 20 channels, one per an amino acid channel), the feature map dimension is $21 \times 20 \times 20$. The right most module has no convolution, simply passing the original input. Therefore, this CNN system looks for motifs of length 1, 3, 5, 9, 15 and 21. Why do we want to encode the input using motifs? We think the delivery system relies on some general tags (motifs). All proteins to be shipped extracelluar may share a few motif patterns, so when CNN is trained, such motif patterns are over-represented, as they appear more frequently in the protein sequence. Then CNN is able to represent a protein sequence by saying there is a motif A here, and there is a motif B there; such representation certainly makes downstream prediction way easier. All the $20 \times 6 = 120$ feature map outputs are combined into a $1000 \times 120$ motif representation for the input sequence. Another layer of convolution is introduce to carry out one more covolution using filter size $3 \times 120$ and obtain 128 feature maps of size 1000 each. So the end feature matrix is of dimension $1000 \times 128$.
Bidirectional RNN
The RNN is made of a LSTM cell of 256 hidden elements. If we pipe the $1000 \times 128$ sequence through the LSTM unit from N-to-C direction, the N-terminal signal may still be hard to retain, when the scanning reaches the C-terminal end. Despite LSTM has longer memory, at least it is not so easy to optimize during the training process. In fact, most of the location barcodes for a protein are near either the N-terminal (left end) or the C-terminal (right end). Therefore, another popular technique to retain more signals from both terminals is to use two LSTM cells and scan the protein in both directions: N-to-C and C-to-N, respectively. When the LSTM reads in one input base, it produces an output $\mathbf{h}$, therefore, the two directional LSTM have two outputs at each time step, both are combined into a final output vector, i.e., $\mathbf{h}_{N2C,t}$ is concatenated with $\mathbf{h}_{C2N, t}$, and we have hidden state vector $\mathbf{h}_t$ of size 512 elements (shown as the blue and red bars in Figure 2). As there are 1000 time steps, the feature now is $1000 \times 512$.
|
Figure 2. Bidirectional LSTM network is used to generate 1000 vectors of 512 hidden states. [2] |
Attention Mechanism
Not all of the 1000 hidden states are equally informative. Thanks to the dual scan, the hidden state for each time step captures the information for the complete sequence, therefore, we can certainly use $\mathbf{h}_1$ and $\mathbf{h}_{1000}$ as the new input and hook up a FNN to classify proteins into different classes, representing different subcellular locations. This should work, however, may not be optimal (as shown in Figure 6 later). The footprint of the destination motif is the strongest, when it is just seen, therefore if the location barcode of a protein locates at the 100th amino acid (N-term, left end), the hidden states around that time step would be the best for prediction. So the idea is somehow we aim to come up a 1000-element weighting vector, where each of the 1000 states are weighted differently. Ideally, the states closer to the useful tags have larger weights, we can then weighted average all 1000 hidden states into a final state vector, which should then lead to better predictions. To figure out which time steps the latter classification system should pay attention to, the authors used an attention mechanism. The attention mechanism described in their final publication [3] is rather complicated and hard to understand, so I choose here to discuss the mechanism presented in their draft version [2].
The attention component has a weight matrix $\mathbf{W}_a$ and an attention output vector $\mathbf{v}_a$. We compute an attention scalar for time point $t$ by:
$$\begin{equation}a_t = \rm{tanh}(\mathbf{h}_t \mathbf{W}_a)\cdot \mathbf{v}_a^\intercal. \end{equation}$$
The way I interpret Equation 1 is there are some motifs that are very informative for location prediction, they are encoded by some elements of the hidden states. Say we have $k$ important motifs, each is associated with a few elements in the hidden states, therefore $\mathbf{W}_a$ is a matrix of $k \times 512$ that extracts the presence data of each of the $k$ motifs from within the hidden state. The output of $\rm{tanh}(\cdot)$ thus is a "probability vector" indicating whether $\mathbf{h}_t$ contains those $k$ motifs. In the second step, considering not all such motifs are of equal importance, therefore, $\mathbf{a}_v$ is a vector coding the importance of each of the $k$ motifs. The resultant scalar $a_t$ is therefore the logits measuring the importance of each hidden state. The weight for each hidden state can then be obtained by softmax:
$$\begin{equation} a_t = \frac{exp(a_t)}{\sum_{t^\prime=1}^T \; exp(a_{t^\prime})}. \end{equation}$$
The final context vector representing the protein is a weighted sum:
$$\begin{equation} \mathbf{c} = \sum_{t^=1}^T \; \mathbf{h}_t a_t. \end{equation}$$
This then serves as the input to an FNN to classify proteins into 11 classes. The authors used a single layer FNN in their draft version; they also introduce a hierarchical tree Bayesian style predictor we will describe later in the discussion.
Results
Each protein is encoded by its context vector $\mathbf{c}$, Figure 3 is the t-SNE 2-dimensional visualization of all proteins predicted using two separate data sets. It is clear that the encoding (basically the transformed feature space) is very good at separating proteins of different classes, therefore, the above system is quite capable of extracting the location-rich features, and the downstream classification system does not have to be too sophisticated. $\mathbf{c}$ is very interesting, they encode the variable length proteins into a fixed length vector. Unfortunately, they are trained based on subcelluar location data, they cannot be used as general-purpose protein vectors.
|
Figure 3. t-SNE representation of the context vector, colored by different protein classes [3]. For the data set on the right, there is quite some degree of homology between training and test sets, therefore, data points are better separated. For the data set the authors constructed on the left, there is less homology between training and test sets, so it is a more realistic reflection of the prediction quality on new proteins, which therefore is expected to be less well separated on t-SNE map. |
Figure 4 plots the attention vector, where we expect the locations catching the attention tend to be where location tags resides. It is clear that extracellular proteins, plastid proteins, mitochondrion proteins have their signals at the very end of the N-terminal. Transmembrane preteins have their signals scatter across the whole range, where there are dark fragments probably indicating transmembrane regions (where those are hydrophobic amino acides, which are probably extracted by the CNN). These transmembrane tags in the middle part of the protein are not necessary our shipping barcodes, they are probably just signature motifs for transmembrane protein, therefore, highly correlated with the true shipping barcode. Such patterns are useful for prediction, but are not necessarily the reason for cells to sort proteins. For most proteins, signals are mostly distributed around the N-terminal, and sometimes towards the C-terminal. The attention data makes sense.
|
Figure 4. Attention vectors for proteins of different classes [3]. |
Figure 5 shows the overall performance data of the describe system - DeepLoc, compared with other published methods using the DeepLoc data set. The DeepLoc approach is noticeable better.
|
Figure 5. Performance comparison of DeepLoc (this study) to other methods. [3] |
Discussion
Comparing the first draft (unpublished version) [2] and the final published version [3], we can see the architecture of the network were changed quite a lot, it became way more sophisticated. The authors initially easily obtained better performance than MultiLoc using the MultiLoc dataset - Hoglund dataset [1]. However, the Hoglund dataset over estimates classification accuracy, because its test set contains many proteins homologous to the proteins in the training set, therefore, makes the prediction easier. The authors later decided to create a new dataset, where they first clustered proteins, and then divided proteins into different cross-validation partitions cluster by cluster. This way, the tools cannot "cheat" by homology information and the true prediction performance is measured. I suspect this probably led to some poor performance of their initial architecture, so they started adding lots of bells and whistles. However, based on the performance data, I am not fully convinced the complexities introduced into the final version is all that essential. It feels to me more like the authors wanted to describe all these add-ons, since they had already tried those ideas and did not want to waste them by not publishing. I will explain why I feel this way.
Input Representation
The original draft represents a protein by a $1000 \times 80$ matrix, where they combined four different amino acid representations. That sounds intuitive. In the final version, they first identified the optimal hyperparameter set by scanning one hyperparameter at a time, then fixed the optimal hyperparameter and tested each of the four representations alone. The profile representation was found to be the best and was chosen for their final performance comparison. However, the performance data published in Figure 6 was not based on profiles, but based on BLOSUM instead (interesting, why do that?).
Based on this table, the bidirectional LSTM system indeed significantly outperformed FFN (Feed Forward Network) (0.69 versus 0.52). This was further improved by introducing the attention mechanism (0.73, a 5% improvement). Considering attention mechanism offers additional benefit of visualizing where motifs are, this is a worthwhile addition. However, adding convolution components did not make any difference. One could simply use the A-BLSTM system, which would be conceptually a much cleaner design I would personally favor. I suspect the similar table for profile-based performance might show a marginal improvement of CONV A-BLSTM over A-BLSTM, however, the difference should not be too much. This is one reason I feel the architecture in the final publication seems overdone and runs the risk of overfitting. However, since the authors did cross-validation correctly (they emphasized they did not peek into test set during hyperparameter and parameter optimization), we cannot point finger at what might be wrong with their bells and whistles. The take home message here is the A-BLSTM system is the core piece of DeepLoc, we do not have to pay much attention to the CNN part.
|
Figure 6. FFN: feedforward neural network, BLSTM: bidirectional LSTM, A-BLSTM: BLSTM with attention mechanism, CONV A-BLSTM: A-BLSTM with convolution component added. [3] |
The attention mechanism in their final version is rather complicated, using two RNN systems and a 10-depth iterations to obtain the final weight vector for all time points. I have a hard time understand the rational behind (not really discussed in the paper) after quite a few reads, therefore, I choose to only discuss the attention mechanism they used in their original draft.
Classification
In the draft version, each protein is classified into one of 11 classes. Although it was not described, but we can comfortably assume softmax classification is used. In the final version, two classes lysosome and vacuole were combined into one class, probably because the individual accuracies associated with them were too low. In that version, besides softmax, the authors also introduced a Bayesian model to mimic how a protein package is sorted by cell (see
link). As shown in Figure 7, in order for a protein to be predicted as an extracelluar protein, it first need to be predicted as a secretory protein at the tree root, then extracellular, finally an extracellular protein. Each white node is associated with a logit output of the classification network, however, these nine white nodes are not softmax'ed as a whole, as their probability should not summed to one. Instead, each of the white node represents a conditional probability, the logit output of each is directly cast into a probability by a sigmoid function, $\sigma = \frac{e^x}{1+e^x}$. This is because the probability of the two branches given the parent node should sum to one, for example, secretory protein and non-secretory protein sum to one. The probability of intracellular and extracellular protein, given the protein is a secretory one, again sum to one. To calculate the probability associated with each class, i.e., the probability associated with the black nodes in Figure 7, we use the Bayesian graph model. For instance:
$$\begin{equation}p(Golgi|X) = p(Golgi|ER/Golgi)p(ER/Golgi|Intracellular) \\ p(Intracelluar|Secretory Pathway)p(Secretory Pathway|X).\end{equation}$$
Notice that the secretory pathway does not mean the protein will be extracellular. Proteins eventually bound to endomembrane systems will also be first sorted into secretory pathway (entering rough ER) while the protein is being translated, therefore, for secretory-pathway proteins, there must be an N-terminal signal. However, this tree-based classification model did not produce better results compared to the simple softmax model. Then why did the authors describe them in the paper, I can only say they did not waste any material they used during their study, when they wrote the paper. So it is important for us to remain focus on the core architecture of DeepLoc, not be distracted by the many details.
|
Figure 7. The prediction tree. [3] |
Datasets
The paper pointed out an important mistake that is often made by machine learning applications, that is whether the random partitioning the data into train-test sets realistically reflect the final goal of the machine learning task. If our goal is to predict proteins, where they are often homologous to our training proteins, the traditional random partition of the dataset works fine. However, if our goal is more ambitious, i.e., we aim to predict locations of novel proteins, for those that are not homologous to our training set, we would need to make sure there is homology between our training set and test set.
In Figure 8, DeepLoc is a dataset authors compiled without much homology between training and test sets (as proteins were clustered and then divided by cluster), Hoglund is a data set used by MultiLoc that share homologs, as random partition is used. Therefore, using Hoglund as the training set, we see a high accuracy of 0.9138 in the Hoglund test set (4th row), however, this does not reflect the true performance on novel proteins. The true performance is 0.6426 (2nd row), much lower. On the contrary, if it was trained based on DeepLoc training set, its performance is 0.7511 on DeepLoc test set (first row), which is a more realistic measure, as it actually performs 0.8301 (3rd row), generalization is better than expectation.
|
Figure 8. Performance of using different training and test data sets. [3] |
Conclusion
The authors started from amino acid sequence information to predict subcellular locations of proteins without experimental meta data. They focus on using RNN to construct a good fixed-length feature representation for input sequences. The amino acid sequences were first analyzed through a CNN system, where sequences were represented in the motif space. Then these new inputs were processed by an LSTM system from both N-term to C-term direction and backwards to avoid signal lose. The authors also introduced an attention mechanism to further enhance the possible tag signals by weighted averaging the 1000 hidden states into a final fixed-length context feature vector. They were able to show the context feature vector was very efficient in encoding protein location information (Figure 3) and the attention vector indeed seem to highlight where the location tags were (Figure 4). The take-home message is the LSTM (RNN) architecture can be used to effective encode variable length inputs.
Reference
1. MultiLoc: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btl002
2. https://arxiv.org/abs/1503.01919 (Table 1)
3. DeepLoc: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btx431 (Table 7)