Sciency Stuff

Sunday, September 22, 2019

Bound-constrained Optimization by Variable Mapping

This is a note written many years ago, when I was working on a curve fitting problem. Parameter fitting with bound constrains was not supported in numerical libraries available at the time. Many people approached the bound-constrained problem through adding a penalty term, which distorts the original target function. Therefore, the ideas below is still worth a post in my opinion.

Introduction

Considering the optimization problem of finding a minimum of a function $f(x), x \in R^n$, subject to bounds $a \le x \le b$, i.e.:

$$\begin{equation} x^* = \arg\min_x f(x), x \in [a,b] \end{equation} $$.

Although solutions have been described previously [1-2], we here propose an alternative conceptually simpler approach by optimizing an equivalent function $f(y), y \in R^n$ and $-\infty \le y \le \infty$. That is by mapping the constrained variable $x$ into an unconstrained variable $y$ via $x = \mathcal{M}(y)$, the solution $y^* = \arg\min_y{f(y)} $ can be found using any unconstrained optimization solver, then map $y^*$ back, $x^* = \mathcal{M}(y^*)$.

Variable Mapping

It is a trivial case if $a$ is $-\infty$ and $b$ is $\infty$, i.e., an unconstrained case. Therefore, we only consider mapping functions for the following three non-trivial cases:

Case A: given constrain $a \le x \le b$, define $y$, so that
$$\begin{equation} x = \frac{e^y-e^{-y}}{e^y+e^{-y} }\frac{b-a}{2}+\frac{b+a}{2}. \end{equation} $$
As $x$ increases from $a$ to $b$, $y$ increases from $-\infty$ to $\infty$ monotonically.

Case B: given constrain $-\infty \lt x \le b$, define $y$, so that
$$\begin{equation} x = b-e^{-y}. \end{equation}$$
As $x$ increases from $-\infty$ to $b$, $y$ increases from $-\infty$ to $\infty$ monotonically.

Case C: given constrain $a \le x \lt \infty$, define $y$, so that
$$\begin{equation} x = a+e^{y}. \end{equation}$$
As $x$ increases from $a$ to $\infty$, $y$ increases from $-\infty$ to $\infty$ monotonically.

With the above mapping functions, the minimization of $f(x)$ with bound constrains is equivalent to the minimization of $f(y)$ without any constrain.

Example A

Figure 1. $f(x) = -x^2$

Let's find the $x$ that minimize $f(x) = -x^2$, with the catch that $x$ can only be in $[1,2]$. We know the answer should be 2. However, if you directly solve this problem numerically without constrains, the answer will be $x = \inf$, outside the domain box.

Using Equation 2, we can define $y$ as:

$$\begin{equation} x = \frac{1}{2}\frac{e^y-e^{-y}}{e^y+e^{-y}} + \frac{3}{2}. \end{equation}$$

We then find the $y^*$ that minimizes $f(y)$:

$$\begin{equation} y^* = \arg\min_y - \left( \frac{1}{2}\frac{e^y-e^{-y}}{e^y+e^{-y}} + \frac{3}{2} \right)^2. \end{equation}$$

Now numerical solution of the above equation returns a really large value $y^* = \infty$.

Therefore, corresponding solution according to Equation 5 is: $x^* = 2$, satisfying the constrain.

Example B

This example was what motivated this study. If it is not understandable to you, do not worry. A common problem in biomedical research is to characterize the potency of a compound in a biological assay by determining the parameters described in a logistic regression formula. By taking $n$ measured data points $(c_i, r_i), i = 1, …, n$, where $r_i$ is assay activity measured with dosing the compound at concentration $c_i$. The compound is characterized by four parameters, which should be determined by minimizing the least square error between experimental data points and their corresponding theoretical predictions as the following:

$$\begin{equation}
\min_{x_1,x_2,x_3,x_4 }⁡ \sum_{i=1}^{n} \left( x_1+\frac{x_2-x_1}{1+{(\frac{c_i}{x_3})}^{x_4 }}-r_i \right) ^2 . \end{equation}$$

To ensure the parameters $\textbf{x}$ are biologically meaningful, $x_1,x_2,x_3,x_4$ are subject to individual constrains, for instance, $0 \le x_1 \le 0.2, 0.8 \le x_2 \le 1.2, 0 \lt x_3$, and $0.3 \le x_4 \le 3.0$. Parameters $x_1, x_2, x_3$, and $x_4$ are also known as bottom, top, $IC_{50}$, and Hill slope of the dose-response formula (see [3] for details). A negative $IC_{50}$ would be biologically meaningless.

To solve the above minimization problem, we optimize a similar unconstrained function of $y$:

$$\begin{equation} \min_{y_1,y_2,y_3,y_4 }⁡ \sum_{i=1}^{n} \left( \mathcal{M}_1(y_1)+\frac{\mathcal{M}_2(y_2)-\mathcal{M}_1(y_1)}{1+{(\frac{c_i}{\mathcal{M}_3(y_3)})}^{\mathcal{M}_4(y_4) }}-r_i \right) ^2 , \\
\textbf{y} \in (-\infty, \infty).\end{equation}$$
with mappings (based on Equation 2 and 4):

$$\begin{equation} \begin{aligned}
x_1 &= \mathcal{M}_1(y_1) = 0.2 \frac{e^{y_1}}{e^{y_1 }+e^{-y_1}} \\
x_2 &= \mathcal{M}_2(y_2) = 1+0.2 \frac{e^{y_2 }-e^{-y_2 }}{e^{y_2}+e^{-y_2}} \\
x_3 &= \mathcal{M}_3(y_3) = e^{y_3} \\
x_4 &= \mathcal{M}_4(y_4) = 1.65+ 1.35 \frac{e^{y_4 }-e^{-y_4 }}{e^{y_4 }+e^{-y_4}} \end{aligned} \end{equation}.$$

We use any numerical solver to solve $\textbf{y}^*$ in Equation 6, then find out $\textbf{x}^*$ using Equation 7. All constrains on $\textbf{x}$ are automatically satisfied.

Discussion

The variable mapping approach introduced appears to be conceptually simpler compared to ones previously introduced. By mapping the constrained variable space $x$ into unconstrained space $y$, we can utilize nearly any well-studied general optimization algorithms to solve the unconstrained problem. The mapping functions introduced here are generally well behaved and do not expect to significantly increase the complexity of the original optimization problem. However, better mapping functions most likely exist. If the general optimization algorithm of interest requires calculating the derivatives of $f(y)$ with respect to $y$, such derivatives can be easily computed. Taking for instance the case A mapping described in equation 2:

$$\begin{equation} \begin{aligned}
\frac{\partial{f(\textbf{y})}}{\partial{y_i}}&=\frac{\partial{f(\textbf{x})}}{\partial{dx_i}} \frac{d \mathcal{M}_i(y_i) = }{d y_i} \\
&=\frac{2(b-a)}{(e^{y_i}+e^{-y_i} )^2 } \frac{\partial{f(\textbf{x})}}{\partial{dx_i}} \end{aligned} \end{equation}$$.

Reference

Powell MJD, The BOBYQA algorithm for bound constrained optimization without derivatives. DAMTP 2009/NA06.
Dieter Kraft, Algorithm 733: TOMP – Fortran modules for optimal control calculations. ACM Transactions on Mathematical Software. (1994) 20:262-281.
https://en.wikipedia.org/wiki/Dose%E2%80%93response_relationship

Friday, April 19, 2019

Notes on Biological Image Analysis

Table of Content

Chapter 1. Overview
Chapter 2. ImageJ Training Course
Chapter 3. 3D Imaging Analysis

Chapter 1. Overview

Computer Vision

Imaging is an effective way to digitally capture the state of a biological system, therefore, it is naturally used to also characterize the changes in the system. For example by comparing the changes in cell count between a cancer cell line and a normal cell line after the same chemical perturbation, we can quantify the potential differential anti-tumor activities of the compound. Similarly morphological changes within a stem cell population gives hints to the effectiveness of a compound in some cell regeneration process.

Human vision is superb at recognizing objects, e.g., given a cell image (Figure 1.a), we have little trouble in outlining the contours of both individual nucleus and their corresponding cytoplasm boundaries (most of them) (Figure 1.b). The boundary outlining process is called segmentation. In modern drug discovery facilities, a high-throughput high-content screen may involve a million compounds and one thousand cells per compound treatment. Repeating this process with human labor to depict one billion cells is clearly not an option. Besides being slow, human is very poor at being precise and consistent at both outlining these cellular compartments throughout the whole segmentation process. In addition, we need to take measurements, such as the area, perimeter, and average intensities within each cytoplasm region after the segmentation step. Computer is clearly more efficient at doing that. Without question we must rely on computer vision technologies to analyze digital images automatically. We here focus on the topic of segmentation alone.

Figure 1.a

Figure 1.b The cell image is taken from a demo image named "\Widefield Images\Segmentation\Cell.tif)" used by "Fiji Training Notes".

The example image in Figure 1 is not too hard to segment by computer. One idea is to set an intensity cutoff, as nuclei are the brightest in general compared to other compartments, and cytoplasm pixels are brighter than the background dark pixels. Although this threshold idea is straightforward to implement, some manual tweaking is still required. This is because not all nuclei are of the same brightness. To make the point, we turn the intensity dimension (Figure 1.a) into height and show the cell image as a 3D surface plot (Figure 2.a). Nuclei are of varying heights, i.e., different intensities. When we try to set a universal intensity cutoff, say based on the quality of segmenting one nucleus (yellow arrow in Figure 2.c), other brighter nuclei (green arrows in Figure 2.c) fused into one super nucleus (Figure 2.b-c). Vice versa, if the threshold is too stringent (high), some nuclei will be segmented too small or even missed. Fused nucleus will require additional declumping algorithms, which is a difficult topic in its own. People also develop strategies to go beyond the naive approach and introduce dynamic cutoffs, one can be really creative here and probably needs to be creative in different ways depending on the image at hand. Even with this simple example, we can already appreciate that performing good quality analysis using computer vision techniques is not a trivial task. Years of experience are required and it will not always be successful at the end.

Figure 2.a

Figure 2.b

Figure 2.c

To learn computer vision techniques, I recommend a well-written book "Digital Image Processing Using Matlab" by Gonzalez et al. This is actually a hands-on companion book to a more algorithm-focused one by the same authors, "Digital Image Processing". Both books are highly recommended.

I do not plan to write about computer vision algorithms in this blog series, but would instead write about a powerful open-source program called ImageJ. ImageJ allows one to carry out many computer vision analyses without programming; nearly all the figures used in this blog are generated with ImageJ, which should give you a peek at its capability. There are tons of functions in ImageJ, so it can feel overwhelming for new users. Based on my own learning experience, I designed a few exercises that can help new users to quickly grasp those core features of most relevance to biological image analysis in my opinion. ImageJ is the topic of Chapter 2.

Machine Learning

We mentioned even a relatively simple cell image such as Figure 1 can still be tricky to analyze, because there are plenty of variations in real life biological images. Nuclei and cells are of varying intensities, shapes, and sizes; some objects are too close to each other and can be hard to segmented apart, the cell boundaries in Figure 1 can be too hard to determine even by eyes; background may contain noise or unevenness in illumination, etc. We need to be creative in devising clever rules to overcome these challenges, one after another, in order to get good quality segmentation results at the end. Developing a robust segmentation pipeline that can successfully batch process thousands of images or more is challenging and can easily be the bottleneck of most bio-imaging facilities.

An alternative approach is to rely on machine learning. In the computer vision approach, we try to conceive segmentation rules based on our observations. We test these hypotheses using some sample images and then implement these rules in computer code. Computer simply executes these rules provided by us onto images in batch. Therefore, this is basically a "human learning" process, where human did all the intelligent part of the work, while computer solves the engineering scale-up issue. Alternatively, in the machine learning paradigm, computer conceives the segmentation rules based on training data provided by human, so the most difficult and intelligent step, the rule construction task, is offloaded on to machine, as we hope artificial intelligence can supersede human intelligence in the field of biological image analysis, similar to what is happening in many other fields.

One very successful implementation of this approach is found in a software called Ilastik. The concept is well illustrated in this YouTube video. Given an input image, Ilastik first computes a set of predefined features $\{f_i(x)\}$. Each feature $f_i$ is generally defined by a filter matrix; the filter is applied onto the input image $x$, performing pixel-level computations within a given neighborhood. This process is called convolution. It results in a new image of the same dimension and the output image is called a feature map $f_i(x)$. E.g., an edge filter applied on Figure 1.3 generates a new feature map image outlining the edge of nuclei (Figure 3.d). There are multiple features predefined in Ilastik ranging from intensity, edge to texture characteristics (Figure 3.a, row-wise), and each feature can be associated with multiple scales (column-wise in Figure 3.a) to capture features at different geometrical distances. Figure 3.b-d shows some example feature maps of Figure 1. These powerful features are the results of decades of hard research work done by computer vision scientists.

Figure 3.a Ilastik features.

Figure 3.b Example intensity feature map.

Figure 3.c Example edge feature map.

Figure 3.d Example texture feature map.

With the collection of dozens of feature maps $\{f_i\}$ calculated, we provide training pixels by drawing a few example scribbles, some are labelled as nuclei (red in Figure 4) and others as background (yellow in Figure 4, here background actually means non-nuclei). These training pixels with correct labels will enable Ilastik to train a random forest engine, which can then automatically predict the labels for the remaining pixels. The transparent red and yellow masks overlaid in Figure 4 are the result of such a prediction, where just a few training strokes are sufficient to teach Ilastik to come up with a random forest model that recognizes nuclei pretty well.

Figure 4. Ilastik training.

Considering this as a binary classification problem, we provide true labels $y_j$ (nucleus or non-nucleus as 1 or 0) for a few given pixels $j$ under the scribbles, random forest aims to combine precomputed $\{f_i\}$ into a probability function $\mathcal{P}$, which optimally predicts the probability of training pixels belonging to the nucleus class.

The probability function is:

$$\begin{equation} \mathcal{P}(x) = \mathcal{P}(f_1(x), f_2(x), \cdots, f_i(x), \cdots). \end{equation}$$

The loss function to be minimized is the cross-entropy score:

$$\begin{equation}\mathcal{L} = \sum_{j} -(y_j \log(\mathcal{P}(x_j)) -(1-y_j)log(1-\mathcal{P}(x_j)). \end{equation}$$

Since the individual feature maps (as shown in Figure 3.b-d) are generally rather smooth functions, $\mathcal{P}$, as the result of their non-linear superposition, also tend to be smooth. Given the target probability function (1 on nuclei and 0 elsewhere) is a reasonably smooth function, it only requires few training data points $x_j$ in order to tune the model parameters within $\mathcal{P}$. This is why the random forest model inside Ilastik can work rather effectively and efficiently for this relatively easy example; very few training strokes are needed.

When we examine the output probability prediction $\mathcal{P}(x)$ in Figure 4.a, we can see that all the pixels within nuclei are white and the rest are brown/black. In the 3D surface view of Figure 4.b, all nuclei are colored in red, which means they all reach the similar height of about probability 1.0, while non-nuclei pixels have probability value around 0.0 and colored in blue.

Compare the probability prediction in Figure 4.b to the original intensity in Figure 2.a, we can see the machine learning process essentially makes a transformation that largely eliminates the original intensity variations in the nuclei and turn the original image into a new one where the pixel intensity represents the probability of its belonging to part of a nucleus. This way all nuclei are well separated from each other and have about the same height.

Figure 4.a Probability function $\mathcal{P}$ constructed by Ilastik.

Figure 4.b Probability surface function $\mathcal{P}$ constructed by Ilastik.

Similarly, the same strategy can by applied to segment cell boundaries (Figure 5).

Figure 5. Example of using Ilastik to segment cells. Image named "Widefield Images\Segmentation\Cytoplasm.tif" is taken from demo image package used by "Fiji Training Notes".

The above machine learning approach relies on two inputs: (1) a set of predefined feature maps, (2) some user-supplied labels of training data pixels. Its output probability function tends to have many desirable normalization properties, if the segmentation works well, e.g., nuclei are already declumped, background are corrected, all nuclei are of about the same intensity (probability approaching 1.0). With all the complexities and variations in the original cell image absorbed by the random forest model, we can now carry out a human-based segmentation task using the probability image as our new input. A simply threshold cutoff of 0.5 would do a decent job in segmenting out nuclei.

Compared to the computer vision approach, although we introduce an additional machine learning step, we only need to spend time in annotating some training pixels, the rest is done by the machine. The whole training process only takes a minute or two, leading to a drastic improvement in efficiency and quality. By offloading human learning on to machine learning, we are liberated from the headache of constructing complex segmentation rules ourselves, thus shifting the bottleneck into computational resource needs.

Deep Learning

If we think the machine learning approach has addressed all the issues, we are certainly way too optimistic. Even for some images seemly trivial to segment by eyes, Ilastik can still fail miserably. An such example is provided in Figure 6. The original image (Figure 6.a) appears rather clear to our eyes, the boundaries of erythrocytes become more clear in the corresponding edge feature map (Figure 6.b). However, despite lots of efforts, neither images can be segmented appropriately using Ilastik (Figure 6.c).

Figure 6.a Erythrocyte image. The example is taken from the U-Net web site. The file named "\DIC-Erythrocyte\6hr-002-DIC.tif" can be found in the "sampledata.zip" file.

Figure 6.b Edge feature map.

Figure 6.c Ilastik training on the edge map.

This difficulty probably originates from the observation that intensities inside the cells are basically indistinguishable from those of the background. Although edge can be detected, whether the edge forms a closed loop can only be determined if the scale of the vision is larger than the cell size. Among the predefined features, none in $\{f_i\}$ alone or in combine is able to distinguish the inside from the outside, and the range of the feature filters are also too short to provide a global sense of a "ring" pattern. The combination of both shortcomings likely leads to the failure of Ilastik in this application.

If we resort to the U-Net deep learning solution as published by Falk et al. in Nature Methods, this image can be easily segmented with as few as 50 refinement steps. We first need to draw examples of erythrocytes (Figure 7.a), ideally in at least two separate images, one for training use and another for validation purpose to avoid over-fitting. The result of a successful training produces a nice probability prediction (Figure 7.b-c), where all erythrocytes can been correctly recognized.

Figure 7.a Deep-learning Training requires cell masks to be provided. Here mask outlined are provided by Falk et al.

Figure 7.b Probability map predicted by U-Net.

Figure 7.c Probability surface plot, where all erythrocytes are nicely recognized, with the internal cell region shows probability of near 1.0.

The fundamental difference introduced in the U-Net approach is that it constructs the probability function using:

$$\begin{equation} \mathcal{P}(x) = \mathcal{P}(\mathcal{G}_1(x), \mathcal{G}_2(x), \cdots, \mathcal{G}_i(x), \cdots). \end{equation}$$

Unlike $f$ in traditional machine learning, $\mathcal{G}$ here are not predefined feature transformations. These are custom features constructed through the deep learning training. In fact $\mathcal{G}$ is drastically different from $f$, where $f$ is mostly relies on a transformation filter matrix depicted at pixel level, but $\mathcal{G}$ is a series of hierarchically nested transformations, which enables it to capture "concepts" (see the last paragraph in "the rationale for CNN" at a previous blog). In this example, U-Net does not construct $\mathcal{G}$ from scratch, it refines them based on features learned through many previous cell images from other projects, therefore, it is likely there was already a ring concept, objects forming a circular shape, in the U-Net before the refinement process. With the ring feature, it would be rather effortless for the U-Net to fit the desirable probability output. For readers never saw a cell image before, they would have no trouble recognizing erythrocytes in Figure 6.a, why? This is because they have seen plenty of circular and ring objects elsewhere, so such features are already exist in their vision system. This is what happens to a pre-trained U-Net model.

The power of deep learning network means it can construct very sensitive features, therefore, its feature map may no longer be as smooth as the $\{f_i\}$ we saw in Ilastik. Overfitting is a real concern now. In order to prevent overfitting, we would need to provide much more training labels for U-Net compared to what is required for Ilastik. Here, it can be tedious to hand draw those dozens of cell masks. Fortunately, in this relatively easy example, one or two dozen example cell masks seem to be sufficient. But for more tricky cases, it could requires many more training data, that could easily become a new bottleneck.

Discussion

From the above examples, we can understand that it is very challenging to rely on traditional computer vision skills to segment biological images due to the inherent variations in the biological system. In many circumstances, the cellular system is heterogeneous, meaning there are multiple cell populations of distinct morphology, and it would be too challenging to come up with a set of successful segmentation rules manually to account for these heterogeneity. Classical machine learning approach can be a speedy alternative, where it relies on a set of predefined features engineered based on decades of research work. The result can be of high quality, if the phenotype of biological interest can be recapitulated based on such features. However, when the phenotype is beyond the depiction of existing features, we would need to resort to deep learning systems. With sufficient amount of training examples, deep learning architecture theoretically allows models such as U-Net to construct custom features particularly suitable for the problem at hand.

Since the capacity of learning in U-Net supersedes that in classical machine learning, would it always make sense to use U-Net instead of Ilastik? As mentioned above, U-Net is much more demanding on its training data than Ilastik, not only it requires more labels, but it requires complete cell masks compared to strokes. We apply U-Net to the simple nuclei segmentation example in Figure 1 and obtain the results in Figure 8.

Figure 8. With a few dozen nuclei masks provided, U-Net nicely predicted the probability map for individual nucleus, where there is no clumping.

Comparison of the resultant probability surfaces between the random forest (Figure 9. left) and U-Net (Figure 9. right) shows U-Net indeed provides a smoother prediction, where the heights of the peaks are more uniform. Based on this, we would expect U-Net indeed provides better segmentation results than Ilastik. However, depending on the biological questions to be asked, the Ilastik results could well be good enough. The marginal improvement provided by U-Net may not be cost-effective compared to the extra amount of labor required to produce the training masks.

Figure 9. Probability surface predicted by ilastik (left) and U-Net (right).

Experienced image analysts may argue the edge map in Figure 6.b could have been segmented using computer vision techniques, because those circular edges looks like giant water lilies (Figure 10), therefore watershed technique could be used to segment individual erythrocytes. The result based on this watershed idea is shown in Figure 11.

Figure 10. Giant water lilies, where centers are as low as the water surface and the edges form a cylindrical dam. If we image the water surface rises from both the inside and the outside, water will be trapped within each lily pad and form a enclosed compartment (watershed algorithm).
(Image source: https://images-na.ssl-images-amazon.com/images/I/71kVRwZobwL._SL1024_.jpg)

Figure 11. Result of watershed based on the water lily idea.

The idea only gets about half of the erythrocyte correctly. It failed because the edge image is quite noisy and the edges are leaky. One might be able to get it to work with lots of custom rules eventually, but the efforts spent probably exceeds the time required for U-Net training.
What if all attempts fail and the image is still not segmentable despite all our efforts? We would need to go back to the biological questions we are asking. Maybe for the erythrocyte image, what we really need is cell counting. We can take the edge image and do a simply threshold segmentation, then remove small particles and produce a mask image like Figure 12. The area of the white pixels can act as a proxy to the cell count and allow us to sort images.

Figure 12. Mask created by thresholding Figure 6.b and then remove particles with areas less then 400 square pixels.

Conclusion

All together, it seems a sensible approach is first to determine if an image is segmentable by eye. If not, we need to discuss with biologists to figure out how to extract image-level signals. If yes, we should always first apply Ilastik for segmentation, as its barrier of training is so low. If the quality is not satisfactory, we should then consider U-Net or other deep learning models. Either Ilastik or U-Net produces a probability map as output, which can then be further segmented using traditional computer vision tools, such as ImageJ or CellProfiler (excels at batch processing), where one can apply simple thresholding methods and watershed techniques to turn the pixel-level level probabilities into individual cell object masks. Once we have individual masks, then measurements can be carried out and results enable population analyses to characterize perturbagen effects at cell level.

Monday, August 13, 2018

LSTM Networks for Predicting Subcellular Localization of Proteins

Bioinformatics Application of Deep Learning Technology

Resource

We will discuss the following paper: Sonderby et al., Convolutional LSTM Networks for Subcellular Localization of Proteins. The first draft is available here, the published version of the paper with a much more sophisticated architecture and more detailed technical descriptions is available here. As the two versions are rather different, we discuss both.

All figures shown here should be credited to the two papers, unless the source is explicitly provided.

Introduction

Proteins are products of a factory called ribosome, each protein has its own blueprint called message RNA (rooted from its gene structure). They are bikes, cars, ships and airplanes in biological systems, therefore they need to be delivered into different biological compartments in order to carry out their functions. A ship cannot sail on land, it needs to be moved onto the sea, and cells have a complex UPS delivery system, where it is able to read the "shipping barcode" carried on the protein sequences and ship the proteins to their destinations. E.g., some proteins will stay within the cell, but in different organelles, some may sit on cell membrane, others may be shipped outside into extracellular space and serve as carriers or messengers. For a nice introduction on how proteins are delivered to different destinations, please read this. To predict the subcellular location of a protein is a very interesting topic in bioinformatics, as it can provide clues to the function of a protein.

Existing bioinformatics prediction tools often are not ab initio, which means they tend to use information other than the sequence of the protein. E.g., the MultiLoc program has a MotifSearch component, which basically used knowledge of experimentally confirmed signal motifs; many "cheat" by using homology information [1] (it is not a true prediction, but rather a mix of prediction and annotation). These tools will perform poorly, when the protein to be predicted has no close homologues in the training data set. In addition, existing bioinformatic tools relying on traditional machine learning techniques, such as SVM, require a fixed length input, therefore, they are conceptually not the most suitable choice for the variable protein lengths and are very difficult to be further improved. Sonderby et al. was able to use Deep Learning tools to build a DeepLoc system, that improved the prediction accuracy from 0.767 (MultiLoc) to 0.902 (Table 1 in [2]) in their draft version, when only sequence data was used [2]. In the later publication, they were able to further improve their solution to surpass other tools on a more realistic data set, 0.7797 for DeepLoc versus 0.5592 for MultiLoc2, when homology factor is largely eliminated [3]. Let us see how they did it.

Architecture

Encoding by CNN

We first need to represent an input protein sequence. There are 20 types of amino acids, therefore, a protein sequence of $n$ amino acids is naturally a 20-element one-hot-encoded vector $20 \times n$. Amino acids have different physicochemical properties, some are charge neural, some are positively or negatively charged, some are polar or hydrophobic, etc. The one-hot encoding does not take advantage of this knowledge. A popular approach is to use the BLOSUM matrix element as the scores for each amino acid. In the first draft, the authors use a combination of four different representations: one-hot, BLOSUM80 matrix, HSDM matrix, and sequence profiles (each protein is blasted against protein database to fish out its homologs; a profile is then constructed based on the multi-sequence alignment.). If you are not familiar with bioinformatics, just consider BLOSUM matrix only as a similarity matrix, where chemically-similar residues have a higher positive score and chemically-opposite residues have a negative score. In the final publication, the authors actually only chose one of the four representation; let us pretend the BLOSUM matrix was the one chosen in Figure 1.

First, the BLOSUM representation is transformed to be 1000 in length (only 10% of proteins are longer than this size). If a protein is shorter, it is padded by NULL amino acids in the middle, otherwise, the extra residues in the middle are truncated. This is because it is known that the signal carrying the location information is often located at the two ends (N-terminal and C-terminal, i.e., left and right ends) of a protein. This makes biological sense, as ends of proteins tend to be loose and are available to be in contact with other proteins to allow the signal to be utilized, while the middle part can be wrapped and embedded within a 3D structure and inaccessible.

Figure 1. A CNN for encoding an input protein sequence. [3]

The $1000 \times 20$ input matrix is then fed to six independent CNN modules (Figure 1). Take the left most CNN module as an example, it has 20 features, each uses a convolution filter spanning 21 positions (as the input has 20 channels, one per an amino acid channel), the feature map dimension is $21 \times 20 \times 20$. The right most module has no convolution, simply passing the original input. Therefore, this CNN system looks for motifs of length 1, 3, 5, 9, 15 and 21. Why do we want to encode the input using motifs? We think the delivery system relies on some general tags (motifs). All proteins to be shipped extracelluar may share a few motif patterns, so when CNN is trained, such motif patterns are over-represented, as they appear more frequently in the protein sequence. Then CNN is able to represent a protein sequence by saying there is a motif A here, and there is a motif B there; such representation certainly makes downstream prediction way easier. All the $20 \times 6 = 120$ feature map outputs are combined into a $1000 \times 120$ motif representation for the input sequence. Another layer of convolution is introduce to carry out one more covolution using filter size $3 \times 120$ and obtain 128 feature maps of size 1000 each. So the end feature matrix is of dimension $1000 \times 128$.

Bidirectional RNN

The RNN is made of a LSTM cell of 256 hidden elements. If we pipe the $1000 \times 128$ sequence through the LSTM unit from N-to-C direction, the N-terminal signal may still be hard to retain, when the scanning reaches the C-terminal end. Despite LSTM has longer memory, at least it is not so easy to optimize during the training process. In fact, most of the location barcodes for a protein are near either the N-terminal (left end) or the C-terminal (right end). Therefore, another popular technique to retain more signals from both terminals is to use two LSTM cells and scan the protein in both directions: N-to-C and C-to-N, respectively. When the LSTM reads in one input base, it produces an output $\mathbf{h}$, therefore, the two directional LSTM have two outputs at each time step, both are combined into a final output vector, i.e., $\mathbf{h}_{N2C,t}$ is concatenated with $\mathbf{h}_{C2N, t}$, and we have hidden state vector $\mathbf{h}_t$ of size 512 elements (shown as the blue and red bars in Figure 2). As there are 1000 time steps, the feature now is $1000 \times 512$.

Figure 2. Bidirectional LSTM network is used to generate 1000 vectors of 512 hidden states. [2]

Attention Mechanism

Not all of the 1000 hidden states are equally informative. Thanks to the dual scan, the hidden state for each time step captures the information for the complete sequence, therefore, we can certainly use $\mathbf{h}_1$ and $\mathbf{h}_{1000}$ as the new input and hook up a FNN to classify proteins into different classes, representing different subcellular locations. This should work, however, may not be optimal (as shown in Figure 6 later). The footprint of the destination motif is the strongest, when it is just seen, therefore if the location barcode of a protein locates at the 100th amino acid (N-term, left end), the hidden states around that time step would be the best for prediction. So the idea is somehow we aim to come up a 1000-element weighting vector, where each of the 1000 states are weighted differently. Ideally, the states closer to the useful tags have larger weights, we can then weighted average all 1000 hidden states into a final state vector, which should then lead to better predictions. To figure out which time steps the latter classification system should pay attention to, the authors used an attention mechanism. The attention mechanism described in their final publication [3] is rather complicated and hard to understand, so I choose here to discuss the mechanism presented in their draft version [2].

The attention component has a weight matrix $\mathbf{W}_a$ and an attention output vector $\mathbf{v}_a$. We compute an attention scalar for time point $t$ by:

$$\begin{equation}a_t = \rm{tanh}(\mathbf{h}_t \mathbf{W}_a)\cdot \mathbf{v}_a^\intercal. \end{equation}$$

The way I interpret Equation 1 is there are some motifs that are very informative for location prediction, they are encoded by some elements of the hidden states. Say we have $k$ important motifs, each is associated with a few elements in the hidden states, therefore $\mathbf{W}_a$ is a matrix of $k \times 512$ that extracts the presence data of each of the $k$ motifs from within the hidden state. The output of $\rm{tanh}(\cdot)$ thus is a "probability vector" indicating whether $\mathbf{h}_t$ contains those $k$ motifs. In the second step, considering not all such motifs are of equal importance, therefore, $\mathbf{a}_v$ is a vector coding the importance of each of the $k$ motifs. The resultant scalar $a_t$ is therefore the logits measuring the importance of each hidden state. The weight for each hidden state can then be obtained by softmax:

$$\begin{equation} a_t = \frac{exp(a_t)}{\sum_{t^\prime=1}^T \; exp(a_{t^\prime})}. \end{equation}$$

The final context vector representing the protein is a weighted sum:

$$\begin{equation} \mathbf{c} = \sum_{t^=1}^T \; \mathbf{h}_t a_t. \end{equation}$$

This then serves as the input to an FNN to classify proteins into 11 classes. The authors used a single layer FNN in their draft version; they also introduce a hierarchical tree Bayesian style predictor we will describe later in the discussion.

Results

Each protein is encoded by its context vector $\mathbf{c}$, Figure 3 is the t-SNE 2-dimensional visualization of all proteins predicted using two separate data sets. It is clear that the encoding (basically the transformed feature space) is very good at separating proteins of different classes, therefore, the above system is quite capable of extracting the location-rich features, and the downstream classification system does not have to be too sophisticated. $\mathbf{c}$ is very interesting, they encode the variable length proteins into a fixed length vector. Unfortunately, they are trained based on subcelluar location data, they cannot be used as general-purpose protein vectors.

Figure 3. t-SNE representation of the context vector, colored by different protein classes [3]. For the data set on the right, there is quite some degree of homology between training and test sets, therefore, data points are better separated. For the data set the authors constructed on the left, there is less homology between training and test sets, so it is a more realistic reflection of the prediction quality on new proteins, which therefore is expected to be less well separated on t-SNE map.

Figure 4 plots the attention vector, where we expect the locations catching the attention tend to be where location tags resides. It is clear that extracellular proteins, plastid proteins, mitochondrion proteins have their signals at the very end of the N-terminal. Transmembrane preteins have their signals scatter across the whole range, where there are dark fragments probably indicating transmembrane regions (where those are hydrophobic amino acides, which are probably extracted by the CNN). These transmembrane tags in the middle part of the protein are not necessary our shipping barcodes, they are probably just signature motifs for transmembrane protein, therefore, highly correlated with the true shipping barcode. Such patterns are useful for prediction, but are not necessarily the reason for cells to sort proteins. For most proteins, signals are mostly distributed around the N-terminal, and sometimes towards the C-terminal. The attention data makes sense.

Figure 4. Attention vectors for proteins of different classes [3].

Figure 5 shows the overall performance data of the describe system - DeepLoc, compared with other published methods using the DeepLoc data set. The DeepLoc approach is noticeable better.

Figure 5. Performance comparison of DeepLoc (this study) to other methods. [3]