Wednesday, February 28, 2018

Motif Discovery for DNA- and RNA-binding Proteins by Deep Learning

Bioinformatics Application of Deep Learning Technology


Resource


We will discuss the following paper: Alipanahi et al., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.  The paper is available here, the supplementary method is available here (subscription required).

All figures shown here should be credited to the paper, unless the source is explicitly provided.

Introduction


Genes are encoded by DNA sequences.  Whether a gene is expressed, and how much, can be regulated by proteins called transcription factors (TFs).  A TF protein can bind to certain DNA motifs and then recruit other factors to initiate the transcription of a stretch of DNA into an mRNA sequence.  In eukaryotic cells, mRNA consists of multiple exons separated by introns.  Introns are spliced out and exons are concatenated into a mature mRNA, which is then translated into a protein.  RNA-binding proteins can regulate the splicing process and lead to the inclusion or exclusion of certain exons, thereby producing different splicing isoforms and, eventually, different protein products with different functions.  Mutations in the genomic sequence can perturb the binding of these proteins, leading to undesirable under- or over-transcription of genes, or to undesirable splicing forms, which can induce disease.  Therefore, it is of interest to understand what nucleotide patterns (motifs) a DNA/RNA-binding protein recognizes and how mutations impact such molecular recognition.

Let us describe two popular experimental approaches for identifying sequences enriched for protein binding sites.  First, protein binding microarrays (PBM).  The microarray contains many different short nucleotide sequences, called probes, at different array locations.  For example, an array can contain probes representing many variants of 10-nucleotide patterns (10-mers).  Even for a given 10-mer, the array can include many probes where the same 10-mer is embedded within different sequence contexts of the longer probe sequence.  A TF protein is hybridized over the array; it sticks to the locations where binding occurs and produces fluorescent readouts.  PBM therefore yields probe sequences together with intensity readouts indicating the strength of binding to a specific TF.  A deep learning (DL) system aims to take these probe sequences as input $X$ and predict the binding score $y$ (i.e., intensity).  The natural choice of loss function is the least-squares error, as this is a regression task.  Second, ChIP-seq technology.  A TF is cross-linked to genomic material, the DNA is chopped up, and all DNA fragments bound to the TF are extracted through the chromatin immunoprecipitation (ChIP) process; the bound DNA/RNA fragments are then NGS-sequenced to produce the readout (NGS stands for next-generation sequencing).  ChIP-seq data only provides positive sequence fragments, so one needs to add negative sequences (either taken from the genome background, or, as the authors did here, generated by permuting positive sequences).  A DL system then aims to classify an input sequence $X$ into a binary $y$ of 1/0, i.e., bind or not bind.  Since this is a classification problem, the cross-entropy loss function is a natural choice.  The paper also studied datasets generated using HT-SELEX technology, where the data format is similar to that of ChIP-seq.
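To make the two loss choices concrete, here is a minimal sketch (my own illustration, not code from the paper; the toy score arrays are invented):

```python
import numpy as np

# PBM is a regression task: least-squares (mean-squared) error between
# predicted binding scores and measured probe intensities.
def mse_loss(y_pred, y_true):
    return float(np.mean((y_pred - y_true) ** 2))

# ChIP-seq is a classification task: binary cross-entropy between
# predicted binding probabilities and the 1/0 bind / no-bind labels.
def bce_loss(p_pred, y_true, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

pbm_pred, pbm_true = np.array([0.8, 0.2, 0.5]), np.array([1.0, 0.0, 0.5])
chip_pred, chip_true = np.array([0.9, 0.1]), np.array([1.0, 0.0])

print(mse_loss(pbm_pred, pbm_true))    # small residual regression error
print(bce_loss(chip_pred, chip_true))  # low loss for confident correct labels
```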

The motif discovery problem has been a well-studied bioinformatics problem for the past twenty years (I also contributed an algorithm called GEMS).  The positive sequences are collected, and algorithms look for patterns that are over-represented among them compared to a pool of negative background sequences.  Alipanahi et al. used a DL approach, more specifically a CNN + FNN architecture, that achieved better performance than existing tools.  It is a very clean application of CNN to bioinformatics.


Results


Architecture

In the diagram below, the system processes five input sequences in parallel.  This is rather a technical detail to make full use of the GPU resource; for our purpose, we only need to focus on one sequence.  The sequence is first cast into two sequences, the original one and its reverse complement (e.g., ATGG is cast into CCAT as well).  This is because a TF could theoretically recognize either one of the DNA strands, or both in rare cases (Figure 4.e in the paper is such a rare case).  A motif is a sequence pattern represented by a 1-D CNN convolution filter.  More precisely, the input sequence is actually a 4-channel 1-D vector, where the 4 channels encode the bases A, C, G, T, and the vector is a 1/0 binary vector one-hot encoding the input sequence.  For example, a short sequence "ATGG" can be represented as:


$$S_{ATGG} = \begin{bmatrix} A & C & G & T\\
0.25 & 0.25 & 0.25 & 0.25\\
0.25 & 0.25 & 0.25 & 0.25\\
1 & 0 & 0 & 0\\
0 & 0 & 0 & 1\\
0 & 0 & 1 & 0\\
0 & 0 & 1 & 0\\
0.25 & 0.25 & 0.25 & 0.25\\
0.25 & 0.25 & 0.25 & 0.25 \end{bmatrix}.$$

Those rows containing 0.25 are paddings (representing "any" nucleotide) that extend the original sequence, so that the convolution will not miss motifs partially overlapping the ends.  If the CNN is designed to capture motifs of length $m$ (it will capture motifs of length $\lt m$ as well), there will be $m-1$ extra 0.25 rows on each end.  A motif is represented as a convolution filter $M$, known in bioinformatics as a position weight matrix (PWM); it basically specifies the weight of each nucleotide at each location.  The output of the convolution layer is:

$$\begin{equation} \begin{aligned}
 X_{i,k} &= \sum_{j=1}^{m} \sum_{l=1}^{4} S_{i+j,l}M_{k,j,l}, \\
Y_{i,k} &= \max(0, X_{i,k}-b_k),
\end{aligned} \end{equation}$$

where $k$ is the index of a particular motif (a motif here is called a feature in CNN terms), $i$ is a location on the input sequence, and $m$ is the length of the motif or the convolution window; the convolution is therefore taken over all positions $j$ and over all four nucleotide channels $l$.  The sum is then passed through a ReLU function, as shown in the second step (weights can be negative, so neurons with a convolution sum less than $b_k$ will not be activated).  The result is a feature map representing where a given motif is detected along the input.  A pooling layer follows.  The authors used a max-pooling layer for DNA-binding proteins and both max- and average-pooling for RNA-binding proteins; in the latter case, $k$ features result in $2k$ pooling maps.  The pooled output then serves as the input to a fully-connected neural network (FNN) to generate target scores, one per input sequence.
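As a concrete sketch (my own code, not the authors'; the function names and the toy "TG" filter are invented), the padded one-hot encoding and the convolution-ReLU-pooling stage above can be written in a few lines of NumPy:

```python
import numpy as np

BASES = "ACGT"  # channel order matching the matrix above

def one_hot_pad(seq, m):
    """One-hot encode a DNA sequence, with m-1 rows of 0.25 padding on
    each end so a length-m filter can overlap the sequence boundaries."""
    pad = np.full((m - 1, 4), 0.25)
    body = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        body[i, BASES.index(base)] = 1.0
    return np.vstack([pad, body, pad])

def conv_relu_pool(S, M, b):
    """Convolution stage: S is an (L, 4) encoded sequence, M is a
    (k, m, 4) stack of motif filters (PWM-like), b is a (k,) threshold
    vector.  Returns the (k,) max-pooled motif activations."""
    k, m, _ = M.shape
    n = S.shape[0] - m + 1
    X = np.zeros((n, k))
    for i in range(n):
        for f in range(k):
            # sum over motif positions j and nucleotide channels l
            X[i, f] = np.sum(S[i:i + m] * M[f])
    Y = np.maximum(0.0, X - b)   # ReLU with per-motif threshold b_k
    return Y.max(axis=0)         # max-pooling over sequence positions

# A single hypothetical length-2 filter rewarding "TG":
M = np.array([[[0., 0., 0., 1.],    # position 1: T
               [0., 0., 1., 0.]]])  # position 2: G
S = one_hot_pad("ATGG", m=2)
print(conv_relu_pool(S, M, np.array([0.5])))  # strongest activation at the TG site
```

In the real system the pooled activations feed into the FNN; here we just print them.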

This architecture is very simple, as it only contains one CNN convolution layer and one pooling layer.  A CNN is used here because the motif can be located at arbitrary positions in the input sequence, just as a dog can appear at any location in an image; a CNN, being translation-invariant, is a natural option for this problem.




Performance



The CNN is then trained with a set of experimental data (either PBM, ChIP-seq, or HT-SELEX), and its performance was shown to be consistently superior to established bioinformatics motif discovery tools.  Without getting into the details of the biology applications, the figure below shows their CNN solution, named DeepBind, was able to outperform other methods on PBM, ChIP-seq, and SELEX datasets according to various performance metrics (AUC or Pearson correlation scores).



Why Is CNN Better?


Unfortunately, the paper did not discuss why DeepBind performed better.  Although it shows many examples of how the identified motifs make biological sense, that is not something unique, as traditional motif discovery tools can produce those results as well (the authors did not say traditional tools failed to do so).  It was only mentioned at the beginning of the paper:

"The sequence specificities of a protein are most commonly characterized using position weight matrices (PWMs), which are easy to interpret and can be scanned over a genomic sequence to detect potential binding sites. However, growing evidence indicates that sequence specificities can be more accurately captured by more complex techniques".

This hints that the motivation of their study is to describe motifs with a language more sophisticated than PWMs.  However, motifs in a CNN are described by filters, which are basically the same as PWMs; what then is the strength of this approach?

I think what probably happened is that DeepBind contains $k$ motif sub-elements.  In previous methods, only one PWM is used to explain the biological measurements (for a specific transcription factor or RNA-binding protein).  In DeepBind, however, multiple elements are combined through an FNN for activity prediction.  Although each motif element is equivalent to a PWM, their combination forms a non-linear motif descriptor that cannot be captured by a single PWM.  Hypothetically, imagine there are two motif elements in a real system: AGGGT and TGGGA.  That is, "GGG" sits in the middle; when "A" appears on the left, "T" is favored on the right, and similarly a left "T" pairs with a right "A".  However, AGGGA and TGGGT are not valid patterns.  If we only allow this to be represented by one PWM, the result is an "averaged" motif that looks like "A/T-GGG-A/T", which would encode wrong patterns such as AGGGA and TGGGT.  We therefore hypothesize that the CNN system provides a way to capture motifs in a non-linear way, more powerful than a PWM.
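This hypothetical can be checked numerically.  In the sketch below (my own toy illustration; the patterns and the weight-1 scoring are invented), one averaged PWM scores the valid pattern AGGGT and the invalid pattern AGGGA identically, while two filters combined non-linearly separate them:

```python
BASES = "ACGT"

def pwm_score(seq, pwm):
    """Score a sequence by summing per-position nucleotide weights."""
    return sum(pwm[i][BASES.index(b)] for i, b in enumerate(seq))

# Two hypothetical motif elements, weight 1 on each intended base.
def element(pattern):
    return [[1.0 if b == p else 0.0 for b in BASES] for p in pattern]

pwm_a, pwm_t = element("AGGGT"), element("TGGGA")

# A single "averaged" PWM (A/T-GGG-A/T) cannot tell valid from invalid:
pwm_avg = [[(x + y) / 2 for x, y in zip(r1, r2)]
           for r1, r2 in zip(pwm_a, pwm_t)]
print(pwm_score("AGGGT", pwm_avg), pwm_score("AGGGA", pwm_avg))  # equal scores

# Two filters combined non-linearly (max, standing in for ReLU + pooling
# + FNN) give the valid pattern a strictly higher score:
def two_filter_score(seq):
    return max(pwm_score(seq, pwm_a), pwm_score(seq, pwm_t))

print(two_filter_score("AGGGT"), two_filter_score("AGGGA"))  # 5.0 vs 4.0
```

The `max` here plays the role DeepBind's non-linear combination could play: whichever sub-element matches better dominates the score, so the left-right correlation survives.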


Another possible reason is that an NN allows a motif to consist of widely spaced sub-elements, which are very hard to discover with traditional algorithms due to the vast search space.  Widely spaced motifs are particularly important for RNA-binding proteins, as they bind to RNA structures in a non-trivial manner.  As shown below, unlike DNA-binding proteins, RNA-binding proteins may not recognize only one binding site, but may actually recognize multiple binding sites on the surface of the RNA.  Unlike double-stranded DNA fragments (the last structure in the figure below), single-stranded RNA can form complex secondary structures; oftentimes, the RNA structure contributes to the binding no less than the specific nucleotide bases.  The authors commented: "RNA recognition motif, the most abundant RNA binding domain, is highly flexible, and usually the assembly of multiple domains are needed for proper binding. Third, “indirect readout” in which the structure of nucleotides guide the binding process, is more prominent in RBPs; Gupta and Gribskov analyzed 211 protein-RNA structural complexes and reported that nearly three-quarters of protein-RNA interactions involve the RNA backbone, not just the bases."  An RNA-binding protein could recognize multiple binding sites on the surface in contact; this is therefore better modeled by multiple motif features in DeepBind, combined through the FNN to predict binding affinity.  Traditional tools identify motifs as one continuous binding element, without synergistic combinations of underlying elements.  I suspect this might be the main reason DeepBind performs well, at least for the more difficult RNA-binding proteins.
Protein-binding interfaces on RNA (DNA in the last example) (source).  Purple: binding sites, yellow: non-binding sites.

We only see one convolution layer here, instead of multiple repeats of convolution plus pooling layers.  Hierarchical motif patterns may not be necessary for motif discovery problems: although a short motif element, say "ATC", might appear in multiple longer motifs, the frequencies of A, T, and C in the longer motifs might be very different from their frequencies in the short element.  So, unlike eyes, noses, and mouths making up faces, short motif PWMs may not be effectively concatenated to form larger PWMs.


How to Visualize CNN?


A PWM generated by traditional bioinformatics tools can be easily visualized as a motif logo, where the height of a character encodes the frequency or information content of a base in the PWM.  The CNN here contains multiple motifs and combines them in a non-linear way; therefore, it is conceptually incorrect to visualize each convolution filter alone.  How to visualize DeepBind is a rather challenging topic, which, in my view, the authors did not address.  Instead they adopted a rather naive approach -- basically visualizing individual features.  For each sequence in the test set, if the sequence produces a positive output signal for a given motif feature map, the corresponding input sequence fragment (the one giving the maximum activation) is retained.  All such CNN-activating fragments are then aligned, and base counts are compiled to construct a PWM, from which a motif logo is generated.  Some example logos are shown in the figure below, which shows the DeepBind model indeed reproduced many known motifs.  As we mentioned, the downside of this approach is that it does not visualize the non-linearity introduced by the FNN component as we hypothesized, so why DeepBind performs better is not really answered by this simplified representation.
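The fragments-to-PWM step amounts to something like the following sketch (my own reconstruction, not the authors' code; the example fragments are invented):

```python
from collections import Counter

BASES = "ACGT"

def fragments_to_pwm(fragments):
    """Compile a PWM from equal-length sequence fragments, e.g. the
    maximally activating windows of one convolution filter.  Each row
    holds the A/C/G/T frequencies at one motif position."""
    length = len(fragments[0])
    pwm = []
    for i in range(length):
        counts = Counter(frag[i] for frag in fragments)
        total = sum(counts.values())
        pwm.append([counts.get(b, 0) / total for b in BASES])
    return pwm

# Hypothetical max-activation fragments collected from test sequences:
frags = ["ATGG", "ATGG", "ACGG", "ATGG"]
pwm = fragments_to_pwm(frags)
print(pwm[0])  # position 1 is always A
print(pwm[1])  # position 2 is mostly T, sometimes C
```

A motif logo is then just this frequency matrix drawn with letter heights.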

Examples of known motifs discovered by DeepBind.

Mutation Map


An alternative visualization that is conceptually more powerful is called a mutation map.  Given an input sequence $s$, DeepBind predicts a binding score $p(s)$.  We can then mutate the nucleotide at each position into a different base, resulting in a new sequence $\hat{s}$, and predict a new binding score $p(\hat{s})$.  The importance of a particular single-base mutation can therefore be quantified by:

$$\begin{equation} \Delta S_{ij} = p(\hat{s})-p(s). \end{equation}$$

The nice thing is that the mutation score is based on the whole CNN plus FNN system and therefore includes all the non-linear effects.  However, I guess the above equation did not work well as it is, as the predicted scores do not necessarily represent changes in units of binding energy; the scores and binding energy might be related through some non-linear function.  The authors empirically used the following formula instead, where the effect of a mutation is magnified by the binding strength at the mutated position:

$$\begin{equation} \Delta S_{ij} = (p(\hat{s})-p(s)) \cdot \max(0, p(\hat{s}), p(s)). \end{equation}$$
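This scaled formula is straightforward to compute.  Below is a sketch (my own; `toy_predict` is an invented stand-in for a trained model, not DeepBind) that fills in the $\Delta S_{ij}$ matrix:

```python
import numpy as np

BASES = "ACGT"

def mutation_map(seq, predict):
    """Delta-S matrix: the effect of every single-base substitution,
    scaled by the stronger of the two binding scores.  `predict` maps
    a sequence string to a binding score."""
    p0 = predict(seq)
    delta = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            if b == seq[i]:
                continue  # no mutation, entry stays 0
            s_hat = seq[:i] + b + seq[i + 1:]
            p1 = predict(s_hat)
            delta[i, j] = (p1 - p0) * max(0.0, p1, p0)
    return delta

# Toy stand-in for a trained model: score = number of "TG" occurrences.
toy_predict = lambda s: float(s.count("TG"))
dm = mutation_map("ATGG", toy_predict)
print(dm[1])  # mutating the T destroys the TG site: negative entries
```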

Two examples of mutation maps are shown below.  On the left is the promoter region of the LDL-R gene; LDL-R is the LDL receptor gene, and LDL is known as the "bad cholesterol".  The wild-type genome contains base C in the middle of the promoter, which can be recognized by the SP1 protein; SP1 then drives the expression of LDL-R, which helps recognize excess levels of LDL and regulates its removal from the blood.  In patients, mutating this base to G seriously decreases the binding score (see the red dot at the bottom), as the mutant version is no longer recognized by SP1.  This leads to the deleterious condition known as hypercholesterolemia (a high level of cholesterol) due to the low expression of LDL-R.  On the right, a mutation in the globin gene, from A to G, creates a new binding site TGATAA attracting a TF protein called GATA1.  The binding of GATA1 blocks the genomic region originally reserved to serve as the promoter of globin.  As a result, globin is no longer expressed at a sufficient level, which leads to a disease called $\alpha$-thalassemia (an insufficient amount of hemoglobin to carry oxygen in the blood).



How Is DeepBind Optimized?


The DeepBind system contains many hyperparameters, including the motif length $m$, the number of motifs $k$, learning rate, momentum, batch size, weight decay, dropout rate, etc.  The training process first randomly samples 30 sets of such hyperparameters (see the figure below), using 3-fold cross-validation to evaluate their performance.  The best-performing set of hyperparameters is then used to repeat the DeepBind training process on the whole training dataset three times, to average out statistical fluctuations during the optimization process.  The best-performing DeepBind model is then chosen as the final model for the given protein, used to predict the test dataset, and produces the performance metrics.  One DeepBind model was trained per protein per experimental data type.
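This calibration scheme can be sketched as follows (my own reconstruction; the search ranges and the `train_fn` interface are assumptions, not the authors' actual settings):

```python
import random

def random_search(train_fn, data, n_trials=30, n_folds=3, n_final=3, seed=0):
    """Sample hyperparameter sets, score each by k-fold cross-validation,
    then retrain the best set several times on all data and keep the best
    run.  `train_fn(params, train, valid)` must return (model, score),
    higher scores being better."""
    rng = random.Random(seed)
    folds = [data[i::n_folds] for i in range(n_folds)]

    def sample_params():
        return {  # hypothetical search ranges, for illustration only
            "motif_len": rng.choice([12, 16, 24]),
            "num_motifs": rng.choice([16]),
            "learning_rate": 10 ** rng.uniform(-4, -2),
        }

    best_params, best_cv = None, float("-inf")
    for _ in range(n_trials):
        params = sample_params()
        scores = []
        for f in range(n_folds):
            valid = folds[f]
            train = [x for g in range(n_folds) if g != f for x in folds[g]]
            _, score = train_fn(params, train, valid)
            scores.append(score)
        cv = sum(scores) / n_folds
        if cv > best_cv:
            best_params, best_cv = params, cv

    # Retrain the winning hyperparameters n_final times on the full data
    # to smooth out optimization noise; keep the best-scoring model.
    runs = [train_fn(best_params, data, data) for _ in range(n_final)]
    return max(runs, key=lambda r: r[1])[0]
```

Here `train_fn` would wrap the actual model training; any scoring convention works as long as higher is better.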




Conclusion


CNN is a DL architecture suitable for discovering translation-invariant patterns in inputs.  The DeepBind system uses a CNN to capture protein-binding motifs on DNA/RNA, where the NN architecture enables it to capture some non-linear effects and/or aggregation effects of multiple underlying motif patterns, as well as to aggregate motifs that are widely spaced.  The study showed DeepBind surpassed traditional bioinformatics motif discovery tools and was able to explain a number of disease-causing mutations.  Our visual estimation shows the improvement over traditional approaches is about a few percentage points, similar to what we generally see from DL applications in bioinformatics.  However, the architecture of DeepBind is a lot simpler than the many sophisticated motif discovery algorithms developed over the past two decades.  The authors did not offer explanations for why DeepBind performed better.





