Email updates

Keep up to date with the latest news and content from Algorithms for Molecular Biology and BioMed Central.

Open Access Highly Accessed Research

LocARNAscan: Incorporating thermodynamic stability in sequence and structure-based RNA homology search

Sebastian Will12, Michael F Siebauer3, Steffen Heyne2, Jan Engelhardt1, Peter F Stadler15678*, Kristin Reiche145* and Rolf Backofen27*

Author Affiliations

1 Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16 -18, Leipzig D-04107, Germany

2 Bioinformatics Group, Department of Computer Science, Albert-Ludwigs-Universität Freiburg, Georges-Köhler-Allee 106, Freiburg D-79110, Germany

3 Genetics Group, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, Leipzig D-04104, Germany

4 Young Investigators Group Bioinformatics and Transcriptomics, Department Proteomics Helmholtz Centre for Environmental Research – UFZ, Permoserstraße 15, Leipzig D-04318, Germany

5 RNomics Group, Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, Leipzig D-04103, Germany

6 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig D-04103, Germany

7 Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C DK-1870, Denmark

8 Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA

For all author emails, please log on.

Algorithms for Molecular Biology 2013, 8:14  doi:10.1186/1748-7188-8-14


The electronic version of this article is the complete one and can be found online at: http://www.almob.org/content/8/1/14


Received:21 March 2013
Accepted:28 March 2013
Published:20 April 2013

© 2013 Will et al.; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The search for distant homologs has become an import issue in genome annotation. A particular difficulty is posed by divergent homologs that have lost recognizable sequence similarity. This same problem also arises in the recognition of novel members of large classes of RNAs such as snoRNAs or microRNAs that consist of families unrelated by common descent. Current homology search tools for structured RNAs are either based entirely on sequence similarity (such as blast or hmmer) or combine sequence and secondary structure. The most prominent example of the latter class of tools is Infernal. Alternatives are descriptor-based methods. In most practical applications published to-date, however, the information contained in covariance models or manually prescribed search patterns is dominated by sequence information. Here we ask two related questions: (1) Is secondary structure alone informative for homology search and the detection of novel members of RNA classes? (2) To what extent is the thermodynamic propensity of the target sequence to fold into the correct secondary structure helpful for this task?

Results

Sequence-structure alignment can be used as an alternative search strategy. In this scenario, the query consists of a base pairing probability matrix, which can be derived either from a single sequence or from a multiple alignment representing a set of known representatives. Sequence information can be optionally added to the query. The target sequence is pre-processed to obtain local base pairing probabilities. As a search engine we devised a semi-global scanning variant of LocARNA’s algorithm for sequence-structure alignment. The LocARNAscan tool is optimized for speed and low memory consumption. In benchmarking experiments on artificial data we observe that the inclusion of thermodynamic stability is helpful, albeit only in a regime of extremely low sequence information in the query. We observe, furthermore, that the sensitivity is bounded in particular by the limited accuracy of the predicted local structures of the target sequence.

Conclusions

Although we demonstrate that a purely structure-based homology search is feasible in principle, it is unlikely to outperform tools such as Infernal in most application scenarios, where a substantial amount of sequence information is typically available. The LocARNAscan approach will profit, however, from high throughput methods to determine RNA secondary structure. In transcriptome-wide applications, such methods will provide accurate structure annotations on the target side.

Availability

Source code of the free software LocARNAscan 1.0 and supplementary data are available at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan webcite.

Background

Over the last decade, a series of large-scale transcriptome projects has profoundly changed our perception of the transcriptome. Reviewed e.g. in [1], pervasive transcription is widespread and plays a crucial role in controlling gene expression and genomic plasticity. Gene prediction and gene annotation of non-protein coding entities have remained non-trivial problems, nevertheless. In part, this is due to our incomplete understanding of the diversity of ncRNAs, of which novel types and subtypes keep being discovered at a rapid pace. An important confounding factor, however, is the rapid evolution of many ncRNA sequences [2-4], which intrinsically limits the applicability of homology search methods [5,6] and hence hide distant homologs.

The three-dimensional structure is important for the functionality and/or the proper processing of a large and important subgroup of ncRNAs. The most prominent representatives are ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), spliceosomal RNAs (snRNAs), small nucleolar RNAs (snoRNAs), and microRNAs (miRNAs). While rRNAs and tRNAs are among the best-conserved sequences also at sequence level, other classes such as C/D box and H/ACA box snoRNAs exhibit sometimes very large substitution rates. The conservation of spatial structure implies that secondary structure, i.e., base pairing patterns, are also under stabilizing selection. In many cases, the structure evolves much slower than the sequence, see [7] for a recent detailed analysis of this phenomenon. Thus, several computational tools have been devised to utilize secondary structure alongside with sequence information for homology search. The same effect is exploited by tools such as RNAz[8], Evofold[9], or SISSIz[10] that detect selection pressure on RNA secondary structure in multiple sequence alignments.

Structural similarity is either inherited from a common ancestor or arises by convergent evolution as the result of similar selective constraints. Operationally, we distinguish RNA families and RNA classes. The members of RNA families share a sufficiently high level of sequence similarity to establish the existence of a common ancestor, which in practice translates to the possibility of representing them as structure-annotated multiple sequence alignment. The Rfam database [11] serves as a comprehensive repository for this type of data. Representatives of RNA classes share secondary structures (e.g. as a consequence of a common processing pathway in the case of microRNA precursors) or a combination of sequence and structure features (e.g. as a consequence of being incorporated into analogous ribonucleoproteins in the case of snoRNAs).

Homology search programs are geared towards detecting novel members of known RNA families, reviewed e.g. in [12]. The most commonly used tool Infernal[13] uses covariance models (CMs), i.e., the stochastic context free grammar analogue of profile hidden Markov models. Similar to search heuristics such as Erpin[14], the CMs are trained from sequence alignments that are annotated by a consensus secondary structure. When a lack of known examples precludes the construction of consensus models, tools such as RSEARCH[15] and BlastR[16] allow single structure-annotated or even unstructured RNAs as queries. A common feature of all these methods is that they heavily (or even exclusively) rely on the sequence information contained in the query model, and that they evaluate whether a piece of the target sequence can be folded to match a prescribed query structure. Consistency with the query structure, however, does not necessarily imply that a putative homolog is thermodynamically predisposed to actually fold into this structure. Whether the query structure is close to the target’s groundstate or whether it is an unfavourable high energy structure, therefore can provide additional information to improve specificity.

The members of RNA classes, on the other hand, share structural features and some sequence motifs deriving from common binding partners and functions. The term RNA clan[17] has been proposed for RNAs that derive from a common ancestor but have diverged far enough to either be difficult to align or have distinct functions, or both. As the distinction between clans and classes requires detailed knowledge of their evolutionary histories, we will not distinguish between clans and classes in this contributions. Examples of RNA classes are animal microRNAs (featuring a characteristic precursor hairpin and processing pattern) or the two distinct classes small nucleolar RNAs (snoRNAs) defined by the C/D and H/ACA “boxes” (short common sequence motifs) and very different characteristic secondary structures. These three paradigmatic RNA classes each comprise large numbers of families. Computational surveys [18-21], furthermore, gathered convincing evidence for large numbers of conserved RNA structures; subjecting these data sets to structure-based clustering suggested the existence of additional, so far undescribed RNA classes [22].

There are mainly two approaches for the search for novel members of RNA classes. One class consists of descriptor-based methods [23-25] or with the help of class-specific tools that combine an efficient filtering step with elaborate, often machine-learning based, evaluation procedure that ensure the required specificity [26]. In either case an in-depth knowledge of the RNA class under consideration is required. The other class does not require in-depth information on the RNA-class and its characteristic patterns of conservation. Instead it exploits information on sequential and structural similarity directly by using sequence-structure alignment. The first practical approaches for multiple structural alignment, such as RNAforester[27] and MARNA[28], depend on predicted or known secondary structures. In practice, however, these approaches are limited by the low accuracy of non-comparative structure prediction. For this reason, several derivatives of the Sankoff-based approach [29] of simultaneous alignment and folding have been introduced. In approaches such as FoldAlign[30,31], Dynalign[32], and Stemloc-AMA[33] a full energy model for RNA is implemented that is evaluated during the alignment computation. However, in its full form, these approaches suffer from a high worst case computational complexity of <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M1">View MathML</a> time and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M2">View MathML</a> space. In contrast, PMcomp[34] and LocARNA[22] use a full-featured energy model in a precomputation step to determine a reduced representation of the structure ensemble in form of base pair probability matrices [35]. During the alignment process, base pair probabilities are used to assess the similarity of the secondary structures. Using additional computational optimizations, the complexity of LocARNA could be reduced to quartic time and quadratic memory consumption, making it currently one of the most efficient versions of the Sankoff algorithm. Several improvements and extensions of LocARNA have been discussed before: to additionally reduce LocARNA’s runtime, ExpaRNA-P[36,37] utilizes a fast structural filtering method based on local structural motifs [38,39]; REAPR[40] introduces a multiple alignment-based banding method to realign eukaryotic whole genome alignments based on RNA structure; recently, [41] introduces the very efficient LocARNA descendant SPARSE; and LocARNA-P[42] extends LocARNA by computing reliabilities, thus enabling new applications of Sankoff-style alignment. None of these approaches, however, addressed efficient scanning.

In practice, homology search relies predominantly on the sequence information in the query. Even in the CMs representing the heavily structured Rfam alignments sequence information by far outweighs the additional bit score contributed by the consensus secondary structure ( [43], Figure one point nine). Indeed, for most Rfam families, the structural information is well below the 20 bits that would be required to push the E-value below 1 in a small, 1M bacterial genome in a structure-based search that completely ignored sequence conservation.

CM methods ask how well the target sequence matches the query’s sequence and structure model that is derived from the input alignment and its provided consensus structure. However, they do not take into account the thermodynamic stability of the target sequence folding into the consensus structure. This could provide an additional source of information on the structural concordance of query and target.

In principle, this information is accessible in two quite different ways. Following the philosophy of Thermodynamic Matchers [25] and of the structure conservation index [44], one can interpret the consensus structure as constraint and evaluate the energy difference between constrained and unconstrained folding. The alternative is to predict structures also for the target.

A pilot study [45] provided first indications that structural alignments could be used for genome-wide homology search in a regime where sequence information is scarce. In an approach to find class-members, a model specified as base pairing probability matrix, possibly augmented with some additional sequence information, is searched against a target for which local base pairing probabilities are provided. To this end, base pairing probability matrices were computed with McCaskill’s partition function folding algorithm [35]. These were then aligned locally with LocARNA[22]. Here we describe and evaluate an optimized semi-global scanning variant of the LocARNA algorithm that can be employed in genome-wide applications.

The LocARNAscan algorithm

LocARNA is a computationally light-weight and very efficient variant of the Sankoff algorithm [29]. It improves the CPU and memory requirements each by a quadratic factor over the original algorithm. For this purpose, LocARNA allows for matches of base pairs that occur with a given minimum probability in the structure ensembles of the single input sequences.

Here we devise a scanning variant of LocARNA, called LocARNAscan, that computes alignments of a query RNA with a much longer target sequence based on sequence and structure similarity. In our discussion, we require such alignments to be semi-global; in such an alignment the entire query is compared to a subsequence of the target. We achieve this by allowing free end gaps, i.e. arbitrary long deletions at both ends of the target. This semi-global scenario is designed for the common case of known motif boundaries in the query. Nevertheless, our method can be adapted to align locally with respect to both target and query as long as the locality is of biological origin. Technical locality, which was introduced in the trCYK algorithm [46] as a means of dealing with partial sequences from RNA-seq data, cannot easily be integrated, however.

We provide a generic description of the algorithm, where the target is a single sequence and, in general, the query is a multiple alignment. Furthermore, both, query and target are annotated with a base pair probability matrix, which works similarly to a weighted contact map. In general, such matrices can be computed by McCaskill’s algorithm [35]. In the common case of known query structure, we generate an appropriate base pair probability matrix by setting the probabilities of the query structure base pairs to 1 (and all others to 0). For the target, we suggest to compute local base pair probabilities by RNAplfold[45,47], which limits the span (j − i) of base pairs (i,j). Thus, we assume a maximum span L << n of target base pairs that we consider for the comparison to the query. This restriction of the base pair span serves two purposes. First, the restriction allows predicting the base pair probabilities much more efficiently. Second, RNA structure prediction methods generally tend to mispredict large base pairs. As a consequence, the accuracy is usually even increased by limiting the size of base pairs (cf. [48].)

Notation and scoring model

Both, target T and query Q are sequences of nucleotides or, in the case of multiple alignment, alignment columns with lengths n and m, respectively. The sequences are annotated with respective substochastic matrices PT ∈ [0,1]n×n and PQ ∈ [0,1]m×m. We write (i,j) ∈ PX as shorthand for PX(i,j) ≥ pmin(X ∈ {T,Q}), where pmin is a fixed cutoff probability. Note that for each fixed i, the number of j satisfying (i,j) ∈ PX is constantly bounded, since <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M3">View MathML</a> (cf. LocARNA [22]).

An alignment <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M4">View MathML</a> of T and Q is a set of pairs (i,j) of indices i of T and j of Q, where all pairs <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M5">View MathML</a> satisfy (i < i′ iff j < j′) and (i = i′ iff j = j′).

A secondary structure of length n is a set R of base pairs (i,j) with 1 ≤ i < j ≤ n, where base pairs do not cross, i.e. no two base pairs (i,j) ∈ R and (i′,j′) ∈ R satisfy i < i′ < j ≤ j′ or i ≤ i′ <j < j′.

Together with an alignment <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M6">View MathML</a> of target and query, we are going to predict a consensus structure <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M7">View MathML</a>. A consensus structure is a set of pairs of base pairs ((i,j),(k,l)), where the set of the respective first and second components are secondary structures of T and Q; it is consistent with an alignment <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M8">View MathML</a> iff for all ((i,j),(k,l)), <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M9">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M10">View MathML</a>.

The score of a consistent pair <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M11">View MathML</a>, which represents a sequence-structure alignment, is of the form

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M12">View MathML</a>

(1)

Here, <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M13">View MathML</a> denotes the set of single-stranded alignment edges <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M14">View MathML</a>, i.e. the alignment <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M15">View MathML</a> without all edges that match base pairs in <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M16">View MathML</a>. Furthermore, γ < 0 is the gap penalty and Ngap counts the scored gaps in <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M17">View MathML</a>. Recall that for semi-global alignment, we allow free, non-scored, deletions at both ends of the target; thus, for the semi-global score

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M18">View MathML</a>

(2)

the global score is of the same form of Equation (1), where <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M19">View MathML</a>. The functions σ(Ti,Qk) and τ(Ti,Tj;Qk,Ql) yield sequence-based similarities between elements of target and query in the unpaired and paired part of the consensus structure, respectively. We are going to discuss their instantiations later. The explicit dependence of τ on the aligned base pairs can be used to include contributions based on covariation or substitution. Finally, <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M20">View MathML</a> scores the structural contribution of the consensus base pair <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M21">View MathML</a>. As in LocARNA, <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M22">View MathML</a> is derived as log odd from the base pair probability PX(a), where we set ΨX(a) := − if PX(a) < pmin to rule out base pairs with very low probability in (finitely scoring) consensus structures.

For simplicity, we present this score for linear (non-affine) and position independent gap cost, since the extensions of the presented method are straightforward.

Finally, we define the subscore for ij and kl (denoting respective subsequences of T and Q) to be of the form of Equation (1), but – unlike the total score – valid only for alignments <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M23">View MathML</a>, where furthermore Ngap is defined as

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M24">View MathML</a>

such that the subscore penalizes deletions at both ends of the target subsequence ij.

Finally, we define the analogous subscore with free left end deletions by defining Ngap as

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M25">View MathML</a>

Dynamic programming recursions

For maximizing the semi-global score of <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M26">View MathML</a>, we start by introducing dynamic programming matrices S and D in analogy to the dynamic programming matrices of LocARNA[22] and PMcomp[34]. Thus, we define the entry Si,j;k,l as the best subscore for ij and kl. Furthermore, let Di,j;k,l be the best subscore for ij and kl of an alignment and consensus structure <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M27">View MathML</a> subject to the constraint that <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M28">View MathML</a> contains ((i,j),(k,l)).

Equivalently, we redefine S and D recursively by

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M29">View MathML</a>

(3)

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M30">View MathML</a>

(4)

with appropriate initializations

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M31">View MathML</a>

(5)

for 1 ≤ i < j ≤ n and 1 ≤ k < l ≤ m.

Since these equations are left reducing variants of the otherwise well known recursions of PMcomp[34], one computes all entries in D in <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M32">View MathML</a> time and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M33">View MathML</a> space by applying the evaluation strategy of LocARNA, which takes advantage of filtering base pairs by probability pmin. However, note that the dependency on the target size n is still unacceptable for our purpose (of scanning genome-sized targets); therefore, we are going to improve on these complexities later.

The matrices S and D are defined to compute global subscores (i.e., without free end gaps). Since we finally need alignments with free end gaps, we have to introduce an additional DP matrix S. Furthermore, since we need the semi-global alignment score only for the entire target and query, it suffices to define the matrix only for scores of target and query prefixes. Thus, define <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M34">View MathML</a> non-recursively as the best subscore for 1…j and 1…l with free left end deletions. This is equivalent to the recursion

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M35">View MathML</a>

(6)

where the free end gaps are achieved purely by the specific initialization

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M36">View MathML</a>

(7)

for 0 ≤ j ≤ n and 0 ≤ l ≤ m.

Efficient evaluation of the recursion equations

As mentioned before, the computational demands of the straightforward evaluation of these recursions are unacceptable for scanning. This holds even when applying the evaluation strategy of LocARNA, which never stores more than <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M37">View MathML</a>S entries at each time. Although, as suggested before, we generally limit the maximum base pair span to L, this leaves us with prohibitive linear space dependency on the target size, since <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M38">View MathML</a> entries of D have to be kept in memory before the S recursion can be evaluated. Furthermore, storing the entire matrix S requires <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M39">View MathML</a> space.

Therefore, we rearrange the evaluation interleaving the computation of entries in D and S. This allows us to compute all S entries in one pass over the target sequence, while successively materializing required matrix entries and in turn removing “old” entries that are not accessed anymore (Algorithm 1, cf. Figure 1).

Algorithm 1: LocARNAscan algorithm

thumbnailFigure 1. Schematic view of the computation of S by LocARNAscan. The dark green dots in <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M41">View MathML</a> are sufficient to calculate the current entry <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M42">View MathML</a>, given the set of arcs indicated by red lines.

Computing and storing the D entries in lines 3–5 requires to compute and store the subset of S entries that allows to derive D(i,j,k,l) efficiently from S(i + 1,j − 1,k + 1,l − 1). Formulating recursion (3) recursing only to entries with the same fixed right ends j and l enables an important optimization: for each j and l, we compute a matrix slice of entries in S that have right ends j − 1 and l − 1 and then derive all D(i,j,k,l) from this matrix slice. Note that we need to compute and store at most min(j − 1,L) × (l−1) such entries, but potentially less depending on the actual base pairs (i,j) ∈ PT and (k,l) ∈ PQ. After the D entries are derived, these S entries are not accessed anymore and their space can be reused.

In lines 8 and 9, we free the space of all D and S entries that cannot be accessed by subsequent algorithm steps anymore. This is guaranteed for all entries with target left end i ≤ j − L, since no base pair spans more than L positions. Consequently, the space requirements of this algorithm are bounded by the requirement to store Lm entries in slices of S, Lm entries of S, and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M43">View MathML</a> entries of D, i.e., <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M44">View MathML</a> space in total.

Allocating and freeing entries can be implemented in constant time by using rotating matrices, commonly implemented by addressing based on target indices modulo L. Since D is a sparse matrix, it is conveniently implemented based on a hash. Thus, the time complexity of this algorithm is bounded by computing all matrix entries, i.e. nm entries of S, <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M45">View MathML</a> entries of S, and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M46">View MathML</a> entries of D. Note that each entry is computed in constant time; for S and S this holds due to considering only a sparse subset of base pairs. Thus, we derive the total time complexity <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M47">View MathML</a>.

Sequence score contributions

The sequence contribution of our score is defined via the two similarity functions σ and τ. Note that LocARNAscan was designed to work even without sequence information. In this special case, we set both functions to constantly return 0.

In general, LocARNAscan accepts information about the query sequence in the form of a multiple alignment. Replacing the Ribosum-like [15] definition of σ and τ in LocARNA, we suggest to utilize log-odd scores based on nucleotide (and nucleotide pair) frequencies in this multiple alignment. Whereas the original LocARNA score is tailored for comparing single sequences or constructing multiple alignments (there used in a sum-of-pairs score), log-odd based scores, which are similarly utilized by Infernal[13], are more appropriate for scanning applications.

Given two column vectors q and q′ of nucleotides and gaps from the query multiple alignment and given two nucleotides t and t′ from the target, fq(t) denotes the frequency of t in q; <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M48">View MathML</a>, the frequency of pairs t and t′ in the corresponding rows of q and q′. Let bt be the background frequency of nucleotide t; <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M49">View MathML</a>, of the nucleotide pair t, t′ in canonical base pairs. The similarity in the single stranded case is then defined by

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M50">View MathML</a>

(8)

the similarity in the base paired case, by

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M51">View MathML</a>

(9)

For simplicity, we assume uniform distribution of single nucleotides and nucleotides in base pairs. That is we use background frequencies bt = 1/4 and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M52">View MathML</a>; the latter reflects that we consider all six canonical base pairs.

In our implementation, for fast evaluation of the similarity functions, we compute the profiles, consisting of <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M53">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M54">View MathML</a> for all query position k < l, prior to the actual scanning. Furthermore, to handle small query alignments and smoothen the scoring functions, we add pseudocounts depending on the query’s mononucleotide background frequencies.

Reporting optimal and suboptimal occurrences of the query

An occurrence of Q in T is a subsequence T[i]…T[j] of T; its score is the best subscore for ij and 1…m. The optimal occurrence of Q in T is determined during the run of Algorithm 1 by recording the j with the best score S(j,m). By definition of S, j is the right end of the target subsequence that optimally aligns with the query, in the case that an arbitrarily long left end of the target can be deleted for free. Albeit finding the left end requires extra work (see next subsection), the right end suffices to specify the occurrence.

For reporting suboptimal occurrences, one can record all scores sj := S(j,m) during the run of the algorithm. However, reporting all occurrences down to a certain score threshold in this vector is unsatisfactory, since good scores are usually flanked by only slightly worse scores that refer to the same occurrence with slightly altered alignment. For this reason, we report only local maxima of sj, i.e. j, where sj ≥ max(sj−1,sj+1). Furthermore, we do not report a local maximum j′ if it is too close to a reported local maximum j with a better or equal score. Setting the distance threshold to one query length (m), we avoid reporting occurrences with substantial overlap. More formally, for j and j′ where |jj′| ≤ m, we say that jdominatesj′, iff <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M55">View MathML</a> or ( <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M56">View MathML</a> and j < j′). We prune all dominated local maxima and report only the non-dominated ones.

Limiting the memory consumption independent of the target length is crucial for scanning. In addition, random access to data growing linearly with the target length has to be avoided. Consequently, we devise an online pruning algorithm with space requirements linearly bounded by the query length. While scanning the target by algorithm 1, we maintain a list of local maxima j′ and remember their corresponding scores <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M57">View MathML</a>. We maintain three invariants: 1) list entries are increasing, 2) the distance of successive list entries is smaller than or equal to m, and 3) scores of list entries are strictly increasing.

When identifying a new local maximum j (after computing sj + 1 = S(j + 1,m)), we try to resolve as many domination relations as possible. If jm is larger than the last list entry, entries of the list are independent of the new local maximum and all further local maxima. Thus, the list is resolved by iteratively reporting the last list entry and removing all dominated entries. All non-dominated entries are reported and the list is deleted. Finally, unless j is dominated by the last j′ in the list, we push j to the list. Note that here we push to the empty list or the score sj is higher than the score of the last list entry. After scanning the entire target, the list is resolved again.

The correctness follows from preserving the invariants in all cases. It remains to show that the list length is linearly bounded by the query length. This is due to the discretization of our score (we round all score contributions to limited precision); being an additive score, the score is linearly bounded by the query length; consequently, lists satisfying the third invariant have linearly bounded length. Note that in the average case, lists are even much shorter than this theoretic worst case bound suggests.

To handle large targets without excessive use of main memory, we implement two reporting strategies. Either we output all reported occurrences immediately or we limit the number of reported occurrences a priori to some K. To output only K best occurrences, we store them in a priority queue, sorted such that the occurrence with minimum score is on top and thus can be removed, whenever the limit is exceeded. Finally, the contents of the queue are output. While outputting the occurrences, we keep a histogram of the locally maximal scores for later use in determining the significance of an occurrence from the empirical score distribution. The entire strategy is generally similar to GotohScan[49] and extended here by the online pruning procedure.

To determine the significance of an occurrence, we compared several theoretical distributions to the empirical score distribution, but none approximated the alignment scores well (Figure 2). Hence, we followed the approach presented in [49]: for the identification of extreme alignment scores, it suffices to approximate the right tail of the distribution; the tail can be fitted well by an affine function. This strategy enables to return only significant occurrences with e-value up to a given threshold.

thumbnailFigure 2. Fitting of several commonly used probability distributions to the histogram of LocARNAscanscores. Scores correspond to LocARNAscan alignments using the profile of the RFAM family RF00504 (glycine riboswitch) as input query. The first row shows the fitting of log-normal, gumbel, and generalized extreme value (gev) distributions (red curves) to the alignment scores shown as histogram. The shown scores have been shifted to positive values. In the lower panel, we compare the distributions by Q-Q plots. These plots, which plot the quantiles of the observed scores vs. expected quantiles from the theoretical distributions, visualize in how far two distributions differ in location, scale and skew from each other. All tested known probability distributions (including normal and gamma distribution; data not shown) do not represent the LocARNAscan alignment score distribution well; visible in the Q-Q-plot, since none of the Q-Q plots follows a straight line.

Traceback and Bound on Alignment Length

Obtaining the actual alignments corresponding to occurrences with right end j and recovering the left end of the occurrence requires a traceback procedure through the dynamic programming matrices. This poses the problem that space limitations forbid storing the entire matrices, such that these matrices are not available after running Algorithm 1.

Therefore, given j, we determine and recompute the relevant part of matrix S and the corresponding D entries. Efficiency demands to keep the recomputed relevant part of S as small as possible. Given the score sj, we derive a lower bound i for relevant left ends i′ of the occurrence. Consequently, since the occurrence score cannot depend on entries with lower target left ends i′ < i, one initializes all S(i − 1,k′) with − (1 ≤ k′ ≤ m), all S(j′,0) with 0 (i−1 ≤ j′ ≤ j), and recomputes all entries S(j′,k′), where i ≤ j′ ≤ j and 1 ≤ k′ ≤ m, according to Eq. (6).

Finally, transferring a strategy known from LocARNA, we free S entries during this first recomputation, such that we recompute the S entries “on the trace” a second time, when performing the actual traceback. In this way, the traceback does not require more space than the score computation.

Analogously to [45], we bound the difference between the length of query and occurrence by Δ, where we choose Δ := (sjm max(σmax,τmax/2))/γ such that

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M58">View MathML</a>

and σmax and τmax denote the respective maximum single base match and maximum base pair match contributions between query and target; notably, the latter similarity includes sequence contributions due to τ and structure contributions due to ΨT and ΨQ.

Whereas [45] suggests to limit the “history” during the search phase by m + Δ, we limit this more strongly to L at the price of recomputation. Making the common case fast, this strategy is generally advantageous; it reduces load in the scanning phase, while only very few entries close to high scoring occurrences have to be recomputed, causing largely negligible extra cost. Similarly, one bounds the number of required entries of D, since the benefit of matching two base pairs has to justify a potential length difference of the enclosed subsequences.

LocARNAscan recognizes thermodynamic stability of occurrences

To study the specific behavior of LocARNAscan, we compared the performance of LocARNAscan and Infernal on a designed target containing a series of thermodynamically stable and (presumably) unstable occurrences. This allowed us to measure the difference in sensitivity to both classes given different training information.

First, we designed two sets of 1000 RNA decoys each. In the first set, which we call stabilized class, the RNA decoys were designed, applying inverse folding, to fold into tRNA structures with high thermodynamic stability. In the second set, called non-stabilized class, the decoys were generated with the potential to fold into tRNA structures forming canonical base pairs, but their stability is purely by chance. By design, the decoys are not related on the sequence level to each other or to known RNAs but share approximately the same mononucleotide frequencies. However, each decoy has the length and structure of a different randomly selected sequence from the Rfam tRNA family. From these decoys, we generated one pseudogenome consisting of an RNA decoy every 200 bases padded with random nucleotides from the same mononucleotide distribution.

For each class, we selected 100 samples at random to use them as training sets. Based on the Rfam seed alignment, we obtained a multiple alignment of the 100 samples annotated with a consensus structure derived from the Rfam consensus structure of the entire family. From both multiple alignments, we generated covariance models for Infernal. From the model for the non-stabilized class, we striped off all sequence information, replacing it by a background model based on the mononucleotide frequencies of the pseudogenome. This procedure resulted in a stripped model and a stabilized model. For LocARNAscan, we generated two different queries. The first query consists of a tRNA of median length, where the sequence was replaced by a string of Ns and PQ was generated from the Rfam structure of this tRNA; the second, of the “stabilized” multiple alignment and a matrix PQ that was generated from its Rfam-derived consensus structure. By design, the first query contained at most the information of the stripped model, whereas the second query and the stabilized model contain exactly the same information.

We emphasize that both, the stripped CM and the all-N query sequence represent intentionally extreme cases. Since stable RNAs favor CG base pairs over AU and GU base pairs and conversely, A and U in single stranded regions over C and G, one could usually utilize such knowledge, even in the absence of sequence information. Without doubts, Infernal would profit from this knowledge; building corresponding models is even supported via cmbuild’s option eset 0 in Infernal 1.1. As well, LocARNAscan could be similarly extended to mirror Infernal’s behavior. Nevertheless, we intentionally study the extreme scenario to isolate the effect of incorporating thermodynamic stability.

Finally, we performed four scans of the pseudogenome. We run LocARNAscan with both designed queries and Infernal with the stripped model and the stabilized model. While the former model allowed Infernal to make full use of its training machinery, the latter tested Infernal’s performance without sequence information. Per query, Infernal scanned the genome in 9 minutes (on Intel Q9400 2.66 Ghz), whereas LocARNAscan took only 3:10 minutes. Using its HMM filter, Infernal improves to roughly 6 minutes. LocARNAscan additionally requires to precompute base pair probabilities by folding the genome once (for potentially many queries); RNAplfold performed this in 5 minutes.

In Figure 3, we compare the separation of the two decoy classes by the different runs by their classification behavior. For each run, we plot the number of occurrences that coincide with stabilized decoys vs. the number with non-stabilized decoys at the same score threshold. The curves show that Infernal without sequence information completely fails to distinguish the two classes; in contrast, samples of stabilized decoys contain enough sequence information to allow Infernal to classify almost perfectly. Trained without sequence information, LocARNAscan is superior to Infernal in sensing the stability of the decoys. Given the multiple alignment, LocARNAscan gains classification strength, but does not match the excellent performance of Infernal with sequence information.

thumbnailFigure 3. Classification of thermodynamically stabilized vs. non-stabilized occurrences.

Discussion and Conclusions

Genome-wide search for homologs of structured RNAs can benefit from the additional information encoded in their base pairing patterns. Nevertheless, currently available tools predominantly utilize the sequence information contained in query and target. The structural contributions are incorporated at the level of consistency between target sequence and query structure. Here we asked whether it can be worthwhile to include direct structural information, and thus implicitly evidence for the stability of secondary structure, also on the side of the target.

This is feasible in practice based on “scanning versions” of RNA secondary structure prediction tools that compute e.g. probabilities for local base pairs with a limited span in the target. Homology search on such a structure-annotated target is naturally performed as a local or semi-global sequence-structure alignment. With LocARNAscan we devised an efficient implementation of a semi-global variant of the Sankoff algorithm. For applications of genome-wide searches, we developed several algorithmic improvements relative to the global LocARNA approach. Rethinking the trade-offs between storing and recomputing intermediate data, LocARNAscanrequires only <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M59">View MathML</a> memory, dependent only on the query size m and the span L of the precomputed base pairs in the target, but independent of the size n of the target itself, which is linearly read from disc and does not need to be stored in its entirety.

LocARNAscan’s CPU requirements of <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M60">View MathML</a> make genome-wide scanning feasible. In our experiment, LocARNAscan including the precomputation of base pair probabilities by RNAplfold, is about as fast as Infernal; this does not even change much when Infernal is allowed to use HMM filtering, since the small contribution of sequence information in our setting limits the effect of filtering. Calibration of the Infernal model, which is usually required, would have increased its total time requirements dramatically. The actual scan by LocARNAscan and the RNAplfold precomputation took almost the same time. We point out that the theoretical worst case complexities <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M61">View MathML</a> of LocARNAscan, <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M62">View MathML</a> of RNAfold, and <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M63">View MathML</a> of Infernal would suggest a different ordering of run times. The observed run times are a consequence of very different constant overheads. While LocARNAscan’s L2 factor is strongly reduced by the filtering of base pairs by their probability, e.g. RNAplfold’s overhead is much larger due to the complex energy model and Infernal’s run time actually depends on the number of CM states, which is a multiple of the query length.

Not surprisingly, little can be gained by the extra expense of the sequence-structure alignment as long as sufficient sequence information is contained in the query. We therefore concentrated here on the regime in which our current homology search tools are effectively blind, i.e., cases in which sequence conservation is completely absent. We find that in such an extreme setting LocARNAscan retains the ability to distinguish between thermodynamically stable RNA elements and decoys that admit the same base pairing patterns albeit far away from their groundstate. In contrast, Infernal is blind to this difference.

Of course, this is an extreme regime that is of interest in rare applications since most RNA families also exhibit an appreciable level of sequence conservation. A further practical limitation is the accuracy of the predicted base pairs of the target structure. Several approaches for the genome-wide measurements of secondary structure (see [50] and the references therein), however, promise to at least alleviate this issue in the near future.

Methods

Design of decoy RNAs

Each decoy (in both the stabilized and the non-stabilized class) is generated from a randomly selected entry in the Rfam 11.0 seed alignment of the tRNAs. First, we remove all gap columns and the corresponding symbols in the dot-bracket consensus structure string. If this deletes only one parenthesis of a parentheses pair, we replace the other one by a dot. In this way, we generate the ungapped tRNA sequence S and the corresponding projection of Rfam’s tRNA family consensus structure. Then, we fold the sequence constrained with the consensus structure projection (using RNAfold). This results in a specific structure R for the selected tRNA. For the stabilized class, we apply inverse folding into structure R optimizing the probability of R in the ensemble of the designed sequence S; for the inverse folding we apply RNAinverse in partition function mode. In addition, we configure RNAinverse to stop at a cutoff corresponding to probability p = 0.75. This generally results in sequences, where R has only slightly higher probability.

The tool RNAinverse does not directly allow to stop the optimization at a probability p. However, RNAinverse accepts a minimal energy difference between the energy E(R|S) of R for S and the ensemble free energy <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M64">View MathML</a> of S. We utilize this feature after inferring the p-equivalent difference by

<a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M65">View MathML</a>

(10)

where <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M66">View MathML</a> denotes the gas constant and T the temperature.

We generate the non-stabilized decoys based on the mononucleotide frequencies of the stabilized ones. For generating a decoy in the non-stabilized class, given a structure R and mononucleotide frequencies pX(X ∈ {A,C,G,U}), we draw the unpaired bases and bases at left base pair ends randomly from this distribution. The right ends of base pairs are set complementary to the left end bases (disallowing GU pairs).

Stripping off sequence information of a covariance Model

Similar to profile hidden Markov models (HMMs), Infernal’s covariance models (CMs) are generative models that describe a probability distribution over sequences. Designed for RNAs, the CMs contain information about the RNA secondary structure, such that they can distinguish between unpaired bases and base pairs. Thus, like profile HMMs, the CMs contain nucleotide emission probabilities at different match states, but additionally contain nucleotide pair emission probabilities at special base pair match states. Technically, CMs contain log odd bit scores of the emissions calculated from the emission probabilities and, in uncalibrated CMs, a simple uniform null model. The key to entirely remove sequence information from a CM is thus to replace all emission scores of nucleotides and nucleotide pairs according to a background model.

Given mononucleotide frequencies pX(X ∈ {A,C,G,U}), we replace all emission scores for nucleotide X by <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M67">View MathML</a>. For base pairs, we take into account a low probability pnc of emitting a non-canonical nucleotide pair (pnc = 0.001). The scores for nucleotide pairs (X,Y) ∈ {A,C,G,U}2 are then replaced by <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M68">View MathML</a> for canonical pairs XY ∈ {AU,UA,CG,GC,GU,UG} or <a onClick="popup('http://www.almob.org/content/8/1/14/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/14/mathml/M69">View MathML</a> for non-canonical pairs.

Running LocARNAscan and Infernal

First, we performed two scans of the pseudogenome by LocARNAscan 1.0 ; the tool is available as free software at http://www.bioinf.uni-leipzig.de/Software/LocARNAscan webcite. In extension of the presented score of Equation (1), LocARNAscan 1.0 supports affine gap cost and weighs the structure against sequence similarity. We set the weighting factor to 2.0 (option struct-weight=200) and the weighting of the sequence contribution τ(Ti,Tj;Qk,Ql) in the structure score component to 4.0 (option tau=400). Furthermore, we set the insertion/deletion score to -1.0 and gap opening cost to -5.0 (options indel=-100 and indel-opening=-500). Finally, we activate the introduced log odd scoring of LocARNAscan (option logoddscores on). Base pairs in the pseudogenome are predicted with RNAplfold[47] forbidding lonely base pairs and setting a folding window size of 200nt and a maximal base pair span of 100nt. From the RNAplfold prediction, we removed all base pairs with probability less than 0.5.

Second, we performed two corresponding scans by Infernal 1.0.2. We built Infernal models by cmbuild with default parameters. Then, we scan the pseudogenome with cmsearch. To avoid missing decoys we turned off HMM filtering (option fil-no-hmm). Furthermore, to produce comparable results, we scan only the forward strand (cmsearch option top-only) and run Infernal in glocal mode (option -g); the latter turns on LocARNAscan-like semi-global behavior.

For both, LocARNAscan and Infernal, we determined findings of stabilized and non-stabilized decoys and the corresponding occurrence score. There, we considered an occurrence, if it overlaps a decoy by at least 10% nucleotides. In the case of multiple occurrences overlapping the same decoy, we selected the best prediction as score for this decoy.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RB, KR, PFS, and SW designed the algorithm and study; MS and SW implemented LocARNAscan; JE, SH, MS, and SW evaluated its performance; finally, all authors contributed to writing the manuscript. All authors read and approved the final manuscript.

Authors’ information

Sebastian Will and Michael F Siebauer are joint first author.

Acknowledgements

We kindly thank Jana Hertel for discussions on occurrence pruning and e-value estimation similar to GotohScan.

References

  1. Berretta J, Morillon A: Pervasive transcription constitutes a new level of eukaryotic genome regulation.

    EMBO Rep 2009, 10:973-982. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  2. Ponjavic J, Ponting CP, Lunter G: Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs.

    Genome Res 2007, 17:556-565. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Pheasant M, Mattick JS: Raising the estimate of functional human sequences.

    Genome Res 2007, 17:1245-1253. PubMed Abstract | Publisher Full Text OpenURL

  4. Ponting CP, Hardison RC: What fraction of the human genome is functional?

    Genome Res 2011, 21:1769-1776. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  5. Menzel P, Gorodkin J, Stadler PF: The tedious task of finding homologous non-coding RNA genes.

    RNA 2009, 15:2075-2082. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Mosig A, Zhu L, Stadler PF: Customized strategies for discovering distant ncRNA homologs.

    Brief Funct Genomic Proteomic 2009, 8:451-460. PubMed Abstract | Publisher Full Text OpenURL

  7. Piskol R, Stephan W: Selective constraints in conserved folded RNAs, of drosophilid and hominid genomes.

    Mol Biol Evol 2011, 28:1519-1529. PubMed Abstract | Publisher Full Text OpenURL

  8. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs.

    Proc Natl Acad Sci USA 2005, 102:2454-2459. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  9. Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J: A comparative method for finding and folding RNA secondary structures within protein-coding regions.

    Nucleic Acids Res 2004, 32:4925-4936. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  10. Gesell T, Washietl S: Dinucleotide controlled null models for comparative RNA gene prediction.

    BMC Bioinformatics 2008, 9:248. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  11. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families.

    Nucleic Acids Res 2013, 41:D226-D232. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  12. Freyhult EK, Bollback JP, Gardner PP: Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA.

    Genome Res 2007, 17:117-125. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  13. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments.

    Bioinformatics 2009, 25:1335-1337. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  14. Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles.

    J Mol Biol 2001, 313:1003-1011. PubMed Abstract | Publisher Full Text OpenURL

  15. Klein RJ, Eddy SR: RSEARCH: Finding homologs of single structured RNA sequences.

    BMC Bioinformatics 2003, 4(44):1471-2105. OpenURL

  16. Bussotti G, Raineri E, Erb I, Zytnicki M, Wilm A, Beaudoing E, Bucher P, Notredame C: BlastR–fast and accurate database searches for non-coding RNAs.

    Nucleic Acids Res 2011, 39:6886-6895. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  17. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the “decimal” release.

    Nucleic Acids Res 2011, 39:D141-D151. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  18. Rivas E, Klein RJ, Jones TA, Eddy SR: Computational identification of noncoding RNAs in E. coli by comparative genomics.

    Curr Biol 2001, 11:1369-1373. PubMed Abstract | Publisher Full Text OpenURL

  19. Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, Stadler PF: Mapping of conserved RNA secondary structures predicts thousands of functional non-coding RNAs in the human genome.

    Nat Biotech 2005, 23:1383-1390. Publisher Full Text OpenURL

  20. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D: Classification of conserved RNA secondary structures in the human genome.

    PLoS Comput Biol 2006, 2:e33. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  21. Torarinsson E, Sawera M, Havgaard J, Fredholm M, Gorodkin J: Thousands of corresponding human an mouse genomic regions unalignable in primary sequece contain common RNA structure.

    Genome Res 2006, 16:885-889. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  22. Will S, Missal K, Hofacker IL, Stadler PF, Backofen R: Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering.

    PLoS Comp Biol 2007, 3:e65. Publisher Full Text OpenURL

  23. Gräf S, Strothmann S, Kurtz S, Steger G: HyPaLib: a database of RNAs and RNA structural elements defined by hybrid patterns.

    Nucleic Acids Res 2001, 29:196-198. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  24. Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R: RNAMotif, an RNA secondary structure definition and search algorithm.

    Nucleic Acids Res 2001, 29(22):4724-4735. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  25. Höchsmann T, Höchsmann M, Giegerich R: Thermodynamic Matchers: strengthening the significance of RNA folding energies. In Computational Systems Bioinformatics, CSB 2006. Edited by Xu Y, Markstein P, Markstein P, Xu Y. Singapore: World Scientific; 2006:111-121. OpenURL

  26. The Athanasius FBompfünewererRNAConsortium:, Backofen R, Flamm C, Fried C, Fritzsch G, Hackermüller J, Hertel J, Hofacker IL, Missal K, Rose D, Stadler PF, Tanzer A, Washietl S, Sebastian W, Mosig SJ Axel Prohaska: RNAs everywhere: Genome-wide annotation of structured RNAs.

    J Exp Zool B: Mol Dev Evol 2007, 308B:1-25. OpenURL

  27. Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures.

    Proc of the Computational Systems Bioinformatics Conference, Stanford, CA, August 2003 (CSB 2003) 2003, 159-168. OpenURL

  28. Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons.

    Bioinformatics 2005, 21:3352-3359. PubMed Abstract | Publisher Full Text OpenURL

  29. Sankoff D: Simultaneous solution of the RNA folding, alignment, and proto-sequence problems.

    SIAM J Appl Math 1985, 45:810-825. Publisher Full Text OpenURL

  30. Gorodkin J, Heyer LJ, Stormo GD: Finding the most significant common sequence and structure motifs in a set of RNA sequences.

    Nucleic Acids Res 1997, 25:3724-3732. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  31. Hull Havgaard JH, Lyngsø R, Stormo GD, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%.

    Bioinformatics 2005, 21:1815-1824. PubMed Abstract | Publisher Full Text OpenURL

  32. Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences.

    J Mol Biol 2002, 317:191-203. PubMed Abstract | Publisher Full Text OpenURL

  33. Bradley RK, Pachter L, Holmes I: Specific alignment of structured RNA: stochastic grammars and sequence annealing.

    Bioinformatics 2008, 24:2677-2683. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  34. Hofacker IL, Bernhart SHF, Stadler PF: Alignment of RNA base pairing probability matrices.

    Bioinformatics 2004, 20:2222-2227. PubMed Abstract | Publisher Full Text OpenURL

  35. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure.

    Biopolymers 1990, 29:1105-1119. PubMed Abstract | Publisher Full Text OpenURL

  36. Heyne S, Will S, Beckstette M, Backofen R: Lightweight comparison of RNAs based on exact sequence-structure matches.

    Bioinformatics 2009, 25:2095-2102. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  37. Schmiedl C, Möhl M, Heyne S, Amit M, Landau GM, Will S, Backofen R: Exact pattern matching for RNA structure ensembles. In Proceedings of the 16th International Conference on Research in Computational Molecular Biology (RECOMB 2012), Volume 7262 of LNCS. Edited by Chor. B, Chor. B. Heidelberg: Springer-Verlag; 2012:245-260. OpenURL

  38. Backofen R, Will S: Local sequence-structure motifs in RNA.

    J Bioinf Comput Biol 2004, 2:681-698. Publisher Full Text OpenURL

  39. Backofen R, Siebert S: Fast detection of common sequence structure patterns in RNAs.

    J Discr Alg 2007, 5:212-228. Publisher Full Text OpenURL

  40. Will S, Yu M, Berger B: Structure-based whole-genome realignment reveals many novel noncoding RNAs.

    Genome Res. 2013, Jun;23(6):1018-27.

    http://dx.doi.org/10.1101/gr.137091.111.Epub2013Jan7 webcite

    PubMed Abstract | Publisher Full Text OpenURL

  41. Will S, Miladi CSM, Möhl M, Backofen R: SPARSE: Quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics. In Proceedings of the 17th International Conference on Research in Computational Molecular Biology (RECOMB 2013), Volume 7821 of LNCS. Edited by Deng M, Jiang R, Sun F, Zhang X. Heidelberg: Springer-Verlag; 2013:289-290. OpenURL

  42. Will S, Joshi T, Hofacker IL, Stadler PF, Backofen R: LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs.

    RNA 2012, 18:900-914. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  43. Nawrocki EP: Structural RNA homology search and alignment using covariance models.

    PhD thesis.

    Washington University, Saint Louis 2009

    OpenURL

  44. Gruber AR, Bernhart SH, Hofacker IL, Washietl S: Strategies for measuring evolutionary conservation of RNA secondary structures.

    BMC Bioinformatics 2008, 9:122. PubMed Abstract | BioMed Central Full Text | PubMed Central Full Text OpenURL

  45. Bompfünewerer AF, Backofen R, Berhart SH, Hertel J, Hofacker IL, Stadler PF, Will S: Variations on RNA folding and alignment: Lessons from Benasque.

    J Math Biol 2008, 56:129-144. PubMed Abstract | Publisher Full Text OpenURL

  46. Kolbe DL, Eddy SR: Local RNA structure alignment with incomplete sequence.

    Bioinformatics 2009, 25:1236-1243. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  47. Bernhart S, Hofacker IL, Stadler PF: Local RNA base pairing probabilities in large sequences.

    Bioinformatics 2006, 22:614-615. PubMed Abstract | Publisher Full Text OpenURL

  48. Lange SJ, Maticzka D, Mohl M, Gagnon JN, Brown CM, Backofen R: Global or local? Predicting secondary structure and accessibility in mRNAs.

    Nucleic Acids Res 2012, 40(12):5215-5226. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  49. Hertel J, de Jong D, Marz M, Rose D, Tafer H, Tanzer A, Schierwater B, Stadler PF: Non-coding RNA annotation of the genome of Trichoplax adhaerens.

    Nucleic Acids Res 2009, 37:1602-1615. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  50. Wan Y, Kertesz M, Spitale RC, Segal E, Chang HY: Understanding the transcriptome through RNA structure.

    Nat Rev Genet 2011, 12:641-655. PubMed Abstract | Publisher Full Text OpenURL