| SMOTIF: efficient structured pattern and profile motif searchDepartment of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
Algorithms for Molecular Biology 2006, 1:22doi:10.1186/1748-7188-1-22 The electronic version of this article is the complete one and can be found online at: http://www.almob.org/content/1/1/22
©
2006 Zhang and Zaki; licensee BioMed Central Ltd. AbstractBackgroundA structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. ResultsSMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. ConclusionSMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/sMotif/ webcite. BackgroundSearching biological sequence(s) for motifs is a fundamental task in bioinformatics. Motifs can be represented as either patterns over a specific alphabet, or profiles (also called positional weight matrix (PWM)), which give the probability of observing each symbol in each position. Motifs can be classified into two main types. If no variable gaps are allowed in the motif, it is called a simple motif. For example, in the genome of Saccharomyces cerevisiae, the binding sites of transcription factor, GAL4 [1], can be characterized by the simple motif shown in Table 1, which illustrates the pattern over the IUPAC alphabet (ΣIUPAC; see Table 2), as well as its profile (which gives the frequency of each DNA base at each position). The motif in Table 1 only consists of one component and thus is a simple motif. Since the symbols in the first 3 positions (CGS) and in the last 3 positions (SCG) are well conserved, we can also represent this motif as CGS[11,11]SCG, where [11,11] means that there is a fixed "gap" of length 11 between the two components. If variable gaps are allowed in a motif, it is called a structured motif. A structured motif can be regarded as an ordered collection of simple motifs with gap constraints between each pair of adjacent simple motifs. For example, the LTR retrotransposons from the Copia group, corresponding to genes encoding reverse transcriptase, in A. thaliana can be characterized by the structured motif M1 [2,5] M2 [6,7] M3, as shown in Table 3[2]. Here M1, M2 and M3 are three simple motifs; [2,5] and [6,7] are variable gap constraints ([minimum gap, maximum gap]) allowed between the adjacent simple motifs. Note that each simple motif Mi (with 1 ≤ i ≤ 3) can either be a pattern over ΣIUPAC or a profile over ΣDNA. Searching for structured motifs is more complicated than searching for simple motifs, and is an ongoing research area [3-7]. The sequence to be searched can be very long, e.g., chromosome 1 of Homo Sapiens contains 245 million (245M) base pairs. The structured motif can also be as long as several kilobases. All these factors need to be considered when designing an efficient structured motif search algorithm. More formally, a structured motif Given a collection of sequences, Depending on the application, the structured motif search problem can have several variations: Table 4. Structured Motif Search • Missing Components: The matching motifs can consist of some, instead of all, the simple motifs in • Approximate Matches: The matching motifs may consist of similar motifs (as measured by Hamming or Levenshtein distance [8]), instead of exact matches, to the simple motifs in • Overlapping Components: The variable gap constraints (li and ui) can take on a limited range of negative values, allowing search for overlapping simple motifs. We allow two adjacent components Mi and Mi+1 to overlap, but we require that Mi+1 does not precede Mi. This condition can be satisfied by the following constraints on the gap range [li, ui]: - |Mi| ≤ li ≤ ui, for i ∈ [1, k). For example the search for motif ACG[-2,2]CGA, can discover the overlapped occurrence ACGA, as well as the non-overlapped occurrence ACG- -CGA, at the two extremes of the gap range. • Profile Search: The components of the motif In this paper, we focus on the problem of searching for a given structured motif in one or more sequences. We propose SMOTIF, an efficient algorithm for structured motif searches. It uses an inverted index of symbol positions, and it finds all occurrences by positional joins over this index. For structured pattern search problem, we propose two main variants of our approach: i) a direct search for simple motifs and the structured motif via positional joins, and ii) a two-step approach, where we use a suffix tree to search for simple motifs and then use positional joins for the structured motif. For structured profile search problem, we first search each simple motif by aligning its profile with the sequences, and then search structured motifs with positional joins. SMOTIF allows missing components, overlapping motifs, and also approximate matches (when using the two-step approach). SMOTIF also allows flexible matches using IUPAC symbols. We apply SMOTIF for searching long DNA sequences for LTR retrotransposons, which constitute a substantial fraction of most Eukaryotic genomes and are believed to have a significant impact on genome structure and function [9,10]. We show that SMOTIF is effective in searching for composite regulatory patterns, and it can also suggest potentially new binding sites. We experimentally demonstrate the superiority of SMOTIF over SMARTFINDER, a state-of-the-art method for structured motif search, both in terms of time and space; SMOTIF can be up to 7 times faster and can consume 100 times less space. Related workMany existing pattern matching algorithms [8,11-18] can be used to solve the simple pattern search problem. Given the sequence length, n, and the pattern length, m, exact matching algorithm can run in O (n + m) [11]; approximate matching algorithm can run in O (rn) [16,17] or O (nm/w) [18], where r is the error threshold and w is the size of a computer word. The space complexity is O (n) for both exact matching and approximate matching. Several previous efforts have focused on the structured pattern search problem. Anrep [3,4] provides a unified biosequence pattern representation by using network expressions with spacers, where a network expression is a regular expression without Kleene closure. With network expressions, one can specify the scoring scheme and the threshold of approximate matching for each simple motif separately, the positional weights which express the relative importance of different parts of a motif, and whether a simple motif is optional. Anrep introduces a two-step approach: first it searches for simple motifs by a threshold-sensitive motif matching algorithm and then it finds the structured motif by an optimized backtracking matching algorithm. However, as compared by [6], Anrep is much slower than SMARTFINDER. In [5], the structured motif is called a Classes of Characters and Bounded Gaps (CBG) expression and is represented by a non-deterministic ε-automaton with bit parallelism. Two algorithms are proposed for CBG expression search: forward search and backward search. Bit parallelism speeds up the search, but is adequate only for a pattern whose maximum span is smaller than the length of the computer word. Also the implementation of CBG can only handle such pattern. This limits the application of CBG to searching for patterns with small number of symbols and gaps. SMARTFINDER [6,7], which is currently the most efficient method for structured motif search, is also a two-step approach. In the first step each simple motif is searched separately by building a suffix tree for the sequence. This step outputs the ordered occurrence lists of all simple motifs. The second step solves a constraint satisfaction problem by considering constraints individually in three sub-steps. First it considers the gap constraints and builds a constraint graph whose nodes are the simple motif occurrences and edges connect all possible pairs of nodes that locally satisfy the gap constraints. It then considers the constraint for the maximum number of missing components and shrinks the graph to contain only the nodes that can be in the structured motif occurrences. Finally it enumerates all the valid occurrences by a depth first search (DFS). Notable differences in SMOTIF and SMARTFINDER are as follows: we search patterns directly by positional joins over an inverted index, we consider variable gap constraints during the positional joins as opposed to building a constraint graph, and we handle missing components more efficiently by considering them over patterns instead of over each occurrence as in SMARTFINDER. Note also that like Anrep and SMARTFINDER, SMOTIF can also mine approximate patterns, when using the two-step approach, which we describe later. For profile search, MATCH [19], P-Match [20], and MatInspector [21,22] search DNA sequences against a position weight matrix library (such as TRANSFAC database [23]) and report the occurrences that satisfy given score thresholds. They compute the matrix score by multiplying the base frequency with the information content value at each position, in order to emphasize the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions. Besides the matrix score, they define a core region, which is usually the first 4–5 most conserved consecutive positions of the matrix, and perform the core score threshold check. Then they align the matrix to each position of the sequence and calculate the core score and matrix score. However, these algorithm don't consider the prior probability of each base when calculating the matrix (or core) score, and the core region is required to be consecutive. They need to check all positions of each subsequence (at least all the core positions) in order to calculate the matrix (core) score. Moreover, these algorithms only work on simple profile with one single matrix component. For structured profile search, only Anrep [3,4] provides the capability to model structured profiles, with its general network expressions. However, Anrep doesn't give a solution on score calculation and fast search for structured profiles. Moreover, its implementation doesn't support structured profile search. To our knowledge, SMOTIF is the only implemented method that can handle structured profile search. MethodsWe first introduce our basic approach for structured pattern search, and successively optimize it for various practical scenarios. We then present our approach for structured profile search. Structured pattern search: basic approachLet us assume that we are searching for a structured motif Depending on whether we compute the pos-lists for IUPAC symbols or not, SMOTIF uses two approaches: (a) DNA pos-lists: Here we keep (in memory) the pos-lists only for the four DNA symbols. For the other IUPAC symbols, we obtain their pos-lists by taking a union over the pos-lists of their constituting DNA symbols, e.g., Table 5. Pos-lists Positional joinsWe first extend the notion of pos-lists to cover structured motifs. The pos-list of
• d <l: Advance y to the next element in • d > u: Advance x to the next element in • l ≤ d ≤ u: Save this occurrence in The pos-list for X [l, u] Y can be computed in time linear in the lengths of Given a longer motif SMOTIF handles both simple and structured motifs uniformly, by adding the gap range [0, 0] between adjacent symbols within each simple motif Mi. For our example in Table 4, the structured motif Figure 2 shows how the positional joins work for our (expanded) motif from Table 4. At any stage in the process, the head symbol's pos-list corresponds to the full list of positions shown, whereas the tail's pos-list consists only of the positions shown in bold. For example, when computing A[1,4]CAT, the pos-list of the tail CAT is {2, 12, 15}, and that of the head symbol A is {3, 10, 13, 16}. Also, for illustration, we add a link between any two positions (x and y) in adjacent columns if their difference (d = y - x - 1) falls within the corresponding gap range. The joins begin with the last two symbols, with A as the head and T as the tail with a gap of [0,0]. The only positions x ∈
Full position recoveryIn our positional join approach, to save time and space we retain only the motif start positions, however, in some applications, we may need to know the full position of each occurrence, i.e., the set of matching positions for each symbol in the motif. We describe two approaches to recover the full positions: recomputed or indexed full-position recovery. Recomputed full position recoveryFor recomputing the full positions SMOTIF needs access to only the sequence S and the post-list As an example, let's assume we want to recover the full position for the motif
Indexed full position recoveryRather than recomputing the positions, we can "index" some information during the positional joins in order to facilitate full position recovery. For each suffix of Consider the example shown in Figure 6 to recover the full positions for our example motif from Table 4. Under each symbol we show two columns. The left column corresponds to the intermediate pos-lists
Sequence segmentationThe SMOTIF approach as described above works well for searching a motif in a relatively short sequence. For a very long sequence S (e.g., searching for (LTR) retrotransposons in an entire chromosome) the pos-lists can get very long in the initial stages, consuming a lot of memory. SMOTIF handles a long sequence by splitting it into several segments and searches each segment separately for the structured motif. That is, the sequence S is split into p equal partitions (except for the last one). Handling each smaller segment Si (i ∈ [l, p]) instead of the original S can save a lot of space and also reduces the total search time. After segmentation, to avoid missing any occurrence, we require that each partition Si, with i ∈ [l, p - 1], include the first L - 1 symbols from partition Si+1. Finally, to avoid duplicate occurrences, we discard all occurrences with a start position in the overlap region, since it would be reported when we process segment Si+1. For example, let S be the sequence in Table 6, and let the structured motif be So far we have assumed that we are searching for the structured motif in a single sequence. SMOTIF can easily handle a collection Table 6. Segmentation into p = 3 Missing componentsIn some applications a partial match of the structured motif might still be of interest. SMOTIF allows up to q simple motif components to be missing during the search. Let For example, if we allow one (q = 1) missing component for our structured motif in Table 4, the set of sub-motifs that need to be searched for are: GC[0,1]TTA[1,4]CAT, GC[1,8]CAT, GC[0,1]TTA and TTA[1,4]CAT. Note that it is straightforward to incorporate other approaches to compute new ranges into SMOTIF since it would only change the gap constraints. For example, li,j = minn ∈ [i,j-1] {ln} and ui,j = maxn ∈ [i,j-1] {un} is another possible way to compute the adjusted gap ranges. Instead of searching each sub-motif separately, we do an optimized search. We reuse the partial pos-lists created when using a depth first search to enumerate and search the sub-motifs. The idea is to re-use the pos-lists created for common suffixes when enumerating their sub-motif extensions. Two-step approach for structured pattern searchSo far we have described the direct method used by SMOTIF to search for the structured motif by positional joins over the symbols. In fact, SMOTIF, can also follow a two-step approach like in Anrep [4] and SMARTFINDER [6]. In the first step, given Exact matchingMany algorithms [11-14] exist for exact pattern matching. Like in SMARTFINDER we use a lazy suffix tree [11] to extract the pos-lists for all simple motifs. The matching occurrences are sorted after extracting them from the suffix tree to obtain the pos-list in sorted order. For an intermediate pattern Mi [li, ui]
Approximate matchingSeveral algorithms [8, 12, 15–18] exist for approximate pattern matching. For consistency, we used Sellers' dynamic programming algorithm [26], as implemented in SMARTFINDER, to extract the pos-lists for all simple motifs with approximate matches. This algorithm is not optimal and it can be replaced by more efficient ones [16-18]. Since we allow a specific Levenshtein distance [8] (i.e., insertions, deletions and substitutions) between the occurrences and the motif, the length of the occurrences can be different from the component length |Mi|. Thus we augment the pos-list to explicitly store the end position, in addition to the start position, for each occurrence. Figure 7(b) shows how the pos-list joins work for approximate matches of simple motifs. In the structured motif from Table 4, we consider the exact matches of GC and CAT, and the approximate matches of TTA within Levenshtein distance of 1. Each column in (b) shows the pos-list of a simple motif: the left sub-column is a list of its start positions and the right sub-column is a list of its end positions. We first join the pos-lists of TTA and CAT, checking for gap range [1,4]. We compare the end positions of TTA and the start positions of CAT and find that the pairs (9,12), (10,12), (10,15), and (11,15) all lie within the gap range (indicated by the links), and thus the pairs, (7, 10), (8, 9), (8, 10) and (8, 11) are retained in the resulting pos-list. Likewise (5, 6) is in the final pos-list, since after comparing the end position of GC, 6, with the start position of TTA, 7 and 8, we find d = 7 - 6 - 1 = 0 ∈ [0,1] and d = 8 - 6 - 1 = 1 ∈ [0, 1]. Structured profile searchHaving outlined our approach for structured pattern search, here we tackle the problem of structured profile search. The profile (also called a position weight matrix) for a structured motif Given a profile structured motif Weighted profile creationGiven the initial "raw" frequency profile for the structured motif where Whereas the weights computed above give the likelihood of observing a given symbol in a given position they do not account for the degree to which some symbols are conserved at some positions. We can adjust the weights by considering the information content at each position. The information content for a profile is given as: where Given the initial profile motif Figure 8 shows an example of computing a profile from a set of aligned structured motifs. There are 8 aligned motifs with different gaps between components. We first obtain the initial frequencies of each symbol x ∈ ΣDNA in each position j (
Profile scoringGiven the weighted profile Note that whereas Equations 4(a) and 4(b) are strictly in the range [0, 1], for 4(c) When applying the score threshold for an occurrence S', we require that its normalized score is above the threshold λ. For example, for Equation 4(b), In other words, instead of normalizing the score for each match, we take λn as the new normalized threshold for scoring the potential matches. Likewise we can get the new thresholds for Equations 4(a) and 4(c). For example, for 4(a) the normalized threshold would be λn = λ · Partial scoresFor profile matching problem, we are only given the score threshold λ for the whole structured motif. We here develop a method to compute a partial score threshold for any sub-profile, which can lead to great pruning efficiency. For a sub-profile In another word, the minimum threshold for any sub-profile Core scoresIn many biologically relevant motifs, some positions are more conserved than others. We call them core positions, and these are precisely those positions with high information content. We choose the top h (usually 4 to 6) positions in the profile with highest information content as the core positions. For any potential match S', we can compute its core score Motif enumerationSimple motif scoringGiven a profile motif To prune matches S' that will eventually not meet the score threshold, we check the score threshold as each position in S' is being considered. If the score for any prefix of S' falls below the score threshold, we can discard S'. In fact, before applying scores over all positions, we first consider the scores for the prefixes of the core positions within the component. This continuous check for the core positions and regular positions leads to very effective pruning. Note that as opposed to previous methods [19-22], our approach does not require the core positions be consecutive so as to find the most conserved parts in the profile. Figure 9 shows how we search for the example profile motif from Figure 8, namely M1 [0, 5] M2 [0, 9] M3. In Figure 9(a), under each component Mi, 1 ≤ i ≤ 3 a set of tuples are listed in the format (j, S',
Structured motif scoring and positional joinsAfter obtaining the pos-lists of simple motif components, we can enumerate structured motifs by doing positional joins on these pos-lists, as already outlined for structured pattern search. We compute the positional joins with Mi as the head and Mi+1 ⋯ Mk as the tail, as we start from Mk as the head and end at M1 as the head. During the positional joins we also check the partial structured score thresholds λn (Mi ⋯ Mk). If the check fails at any stage we prune the match candidate. We keep a pattern (along with its score) only if it satisfies the full structured score threshold λn. Figure 9(a) shows how to enumerate structured motifs via positional joins. The pos-list of each component is simply the set of positions (1st element of the quadruples) under it. For example, Full position recoveryOnce the pos-list for the profile The complete SMOTIF algorithm: complexity analysisThe pseudo-code for the complete SMOTIF algorithm is shown in Figure 10. We distinguish three cases: the direct (or one-step) approach, and the two-step approach for structured pattern search are denoted as SMOTIF-1, and SMOTIF-2, respectively. The approach for structured profile search is called SMOTIF-P. Let
Results and discussionPerformance comparisonWe have implemented the SMOTIF algorithm in standard C++. We compare our results with SMARTFINDER, the best previous algorithm for structured motif search. For fair comparison, the suffix tree used in SMOTIF-2 for finding the pos-list of simple motifs is the same as the one used in SMARTFINDER, namely the lazy mode suffix tree [11], and we apply Sellers' dynamic programming algorithm [26] for approximate matching. The programs were compiled with g++ v3.2.2 at the optimization level 3 (-O3). Unless indicated otherwise, we did the experiments on an Apple G5 with dual 2.7Ghz processors and 4GB memory running Mac OS X; the timings reported are total times for all steps of the algorithms; all our experiments use exact matching for the simple motifs, and we report the full position for the occurrences. SMOTIF: parameter settingsIn the first set of experiments we used the 5 chromosomes of Arabidopsis thaliana as our sequence set. These five chromosomes have lengths 29M, 19M, 22.7M, 16.9M and 25.7M base pairs, respectively. The structured motif Table 7. Real Motifs Figure 11 shows how SMOTIF performs on the A. thaliana chromosome 1, while searching for the Copia retrotransposon. We study the impact of allowing 0, 1 or 2 missing components out of the 6 simple motifs in the Copia structured motif. Figure 11(a) and 11(b) show the effect of sequence segmentation on SMOTIF-1 and SMOTIF-2, respectively. The x-axis shows the length of the segments, whereas the y-axis shows the time. The rightmost end point on the x-axis shows the case when the sequence is not segmented. For these experiments we used the IUPAC pos-list approach. The different curves in each figure show the effect of 0,1, or 2 missing components. Both figures suggest the best segment length is 10,000 base pairs. At that segment length, sequence segmentation can speed up the search around 2 to 3 times for SMOTIF-1 (depending on the number of missing components) and 4 times for SMOTIF-2. In the following experiments, we use 10,000 as the default segmentation length.
Figure 11(c) and 11(d) compare the time and memory usage for SMOTIF-1 when we use the recomputed or indexed full position recovery. We find that the indexed approach is about 2 times faster than recomputing the full positions, and at the same time it can consume up to 20 times less memory! The effects are more pronounced for more missing components. Henceforth, we use the indexed approach to full position recovery. Figure 11(e) compares the two methods for handling IUPAC symbols in SMOTIF-1, namely DNA pos-lists or IUPAC pos-lists. We find that there is not much difference between the two when 0 or 1 missing components are allowed; but for 2 missing components, IUPAC pos-lists have a slight advantage. In terms of memory space, even though the IUPAC pos-lists approach stores the pos-lists for each distinct symbol that appears in the structured motif, since only one segment is processed at one time, the space utilization of the two approaches is comparable. For example, with no missing components, the IUPAC and DNA pos-lists consume 2.98MB and 2.82MB memory, respectively. Henceforth, we use the IUPAC pos-lists approach. Figure 11(f) shows the time for SMOTIF-1 and SMOTIF-2 on each of the 5 chromosomes of A. thaliana, with no missing components. The chromosomes are arranged by increasing length on the x-axis. We find that the search time increases linearly with the lengths of the sequences. We also observe that the direct approach, SMOTIF-1, outperforms the two-step approach, SMOTIF-2, when searching for the Copia retrotransposon. SMOTIF and SMARTFINDER: comparisonComparison on A. thalianaFigure 12(a) and 12(b) compare time and memory usage, respectively, for SMOTIF-1, SMOTIF-2 and SMARTFINDER, when searching for the Copia retrotransposon in chromosome 1 of A. thaliana. We can see that SMOTIF-2 is around 5 times faster and takes around 8 times less memory than SMARTFINDER. Their running time and memory usage increase slowly with the number of missing components. This is because in both approaches, the simple motif search step is the same for different number of missing components and usually takes most of the time and memory. SMOTIF-1 always outperforms SMARTFINDER, both in time and memory usage. SMOTIF-1 is more than 7 times faster and takes 100 times less memory than SMARTFINDER! Its time increases linearly with the number of missing components. While SMOTIF-1 is faster when there are no missing components, SMOTIF-2 does better when there are two missing components. This is because SMOTIF-1 does the positional joins on the pos-lists of individual symbols, rather than simple motifs, so over all the sub-motifs of the structured motif, the pos-lists of simple motifs may be redundantly computed several times in SMOTIF-1, whereas these are computed only once in SMOTIF-2. Thus the time of SMOTIF-1 varies depending on the number of missing components. However, since we keep one pos-list per DNA/IUPAC symbol, the memory usage remains almost unchanged for SMOTIF-1 algorithm.
To directly evaluate SMARTFINDER's constraint graph approach with our positional join approach, in Figure 12(c), we compare the time for the second step of SMOTIF-2 and SMARTFINDER, after the suffix tree is built in the first step. We observe that SMOTIF-2 is orders of magnitude faster than SMARTFINDER in finding all the occurrences that satisfy the structured gap constraints depending on the number of missing components allowed. We conclude that doing positional joins on the inverted index is an efficient way of enumerating all occurrences. We also multiply aligned 36 A. thaliana LTR retrotransposons from Repbase Update [2] database, which belong to Copia repeat class, and have reverse transcriptase as the keyword. We then extracted four conserved features from the alignment results, shown as Table 8. Search Time for Several Real Motifs Searching motifs with long gapsOne application of SMOTIF is to find structured motifs with long gaps. For example, we extracted the motif DNNNNDRYW [2578, 4202]RNNGVHVY, from the same 36 LTRs of A. thaliana mentioned above. Here we retain only the first and last conserved components in the resulting alignment of the 36 sequences. Note the relatively long gap ranges, with minimum gap l = 2578 and maximum gap u = 4202. SMOTIF was able to extract the approximately 83 million full positions, corresponding to 3 million start positions, in just 35.75s (SMOTIF-1) and 15.72s (SMOTIF-2). Comparison on random motifsHere we used chromosome 20 of Homo Sapiens as the sequence; it has length 61M base pairs. We generate 100 random structured motifs in the ΣIUPAC alphabet, with k ∈ [3,8] simple motifs of length l ∈ [5,10] (k and l are selected uniformly at random within the given ranges). The gap range between each pair of simple motifs is a random sub-interval of [-5, 100]. Note that the negative minimum gap shows that SMOTIF can mine overlapping simple motifs. Here we also compare the structured profile search approach SMOTIF-P, as follows: for each random motif, we form a profile by first expanding the IUPAC symbols into their corresponding DNA symbols and assign them a random probability of occurrence which accounts for 90% of the share, whereas the other DNA symbols randomly share in the remaining 10%. We use SMOTIF-P to search for the profiles with 2 as the number of core positions in each simple motif, λc = 0.1 as the core score threshold and λ = 0.6 as the total score threshold. We use random motifs mainly to demonstrate the effects of various parameters on SMOTIF and SMARTFINDER. Figure 13(a)–(d) show the results. Here we do not allow missing components. As noted before, we may find overlapping occurrences if a negative gap is present in a motif. Figure 13(a) shows how the running time varies with the sum of the number of occurrences of the simple motifs in each of the 100 random motifs. For clarity, each point reflects the average time for the number of occurrences in the given range on the x-axis. For example, the first point on the x-axis [0, 1) corresponds to the case when there are between 0 and 1 million occurrences found. The general trend is that it takes more time as the number of occurrences increases. Figure 13(b) shows the time with respect to the number of occurrences of the whole structured motif (again, for clarity, only average times are plotted for occurrences in the given ranges in the x-axis). We observe that the time increases slightly with increase in the occurrences. In general, the times are more sensitive to the number of intermediate (simple) occurrences. Figure 13(c) shows the effect of the number of simple components in the structured motif. Each point shows the average time over all motifs having the given number of simple motifs. Here again the time increases with increasing components. Finally Figure 13(d) shows the impact of the number of IUPAC symbols in the structured motifs; the trend being that the more the symbols the more time it takes to search. We also observe that the approaches scale linearly, on average, with respect to the different parameters. Also SMOTIF remains about 5–10 times faster than SMARTFINDER over all these experiments.
Table 9 shows the mean and variance of the search times over all the 100 structured motifs. It also shows the time for finding only the start positions or the full positions for SMOTIF. Overall, for these random motifs, we find that on average SMOTIF-1 is the fastest, SMOTIF-2 and SMOTIF-P are comparable, and all three outperform SMARTFINDER by a factor of 4 to 6. Table 9. Random Motifs: Mean and Variance It is interesting to note that SMOTIF is more stable than SMARTFINDER: SMOTIF-P has around 8 times less variance than SMARTFINDER, SMOTIF-2 has around 5 times less variance than SMARTFINDER, whereas SMOTIF-1 has around 17 times less variance than SMARTFINDER. Note also that the overhead in recovering the full positions from the start positions is negligible. Application: composite regulatory patternsThe complex transcriptional regulatory network in Eukaryotic organisms usually requires interactions of multiple transcription factors. A potential application of SMOTIF is to search for such composite regulatory binding sites in DNA sequences. We took two such transcription factors, URS1H and UASH, that are known to cooperatively regulate 11 yeast genes [28]. These 11 genes are also listed in SCPD [1], the promoter database of Saccharomyces cerevisiae. In 10 of those genes the URS1H binding site appears downstream from UASH; in the remaining one (HOP1) the binding sites are reversed. We took the binding sites for the 10 genes, and after their multiple alignment, we obtained the composite motif NNDTBNGDWGDNNDH[5,179]WBRGCSGCYVW, where we represent each column in the alignment with the IUPAC symbol corresponding to the bases that appear at that position. We also extracted the profile for these 10 binding sites. Table 10 shows the binding sites for the 10 genes, their alignment, and the start positions and the distances between the sites (the difference of start positions). The smallest distance is 20 and the largest is 194. Since these are start positions, the variable gap range is obtained by subtracting the length (15) of UASH to obtain l = 20 – 15 = 5 and u = 194 – 15 = 179. Notice also how the alignment preserves the highly conserved GCSGC region in URS1H. Table 10. UASH and URS1H Binding Sites We then searched for the structured motif in the upstream regions of all 5873 genes in the yeast genome. We used the -800 to -1 upstream regions, and truncated the segment if it overlaps with an upstream open reading frame (ORF). As a result, 5794 sequences with average length of 497 bases are left. By searching for the IUPAC pattern, we found 65 occurrences, including the 10 originally known sites, within 1 second. By searching for the profile with 5 as the number of core positions in each simple motif, λc = 0.6 as the core score threshold and λ = 0.8 as the total score threshold, we found 56 occurrences in 0.18 seconds. For each occurrence, we then extracted its actual sequence segment in the matching upstream regions. Since the structured motif represented by IUPAC symbols may be too general, for each matching segment, we calculated its hamming distance to one of the 10 known binding sites. We then selected an occurrence as a possibly new binding site if the minimum hamming distance to any of the 10 known sites is within a given maximum threshold value. Table 11 lists the 12 newly found occurrences in the entire yeast genome (upstream regions) using a hamming distance threshold of 5. The sites discovered using both pattern and profile search are listed. These new occurrences could be putative binding sites for the two transcription factors UASH and URS1H. Table 11. Potential Binding Sites Upon further analysis, we found that in fact, the new occurrence in MEK1 (at positions -233,-136) that we found is also listed in the SCPD database as a binding site. SCPD lists one site for UASH at position -233, and two sites for URS1H at positions -136 and -150. To construct the motif, we had used -150 as the site for URS1H, without knowledge of the other site. SMOTIF was thus able to automatically find the other site based on the extracted motif! For REC114, we also found another occurrence at positions -158 (UASH) and -94 (URS1H). However, this is not reported in SCPD. To further analyze the remaining new occurrences, we consulted the SGD (Saccharomyces Genome Database) Gene Ontology Term Finder [29] to find the inter-relationships between the genes. The first three rows of Table 12 show the significant GO terms (biological process or molecular function) that are common to the genes corresponding to a new occurrence and its closest (known) gene. The rest of the table shows the significant terms among the 18 genes. These results indicate that at least some of the new occurrences (such as SPO1, HSP60, MES1, and GNT1) have a potential to be binding sites since they share some significant processes with the known sites' genes. Out of these SPO1 has the highest potential to be a new binding site, since it is known that UASH and URS1H are involved in early meiotic expression, during sporulation [28]. Table 12 shows that SPO1 shares meiosis and M phase of meiotic cell cycle with the rest of the genes. After searching for SPO1 in SGD database, we found that SPO1 is a transcriptional regulator involved in sporulation, and required for middle and late meiotic expression. This increases our confidence that SPO1 has high potential to be a previously unknown binding site. Table 12. Genes and Significant Gene Ontology (GO) Terms Finally, since we knew that in gene HOP1, the URS1H binding site appears upstream from UASH, we wanted to see if we could extract the "reversed" binding site. We search for the original and the reversed motifs using a hamming threshold of 6. We found 34 new binding sites where UASH can appear either up-or down-stream from URS1H. Among these we found two possible potential binding sites for the gene HOP1, with UASH at position -201 and URS1H at positions -534 and -175. The former pair (-201,-534) is in fact a known binding site as reported in the SCPD database [1]. This once again showcases the ability of SMOTIF to find potential new binding sites. ConclusionWe introduced SMOTIF, a fast and efficient algorithm to search structured pattern and profile motifs in biological sequences. We showed its applications in searching for composite regulatory patterns, long terminal repeat retrotransposons, and for searching long range motifs. SMOTIF is also computationally more efficient than previous state-of-the-art methods like SMARTFINDER [6]. In biosequence analysis there are four related structured motif problems depending on whether the simple motifs and gap ranges in the structured motif are known or not: (i) Structured motif search [4-6]: all simple motifs and gap ranges are known; (ii) Structured motif extraction [30-32]: all simple motifs are unknown and all gap ranges are known; (iii) Extended structured motif search: all simple motifs are known and all gap ranges are unknown; and (iv) Extended structured motif extraction: all simple motifs and gap ranges are unknown. In this paper we tackled problem (i). A companion paper [33] tackles problem (ii). In the future, we plan to develop efficient algorithms for the other two motif problems as well. Authors' contributionsAll authors contributed equally to this work. AcknowledgementsThis work was supported in part by NSF CAREER Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, and NSF grants EIA-0103708 & EMT-0432098. We thank the anonymous reviewers for their many helpful suggestions. References
Have something to say? Post a comment on this article! |




on Google Scholar







author email
corresponding author email





Figure 1.
Figure 2.
Figure 3.
Figure 4.


Figure 5.
Figure 6.


Figure 7.





Figure 8.






Figure 9.


Figure 10.
Figure 11.
Figure 12.
Figure 13.