| Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules1Institute of Genetics and Selection of Industrial Microorganisms, GosNIIGenetika, 117545 Moscow, Russia 2MIGEC, INRIA Rocquencourt, 78153 Le Chesnay, France 3GREYC, CNRS UMR 6072, Laboratoire d'informatique, 14032 Caen, France 4Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Puschino, Moscow Region, Russia 5Puschino State University, Puschino, Moscow Region, Russia 6Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
Algorithms for Molecular Biology 2007, 2:13doi:10.1186/1748-7188-2-13 The electronic version of this article is the complete one and can be found online at: http://www.almob.org/content/2/1/13
©
2007 Boeva et al; licensee BioMed Central Ltd. AbstractBackgroundcis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. ResultsWe developed and implemented an algorithm computing the p-value that s different motifs occur respectively k1, ..., ks or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. MethodThe algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the O(n|Σ|(m| ConclusionThe primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. AvailabilityProject web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/ webcite BackgroundDuring the past few years, a number of computational tools have been designed [1-3] for locating potential transcription factor binding sites (TFBSs) in nucleotide sequences, e.g., in compilations of sequences upstream of putative co-regulated genes. In parallel, experimental approaches were developed [4], which allowed identification of binding motifs for many different transcription factors. Experimental [5] and bioinformatical [6] studies demonstrated that sequences of regulatory DNA that bind transcription factors can exhibit many different types of architecture. In eukaryotes TFBSs found in DNA sequences often form rather dense clusters: this was demonstrated both by experimental [5,7] and computational [8,9] methods. Such clusters can contain sites binding the same factor or several different factors [10]. The cis-regulatory module (CRM) in this case contains respectively homotypic or heterotypic clusters of motifs specifically recognized by binding proteins [11]. The particular arrangement of motifs in a homotypic or heterotypic cluster is not random, and it is commonly accepted, that the motif arrangement within a CRM is important for its functionality [12-20]. Bioinformatics studies indicate that antagonistic factors often bind to overlapping sites [21] whereas synergetic factors are often positioned within a fixed distance [20], often close to the multiple of 10.2 bp, the DNA double-helix pitch value [21]. Non-random arrangements of TFBSs within regulatory segments of DNA sequences are exploited in several TFBS identification tools, and it was observed that cooperativity-based discrimination of TFBSs surpasses the performance of models for individual TFBSs [22]. On observing a cluster of TFBSs in some genome segment one can calculate the probability of observing similar site arrangements in a random sequence. This idea of evaluating the statistical significance of heterotypic clusters of sites was implemented in many programs including ClusterDraw [23], ModuleSearcher [24], MCAST [25], eCIS-ANALYST [26], Cister [27], Cluster-Buster [28] and TargetExplorer [29]. At the moment, such programs use empirical procedures like motif counting in biological and simulated sequences to assess the significance of observed site clustering. But it is highly desirable to have a good statistical measure of site clustering, and we believe that the best measure is the p-value of obtaining the observed cluster by chance in a random sequence of a Markov or Bernoulli (common name for Markov chain of order 0) type. In the case of heterotypic clusters one needs to take into account possible overlapping occurrences of different motifs, a problem that was considered difficult until now [30]. In the case of homotypic clusters, an approximate statistical scoring function was constructed [8,31]; this approach has been implemented in algorithms like FLYENHANCER [32], SCORE [33], and CLUSTER [34]. However, this approximation performs poorly for highly overlapping TFBSs. One cannot ignore site overlapping if the motifs are fuzzy (highly degenerate), which is often the case for so-called "shadow sites" [31]. In the case of heterotypic clusters, competing factors can bind even to very well determined motifs that overlap. Representation of protein binding motifs in nucleotide sequencesExperimental methods on protein binding to DNA usually locate some DNA segment, or word in DNA text, as a probable binding target. Proteins can bind to similar DNA words [4], the whole assembly of which can be called a motif. The simplest motif representation is the enumeration of sequences that can be bound by a transcription factor (TF) [35]. Sometimes, information about binding sites can be found in SELEX [36,37] or Protein Binding Microarray (PBM) experiments [38]. However, it is possible that such experiments do not give the exhaustive list of sequences of binding sites, so one needs to expand the list of putative binding sites using an appropriate criterion, which brings about the problem of the generalization of several known examples. For instance, several words aligned with mismatches, can be generalized to IUPAC string (like RSTGACTNMNW for AP-1 binding sites [39]) by disregarding correlated substitutions in different motif positions [40]. Another example of generalization is the set of words that can deviate from a consensus word for less than a given number of mismatches. The most popular way to represent binding sites is a Position Weight Matrix (PWM), which is also called position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM) [41]. For a text with length D over an alphabet Σ with |Σ| symbols, a PWM is a |Σ| × D matrix: each row corresponding to a symbol of the alphabet Σ, and each column to a position in the motif. For DNA texts, one has Σ = {A, C, G, T}. The PWM score is defined as Any of the three motif representations above can be converted to a list of words. The same is true for many other representations of motifs. In this study, we consider only the motifs that can be represented as a set of words. P-value for clusters of motif occurrences, problem formulationThe objective of this work is to develop a statistical criterion to assess clustering of TFBS. Intuitively, a TFBS cluster is a DNA segment simultaneously containing "too many" TFBSs for given factor proteins; such a segment can often operate as a CRM regulated by these TFs. From a formal point of view, the problem we address here is as follows. Let s sets of words Related workMost previous works address counting problems for one set of several words All methods of solving the problem of p-value calculations for multiple occurrences of words from a set Therefore, one tries to compute the sequence of ( Linear inductionIn the first class of methods [43-46], one computes, implicitly or explicitly, probabilities P (Ln ( Algebraic FormulaeIn a second class of methods [47-52], a preprocessing computes generating functions In a second step, probabilities P (Ln ( In [49,53], A language approach [50] or an induction [48] leads to a formal expression that depends on the words overlaps. The main drawback is that these methods need to compute the determinant of a matrix of polynomials with a huge dimension, e.g. O (| When the preprocessing step is achievable, the extraction step is amenable to the solution of a linear recurrence of degree m| Finally, approximations are available, the computation of which is constant with respect to n, but not to MethodsHere we consider in detail the approach we suggest. A motif assigned to a TF is a finite set of words Let ( To be more precise, there is a probability distribution defined on the set Σn of all texts of length n in the alphabet Σ; the most widely used models are random Bernoulli trials and a Markov model of order K. Denote as Ln ( Our approach to the calculation of this p-value is similar to that published in [61], which was used there to calculate seed sensitivity in local alignment search. The approach exploits the fact that the algorithm of Aho and Corasick [62] can be modified to efficiently determine whether a given text belongs to the set Ln ( We start from the simplest case of one motif Construction of Aho-Corasick traversalAho and Corasick [62] have proposed the algorithm determining if a given text T contains an occurrence of a word from a given set The classic Aho-Corasick algorithm is a tree traversal determined by a transition function Given a text T read from left to right, let T [i] denote the letter of T at position i. Let qi be the largest suffix in text T[1] ⋯ T [i] that belongs to ∀i ≥ 0, qi+1 = δ (qi, T [i + 1]), with the initial condition q0 = ε. Example: Let Table 1. Values of δ function for the set
The combination of tree Bernoulli text model. Probability to find at least one occurrence of a single motifIn this section we consider the simplest case. One computes the p-value for a single motif in a text Tn of length n, assuming that Tn is generated by independent Bernoulli random trials over alphabet Σ. The algorithm computes probabilities P (Ln ( To describe the algorithm we divide the set Σi of all texts Ti of length i into classes that do and do not contain occurrences of Definition 1 A text Ti belongs to class Ci (0; q) iff 1. Length of Ti is i, 2. Ti does not contain words from 3. A traversal AC ( A text Ti belongs to class Gi (1) iff (i) Length of Ti is i, (ii) Ti does contain at least one occurrence of a word from For a given number i larger than m, the union for classes Ci (0; q), where q is in Let P (Cn (0; q)) and P (Gn (1)) denote probabilities that a text Tn belongs to class Cn (0; q) and Gn (1), respectively. Then, Ln ( The algorithm calculates probabilities P (Ci (0; q)) and P (Gi (1)) using induction on length i. For i = 0, these probabilities obviously comply with: P (C0 (0; ε)) = 1; P (C0 (0; q)) = 0, for any q ≠ ε; P (G0 (1)) = 0. The values of P (Ci+1 (0; q)) and P (Gi+1 (1)) are calculated using values of P (Ci (0; q)) and P (Gi (1)). Therefore, the needed space is proportional to the size of Calculation of values P (Ci+1 (0; q)) and P (Gi+1 (1)) is based on the following observations. Let U be a set of texts of the same length over the alphabet Σ, P (U) the probability of U in the Bernoulli model and a a character in Σ. Let U·a be the set of all possible concatenations, i.e., U·a = {xa|x ∈ U}. And in the case of the Bernoulli model P (U·a) = P (U) P (a).(1) Then the following relations hold for any i ∈ {1, ..., n - 1} and Σ: (i) if the text Ti contains a word from Gi (1)·a ⊂ Gi+1 (1).(2) (ii) if the text Ti does not contain a word from if δ (q, a) ∈ otherwise Ci (0; q) ⊂ Gi+1 (1).(4) Remembering that classes Ci (0; q) for different q and Gi (1) form a partition of Σi, we obtain the following relation for the texts containing words from Similarly, classes of texts that do not contain words from Classes Ci (0; q) for different q in The run-time for each step of the computation of Ci+1 (0; q) and Gi+1 (1) is O (| The approach described in this section can be readily extended to the case of multiple occurrences of motif Additional file 1. Bernoulli text model. Probability to find multiple occurrences of a single motif. The detailed description of the algorithm for the p-value calculation in the case of multiple occurrences of a single motif. Format: PDF Size: 95KB Download file This file can be viewed with: Adobe Acrobat Reader Bernoulli text model. Probability to find multiple occurrences of multiple motifsDNA transcription is usually regulated with several factors simultaneously interacting with DNA and specifically recognizing different DNA sites. Individual regulatory segment of DNA can contain many binding sites for several factors, often substantially overlapping with each other [5]. This brings about a problem of studying of co-occurring motifs. Let ( Let us consider the union All texts Tn of length n are classified into classes depending on occurrences of different Definition 2 Let the target number of occurrences of motif Definition 3 A text Ti belongs to class Ci (λ1, ..., λs; q), 0 ≤ λi ≤ ki iff 1. Length of Ti equals i, 2. The occurrence index of motifs ( 3. A traversal AC ( A text Ti belongs to class Gi (k1, ..., ks) if it belongs to the union of classes The desired p-value P (Ln ( where the summation in the second sum is performed over all allowed s-tuples of indexes (r1, ..., rs) which together make the set of s-tuples J. A s-tuple of indexes (r1, ..., rs) belongs to J if it complies with the following conditions: 2. if q' 3. if q' ∈ Implementation detailsOur basic data structure is the prefix tree; we use its standard representation [42] [see also Additional files 2 and 3 for Tree construction from PWM motif representation]. Each tree node q ∈ Additional file 2. Tree construction from PWM motif representation. The brief description of the procedure of the prefix tree construction from PWM motif representation. Format: PDF Size: 102KB Download file This file can be viewed with: Adobe Acrobat Reader Additional file 3. Tree construction from PWM motif representation. Steps of the prefix tree construction for a PWM and a given cut-off. Format: BMP Size: 556KB Download file At stage (i + 1) of probability computation the values P (Ci+1 (λ1, ..., λs; q)) become computed from the values P (Ci (λ1, ..., λs; q)) obtained at the previous stage of induction. Therefore, at stage (i + 1), one no longer needs the values calculated at stage (i - 1). Thus, each node is supplied with two k1 × ⋯ × ks-arrays of real values C0 and C1 for storing P (Ci (λ1, ..., λs; q)) and P (Ci+1 (λ1, ..., λs; q)) for different λj. C0 is used to store probabilities for even text lengths while C1 for odd. In implementation the calculation of values P (Ci+1 (λ1, ..., λs; q')) from P (Ci (λ1, ..., λs; q)) for all q', q ∈ 2. if q' ∈ 3. if q' ∈ At the stage i = n the desired p-value is the sum Markov text modelTree approach and the recursion (11) can be readily extended to calculate p-values of motif occurrences in random texts generated by the Markov model of order K. Given the order K of the Markov model, the probability p(a) in (11) depends on K previous letters. Thus, if the length |q| of the prefix q is less than K, one cannot calculate p(a) knowing only the prefix q. To overcome this we divide each class Ci (r1, ..., rs; q), where |q| = d <min (K, i) into subclasses Ci (r1, ..., rs; q, w); each subclass corresponds to a word w of length min (K, i) - d. Then, a text Ti of length i belongs to class Ci (r1, ..., rs; q, w) if the suffix of Ti of length min (K, i) equals to w·q. Figure 2 gives an example for Markov model of order K = 1. The tree is constructed for the set
The recursive equations for probabilities P (Ln ( The Markov extension is currently implemented for K = 1. ComplexityTo resume, the computation of P (Ln ( When several sets are involved, the number of nodes in the tree Results and discussionWe developed an algorithm for precise calculation of the p-value for multiple occurrences of multiple motifs with possible overlaps. The running time is linear in the text length and depends on the alphabet size, the maximal motif length, the number of words in the motifs, and the number of occurrences of each motif. The algorithm was implemented in the AHOPRO software. Below we give examples of how p-values can be used for studying gene regulation in silico, particularly for selecting optimal cutoff values for motifs represented by PWMs. In the subsection 'Comparison with simulation and approximation methods' we compare our p-value computations with the result of Monte Carlo simulations and the Poisson approximation. Our results confirm the accuracy of our algorithm and show in what cases the Poisson approximation [8,11] cannot be employed. In the subsection 'Optimal cutoffs', we apply AHOPRO to choose an appropriate cutoff score for Position Weights Matrices. In the subsection 'Assessment of gene regulation', we show how AHOPRO can be used for studying regulatory regions containing heterotypic clusters of TFBSs to distinguish genes that are regulated by given transcription factors from those that are not. As a model example, we use in this section data published in [34] on regulatory clusters in D. melanogaster. This compilation includes information on (i) known binding motifs for transcription factors, (ii) known CRM regions, and (iii) known regulatory interactions. Comparison with simulation and approximation methodsIn our first example we use the even-skipped stripe 2 enhancer (eve2) [63] of length 728 bp that is known to contain binding sites for TFs bicoid, kruppel and hunchback. Below we compare p-values calculated by the AHOPRO program and those calculated using compound Poisson approximation with p-values computed through Monte Carlo simulations. AhoPro and Monte Carlo comparisonsTable 2 displays results of comparison of p-values calculated with AHOPRO and with Monte Carlo simulation assuming the Bernoulli model M0. The corresponding results for the first order Markov model M1 are displayed in Table 3. Letters probabilities for M0 and the transition matrix for M1 were evaluated from eve2 sequence. We used the PWM cutoff values taken from [34], i.e., 5.3, 5.0, and 6.2 for bicoid, kruppel, and hunchback respectively. With these threshold values in sequence eve2 we have found 3, 4, and 2 occurrences of motifs of each type respectively. In Tables 2 and 3 we listed the p-values, i.e, the probabilities to find no less than the observed number of occurrences of motifs in a random text of length L, where L is the length of eve2 enhancer. The number of Monte Carlo simulations was set to 106 everywhere, except for the triplet (bcd&kr&hb), where we did 107 simulations. The probability to find the observed number of occurrences of (bcd&kr&hb) simultaneously in the same simulated sequence is extremely low; thus we increased the number of simulations so that the product of the probability by the number of simulations be greater than 1. Table 2. Comparison of p-values calculated by the AHOPRO program, by Monte Carlo simulations and by compound Poisson distribution formula under the M0 model Table 3. Comparison of p-values calculated by the AHOPRO program, by Monte Carlo simulation and by compound Poisson distribution formula under the M1 model The results of comparison of the AHOPRO computation with those obtained from simulated random sequences presented in Tables 2 and 3 confirm the accuracy of our algorithm. Poisson approximationIn practical application, compound Poisson distribution [64] is widely used to assess p-values of multiple motif occurrences [2,8,34,65]. Here we apply it to compute the probability to observe the given number of motif occurrences when the probabilities of individual words are calculated adopting the M0 or M1 models described above. The results of the comparison given in corresponding columns in Tables 2 and 3 show that the p-value calculated using Poisson approximation can be significantly underestimated. This happens most probably because the Poisson approximation does not take into account possible overlaps between motif occurrences and considers motif occurrences as independent. The error increases when the p-value is calculated for simultaneous occurrences of several factors, as it is done in the last two rows. In this case, the Poisson approximation p-value for a combination of several TFs is calculated as a product of p-values calculated independently for each TF. Actually, the motif occurrences can overlap especially when the motifs resemble each other, thus there is no independence, which brings about the error. Optimal cutoffsBelow, we use AHOPRO to determine the optimal cutoff values for PWMs of regulatory factors, given the sequences of regulatory region assumedly interacting with the factors. The distribution of occurrences of TF binding sites in corresponding experimentally confirmed regulatory regions is strongly biased [34]. In CRMs binding sites often tend to occur in clusters, which is not the case for random sequences. Different cutoff values correspond to different numbers of putative binding sites of different quality. The higher the cutoff value, the closer the motif occurrences are to the consensus and the smaller the number of motif occurrences. Therefore, for a given factor it is reasonable to select a cutoff value that minimizes the probability of finding in the random sequence the number of motif occurrences observed in the sequence of the regulatory region. As an example, we considered again transcription factors bicoid, kruppel, which are known to regulate the even-skipped stripe 2 (eve2) enhancer. To select the optimal cutoff value we used the following procedure: first, in the sequence of eve2 we counted occurrences of motifs with a score greater than the cutoff with cutoff values varied from 3 to 8.5. Therefore, each pair of cutoff values (S1, S2) corresponded to (k1, k2) occurrences for motifs of bicoid and kruppel respectively. For each pair (k1, k2), we computed p-value Pn (k1 (S1), k2 (S2)), which is denoted below as P (S1, S2). That is the probability to obtain at least k1 occurrences of bicoid, with scores greater than S1, and at least k2 occurrences of kruppel, with scores greater than S2. In Figure 3, a 3D-surface is shown, where (x, y, z) corresponds to (S1, S2, - log10 P (S1, S2)), the cutoff value for bicoid motif, the cutoff value for kruppel motif and -logarithm of the corresponding p-value calculated for the M1 model respectively. The view to the surface from the above is shown in Figure 3C. The maximal value for – log10 P (S1, S2), 6.3 |




on Google Scholar







author email
corresponding author email











Figure 1.










Figure 2.

