Abstract
Background
Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well-established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultraconserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically sound way. The detection of clustering often relies on densities and sliding-window approaches or on arbitrarily chosen distance thresholds.
Results
We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method in a web server connected to a MySQL backend, which also determines the colocalization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and the outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome.
Conclusions
WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method as a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php and includes additional features such as the detection of colocalization with gene regions and an annotation enrichment tool for the functional analysis of overlapped genes.
Background
Genome entities as diverse as genes [1], CpG dinucleotides [2], transcription factor binding sites (TFBSs [3]) or ultraconserved non-coding regions [4] usually form clusters along the chromosome sequence. Such spatial clustering often translates into genome structures with a clear functional and/or evolutionary meaning: gene clusters encoding the same or similar products that originated through gene duplication events, CpG islands, cis-regulatory modules, etc. Thus, the spatial clustering of functional genome elements (in general, words or k-mers) somewhat resembles the situation in literary texts, where keywords show a strong clustering, whereas common words are randomly distributed [5].
Despite its potential importance, no algorithm exists to detect the clustering of DNA words in a rigorous way. Most current methods are based on densities and sliding-window approaches or on arbitrary distances. For example, the Galaxy work suite ([6], http://main.g2.bx.psu.edu/) implements an algorithm which lets the user fix the maximum distance between two entities and the minimum number of entities in the cluster. Recently, we developed an algorithm to detect clusters of CpG dinucleotides in DNA sequences based on the distance between neighboring CpGs, to which a statistical significance is then assigned [7]. Now, we generalize the method to any k-mer or any arbitrary combination of k-mers, as well as to any other genome entity defined by its chromosome coordinates.
Implementation
The WordCluster algorithm allows the detection of clusters of DNA words (k-mers) and genomic elements (genes, transposons, SINEs, TFBSs, etc.). The algorithm is based on the distances between the entities and an assigned p-value.
The algorithm
The algorithm is basically the same for k-mers and genomic elements, except for the detection of the coordinates and the way the success probabilities are calculated. Briefly, the algorithm performs the following steps:
1. Detection of all k-mer copies in the chromosomes, storing their coordinates (this step is unique to the detection of k-mer clusters, as the genomic elements already come defined by their coordinates). The copies are detected in a non-overlapping way, i.e. once a copy is found the search is resumed at the end of the word, thus preventing the detection of overlapping copies.
2. Calculation of the distances between consecutive copies. The distance is defined as: "start coordinate of the downstream copy" minus "end coordinate of the upstream copy". This implies that the minimum distance is 1 when the two entities are located directly next to each other.
3. Detection of the clusters, defined as those chromosomal regions where all distances are equal to or below a given maximum distance. A cluster is defined by its start and end coordinates and the number of k-mers or genomic elements it contains.
4. Calculation of the statistical significance for each cluster by means of the negative binomial distribution. A p-value threshold is then used to filter out those clusters which are not statistically significant.
A main difference from the originally described algorithm is the way N-runs in the DNA sequence (ambiguous sequence sites occupied by any nucleotide) are treated. While the original CpGcluster method allows up to 10 Ns between two consecutive CpGs, WordCluster detects the DNA words and the distances strictly within the contigs, i.e. not a single N is allowed to lie between two copies.
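Steps 1 to 3 and the contig rule can be sketched in a few lines of Python. This is a minimal illustration for a single target k-mer, not the Java code behind the web server; the function names and the `min_copies` parameter (which drops single-copy "clusters") are assumptions made for the example.

```python
import re

def find_copies(contig, kmer, offset=0):
    """Step 1: non-overlapping copies as 1-based inclusive (start, end) pairs."""
    copies = []
    pos = contig.find(kmer)
    while pos != -1:
        copies.append((offset + pos + 1, offset + pos + len(kmer)))
        # Resume the search at the end of the word to avoid overlapping copies.
        pos = contig.find(kmer, pos + len(kmer))
    return copies

def find_clusters(sequence, kmer, max_distance, min_copies=2):
    """Steps 2-3: chain copies whose distance is <= max_distance.

    min_copies is an assumption of this sketch, not a documented parameter.
    Returns (start, end, number_of_copies) tuples.
    """
    clusters = []
    # Work contig by contig so that no N lies between two copies of a cluster.
    for contig in re.finditer(r"[^N]+", sequence.upper()):
        current = []
        for copy in find_copies(contig.group(), kmer.upper(), contig.start()):
            # Distance: start of downstream copy minus end of upstream copy.
            if current and copy[0] - current[-1][1] > max_distance:
                if len(current) >= min_copies:
                    clusters.append((current[0][0], current[-1][1], len(current)))
                current = []
            current.append(copy)
        if len(current) >= min_copies:
            clusters.append((current[0][0], current[-1][1], len(current)))
    return clusters

seq = "ATGCATGCAATGC" + "T" * 20 + "ATGC" + "NN" + "ATGCATGC"
print(find_clusters(seq, "ATGC", max_distance=5))
```

In this toy sequence the isolated copy after the T-run is not chained to the first cluster (its distance exceeds the threshold), and the copies after the NN gap form a separate cluster because they lie in a different contig.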
Statistical significance
From now on, we will have to use the word k-mer in different contexts. Therefore, to avoid confusion, we define as "target k-mer(s)" the k-mer(s) being analysed, i.e. those for which the clusters are going to be detected. In contrast, "no-target k-mer(s)" are all the remaining k-mer(s). We use k-mer in a generic way, referring to all DNA words of length k.
The statistical significance is calculated as the cumulative distribution function of the negative binomial distribution:
$$P = \sum_{j=0}^{n_f} \binom{n-1+j-1}{j}\, p^{\,n-1}\,(1-p)^{j}$$
where $n$ is the number of target k-mers within the cluster and $n_f$ is the number of "failures", i.e. the number of no-target k-mers. For example, if we are detecting clusters of AGCT, all k-mers other than AGCT would be considered as failures. Finally, $p$ is the success probability, i.e. the probability to find a target k-mer or genomic element within the DNA sequence. Note that in the above equation we use $(n-1)$ instead of $n$, as the first appearance of a target k-mer within the cluster is trivial (i.e. all the clusters start with a target k-mer). While the negative binomial distribution can be defined in the same way for k-mers and genomic elements, differences exist in the way the number of "failures" and the success probability are calculated.
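As a rough numerical illustration, the cumulative negative binomial probability described above can be evaluated directly with the standard library. The sketch assumes the usual parameterization (probability of observing at most n_f failures before the (n-1)-th success) and an integer number of failures; the function name is hypothetical.

```python
from math import comb

def cluster_p_value(n, n_failures, p):
    """P(at most n_failures failures before the (n-1)-th success).

    n          -- number of target copies in the cluster (n >= 2)
    n_failures -- integer number of failures n_f
    p          -- success probability, 0 < p <= 1
    """
    r = n - 1  # the first copy of every cluster is trivial, hence n - 1
    if r < 1:
        raise ValueError("a cluster needs at least two target copies")
    return sum(
        comb(r + j - 1, j) * p**r * (1 - p) ** j
        for j in range(n_failures + 1)
    )

# A dense cluster (many copies, few failures) is unlikely by chance.
print(cluster_p_value(n=10, n_failures=20, p=0.01))
```

For very long clusters the binomial coefficients become large; a library routine such as `scipy.stats.nbinom.cdf(n_failures, n - 1, p)` would be the more robust choice there.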
For k-mers, the number of failures $n_f$ is simply given by
$$n_f = L_c - n \cdot k$$
where $L_c$ is the length of the cluster, $k$ the length of the target k-mer and $n$ the number of non-overlapping target k-mers in the cluster. The number of failures is the number of no-target k-mers within the cluster. For example, given the target k-mer ATGC, the cluster ATGCATGC would give $n_f = 0$ while ATGCAATGC would give $n_f = 1$. Each k-mer can overlap with itself and other k-mers, but here we consider just non-overlapping occurrences. In such a case, the probabilities for k-mers are given by the following equation
$$p = \frac{N}{L_s - k + 1 - N\,(k-1)}$$
where $N$ is the number of non-overlapping occurrences of the target k-mers in the sequence, $k$ the length of the k-mer and $L_s$ the sequence length. The formula is simply the number of target k-mers in the sequence divided by the total number of k-mers in the sequence. As we do not consider overlapping instances, $N(k-1)$ is subtracted from the total number of k-mers $(L_s - k + 1)$, because those sequence positions are not considered.
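The two quantities can be checked against the examples in the text with a short sketch. It reads the description above as n_f = L_c - n·k and p = N / (L_s - k + 1 - N·(k - 1)); the function names and the numbers in the last call are illustrative only.

```python
def kmer_failures(cluster_length, k, n):
    """Number of no-target positions in a cluster: n_f = L_c - n * k."""
    return cluster_length - n * k

def kmer_success_probability(n_total, k, seq_length):
    """Target k-mers divided by the number of non-overlapping k-mer slots.

    (seq_length - k + 1) is the total number of k-mers in the sequence;
    n_total * (k - 1) positions are removed because overlapping copies
    of the target are not counted.
    """
    return n_total / (seq_length - k + 1 - n_total * (k - 1))

print(kmer_failures(len("ATGCATGC"), k=4, n=2))    # cluster ATGCATGC
print(kmer_failures(len("ATGCAATGC"), k=4, n=2))   # cluster ATGCAATGC
print(kmer_success_probability(n_total=10, k=4, seq_length=1000))
```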
For genomic elements, it is less clear how to define the number of failures. For example, consider a cluster of 5 elements with a mean length of 300 bp and an average distance of 250 bp between neighboring elements. The question is how many "no-elements" this cluster contains, i.e. how many failures. We define the number of failures as
$$n_f = \frac{L_{no}}{L_{mean}}$$
where $L_{no}$ is the number of bases in the cluster not belonging to the genomic element and $L_{mean}$ the mean length of the genomic element. Thus, this number is an approximation to the number of "no-elements" within the cluster. Finally, the success probability is then given as
where $L_s$ is the length of the sequence, $L_{mean}$ the mean length of the genomic elements and $N$ the number of genomic elements.
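The example cluster above can be worked through numerically. The failure count follows the definition in the text (non-element bases divided by the mean element length). The expression used for the success probability, N·L_mean / L_s, is an assumption of this sketch: it treats the sequence as L_s / L_mean element-sized slots, N of which are occupied, by analogy with the k-mer case. All function names and the genome-scale numbers are hypothetical.

```python
def element_failures(non_element_bases, mean_length):
    """Approximate number of 'no-elements' in a cluster: L_no / L_mean."""
    return non_element_bases / mean_length

def element_success_probability(n_elements, mean_length, seq_length):
    """ASSUMED form: N occupied slots out of L_s / L_mean element-sized slots."""
    return n_elements * mean_length / seq_length

# Five 300 bp elements separated by about 250 bp each: four gaps,
# i.e. roughly 1000 bases that do not belong to any element.
n_f = element_failures(non_element_bases=4 * 250, mean_length=300)
print(round(n_f, 2))

# Hypothetical sequence: 1000 elements of mean length 300 bp in 3 Mb.
print(element_success_probability(1000, 300, 3_000_000))
```

Because n_f is generally not an integer for genomic elements, it would have to be rounded (or the cumulative sum generalized) before it is used as the upper limit of the negative binomial sum; how this is handled is not stated in the text.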
Distance models
The maximum distance is the main parameter of the algorithm, as it determines the copies belonging to each cluster. We have shown previously [7] that, for most human chromosomes, the median of the observed distance distribution of CpGs lies near the intersection between the observed and the expected distance distributions. The intersection can be interpreted as the point separating the intra-cluster from the inter-cluster distances. In this new tool, we added two more distance models based on the direct detection of this intersection (one genome-wide and the other for each chromosome separately). In this way, WordCluster implements a total of four different distance models:
1. Percentile distance: The distance corresponding to a given percentile of the observed distance distribution is calculated and used as the maximum distance threshold.
2. Chromosomal intersection: The distance corresponding to the intersection between the observed and the expected distributions is used as the maximum distance (see Figure 1).
Figure 1. Distance distributions. Expected and observed distance distributions for human chromosomes 16 (above) and 5 (below). It can be seen that for chr16 the median, the chromosome intersection and the genome intersection are very close (within 1 bp), while for chromosome 5 notable differences exist (from 33 bp to 49 bp).
3. Genome intersection: The distance distributions for all chromosomes are merged, and the distance corresponding to the "genome intersection point" is then calculated. If this distance model is chosen, the success probability (i.e. the probability to find the target k-mers) is not calculated for each chromosome separately (as in the two models above); instead, a single genome-wide success probability is calculated.
4. Fixed distance: the user can set the distance threshold.
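The percentile and intersection models can be illustrated with a small sketch. The expected distance distribution is not spelled out in this section; the code assumes a geometric distribution with success probability p (the expectation for randomly placed copies), and it takes the intersection to be the first distance at which the observed relative frequency is no longer above that expectation. Both choices, and the function names, are assumptions of the example.

```python
import math
from collections import Counter

def percentile_distance(distances, percentile=50):
    """Smallest observed distance d such that at least `percentile` percent
    of all observed distances are <= d (percentile=50 gives the median)."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def intersection_distance(distances, p):
    """First distance where the observed frequency stops exceeding the
    expected one. ASSUMPTION: expected frequency is geometric in d."""
    total = len(distances)
    observed = Counter(distances)
    for d in range(1, max(distances) + 1):
        expected = p * (1 - p) ** (d - 1)
        if observed[d] / total <= expected:
            return d
    return max(distances)

distances = [1, 1, 1, 1, 2, 2, 3, 10, 20, 40]
print(percentile_distance(distances, 50))
print(intersection_distance(distances, p=0.1))
```

Real distance histograms are noisy, so a practical implementation would probably smooth or bin the observed distribution before searching for the crossing point. For the genome intersection, the same function would be applied to the merged distances of all chromosomes together with a genome-wide p.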
Webserver
We implemented the described algorithm in a web server. The tool uses PHP for the interaction with the user and to access the core program (written in Java) and the MySQL database. Two types of input data can be supplied: 1) a group of k-mers and a genomic sequence to be scanned by the program (users can upload their own sequence or choose one of the 24 genome assemblies stored in our database; see below); and 2) a file in BED format [8,9] with the coordinates of the genomic elements whose clustering properties should be analyzed. No mandatory input parameters exist, but the user can select between different distance models (the default is the chromosome intersection) and set the cutoff for the statistical significance (the default here is p-value ≤ 1E-5).
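For the second input type, the following sketch shows one way to turn BED records into the distances used by the algorithm. It is not the server's parser. BED uses 0-based, half-open intervals, so the sketch converts them to 1-based inclusive coordinates; with that convention, "start of the downstream element minus end of the upstream element" equals 1 for directly adjacent elements, as in the distance definition given earlier. Overlapping elements (which would yield distances below 1) are not handled.

```python
from collections import defaultdict

def read_bed(lines):
    """Parse the first three BED columns into per-chromosome element lists
    with 1-based inclusive (start, end) coordinates, sorted by position."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start) + 1, int(end)))
    for elements in by_chrom.values():
        elements.sort()
    return dict(by_chrom)

def consecutive_distances(elements):
    """Start of the downstream element minus end of the upstream element."""
    return [nxt[0] - prev[1] for prev, nxt in zip(elements, elements[1:])]

bed = [
    "track name=example",
    "chr1\t100\t200",
    "chr1\t200\t300",   # directly adjacent to the previous element
    "chr1\t350\t400",
    "chr2\t0\t10",
]
elements = read_bed(bed)
print(consecutive_distances(elements["chr1"]))
```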
The output generated by the web server depends on whether the user chooses a genome assembly from our database or supplies an anonymous sequence. The minimum output consists of the basic statistics of the clusters (base composition, entity composition and statistical significance) and the statistics by chromosome. Furthermore, for all species in the database, the colocalization of detected clusters with different gene regions (promoters, introns, etc.) is reported.
Finally, for some species (human, mouse, rat, cow, C. elegans, zebrafish and chicken) an enrichment/depletion analysis for the genes overlapped by the clusters is carried out using the Gene Ontology [10] and the AnnotationModules database [11,12].
Database
Currently, 24 genome assemblies are stored in our database. The following sequences were downloaded from the UCSC genome browser or the corresponding project homepages (plant genomes): Human (hg18, hg19), Mouse (mm8, mm9), Rat (rn4), Fruit fly (dm3), Anopheles gambiae (anogam1), Honey bee (apimel2), Cow (bosTau4), Dog (canFam2), C. briggsae (cb3), C. elegans (ce6), Sea squirt (ci2), Zebrafish (danrer5), Chicken (galgal3), Stickleback (gasacu1), Medaka (orylat2), Chimp (pantro2), Rhesus macaque (rhemac2), S. cerevisiae (saccer1), Tetraodon (tetnig1), Arabidopsis thaliana (tair8, tair9), and Zea mays (zm1). To determine the colocalization with genes, we used RefSeq genes whenever they were available [13], and Ensembl genes otherwise [14].
Results and Discussion
To demonstrate the ability of our algorithm to find biologically significant and relevant clusters in the genome, and at the same time to illustrate the different distance models, we carried out three analyses: 1) detection of clusters of CpGs (CpG islands) using different distance models, 2) detection of clusters of the word CWG (where W = A, T) and 3) detection of clusters of olfactory receptor genes in human chromosome 11.
Detection of CpG islands with different distance models
We chose this example because the detection of CpG islands was the reason for developing the algorithm from which WordCluster was derived [7]. In the original CpGcluster algorithm, we used a percentile of the observed distance distribution as the distance model (apart from the fixed distance), suggesting the median as the default parameter. We did this because we observed that the intersection between the observed and expected distance distributions is often very close to the median of the observed distance distribution (see Figure 1). This intersection can be interpreted in the following way. When the observed curve lies above the expected, theoretical curve, more CpGs exist at this distance than expected by chance. Figure 1 shows that this is generally the case for short distances, indicating the clustering (overrepresentation of short distances) of CpG dinucleotides. The intersection defines the "reversal point", i.e. at distances larger than this point, the CpG dinucleotides are no longer clustered. Therefore, the strict use of the intersection might define better clusters than the use of the median, which is a mere approximation to the intersection point. Furthermore, we observed that for some chromosomes the intersection and the median differ slightly. To clarify the impact of this change in the maximum distance, we predicted CpG islands by means of the median (cpg50), the chromosome intersection (cpgISc) and the genome intersection (cpgISg), and then assessed the prediction quality by some of the criteria previously described [7,15]. Table 1 shows that the mean island length for both intersection models is clearly below that of the original cpg50 islands. This can be explained by the fact that the intersection models produce, on average, shorter distance thresholds, which leads to fragmentation, shortening and disappearance of some cpg50 CpG islands.
Consequently, the chromosome intersection model (cpgISc) predicts fewer islands than the original cpg50 algorithm (3979). Nevertheless, the genome intersection (cpgISg) yields more predictions compared to cpg50 (5535). The latter observation can be explained by the fact that the predictions are made with a single, genome-wide probability. The p-value assigned to each cluster depends on the success probability, and in G+C-rich chromosomes the genome-wide probability is much lower than the chromosome probability. This leads to smaller p-values in G+C-rich chromosomes, so that more islands pass the p-value threshold. For example, cpg50 predicts 2434 islands in chromosome 22 while cpgISg predicts 5197. In AT-rich chromosomes this effect is reversed but less pronounced (the difference between the genome-wide and chromosome probabilities is smaller in AT-rich than in GC-rich chromosomes), and therefore a higher total number of islands is predicted.
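The dependence of the p-value on the success probability can be made explicit in the simplest case. As a sketch, assuming the negative binomial form with n - 1 successes described in the Statistical significance section, a cluster of n = 2 target copies separated by n_f failures has

```latex
P \;=\; \sum_{j=0}^{n_f} p\,(1-p)^{j} \;=\; 1 - (1-p)^{\,n_f+1}
```

which increases monotonically with p. For the same cluster (same n and n_f), replacing a high chromosome-specific p by a lower genome-wide p therefore yields a smaller p-value, so more clusters fall below the significance threshold; the opposite happens when the genome-wide p is higher than the chromosome-specific one.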
Table 1. WordCluster predictions of CpG clusters*
Next, we analyzed the predictions from a functional perspective. Table 2 shows the overlap of the predictions with RefSeq genes [13], Alu elements and phylogenetically conserved PhastCons elements [16]. The cpgISg predictions show the highest overlap with the promoter region (R13) and with conserved PhastCons elements, while showing the lowest overlap with spurious Alu elements. This might indicate that the cpgISg predictions are slightly better than the other two, the original cpg50 and cpgISc. However, 1) the differences seem to be rather small and 2) a more detailed analysis would be needed to resolve this question.
Table 2. Biological meaning of WordCluster predictions*
Independently of this open question, we can summarize: 1) the chromosome intersection seems to be a good replacement for the median and furthermore removes one input parameter from the method, as the intersection is a fixed statistical property of the chromosome; 2) the genome intersection may be used when the expected clusters are known not to depend on the chromosome. CpG islands probably do not depend on the chromosome, as the biological mechanisms forming and maintaining them are probably the same for all chromosomes. This suggests the use of the genome intersection, a choice supported by the slightly better results it produces compared with the other two distance models tested.
Detection of CWG clusters
Besides the conventional CpG context, the CWG context has recently been shown to be a potential target for methylation [17]. WordCluster detects 84996 CAG/CTG clusters in the human genome (NCBI 36, hg18) significant at the 1E-5 level using the chromosome intersection (Table 1). We found a high number of statistically significant CWG clusters scattered along all human chromosomes, many of which overlap gene regions (Table 3). To check whether the detected clusters might be biologically meaningful, we compared the percentage of methylated words (CAG and CTG) inside and outside the clusters. We observed that 26.7% of all CAG/CTG trinucleotides are methylated inside the clusters, while 45.3% of them are methylated when located outside a cluster. It therefore seems that, as occurs in CpG islands, CAG/CTG clusters remain unmethylated with a much higher probability than the bulk DNA.
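The inside/outside comparison amounts to classifying each CAG/CTG position by cluster membership and computing the methylated fraction in each class. The sketch below shows one way to do this for a single chromosome; the data layout (positions with a boolean methylation call, clusters as sorted, non-overlapping intervals) and the toy values are assumptions made for illustration, not the data behind the percentages reported above.

```python
from bisect import bisect_right

def methylation_by_location(sites, clusters):
    """Percentage of methylated sites inside and outside clusters.

    sites    -- iterable of (position, is_methylated)
    clusters -- sorted, non-overlapping (start, end) pairs, inclusive
    """
    starts = [start for start, _ in clusters]
    counts = {"inside": [0, 0], "outside": [0, 0]}  # [methylated, total]
    for pos, methylated in sites:
        # Candidate cluster: the last one starting at or before pos.
        i = bisect_right(starts, pos) - 1
        inside = i >= 0 and pos <= clusters[i][1]
        key = "inside" if inside else "outside"
        counts[key][0] += int(methylated)
        counts[key][1] += 1
    return {
        key: (100.0 * meth / total if total else float("nan"))
        for key, (meth, total) in counts.items()
    }

clusters = [(10, 20), (50, 60)]
sites = [(12, True), (15, False), (20, False), (55, False),
         (5, True), (30, False), (61, True), (70, True)]
print(methylation_by_location(sites, clusters))
```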
Table 3. Clusters of CWG trinucleotides*
Detection of olfactory gene clusters
As a third example, we used WordCluster to search for significant clusters of olfactory receptor (OR) genes, the largest multigene family in multicellular organisms, whose members are known to be clustered within vertebrate genomes [18,19]. Table 4 shows the basic statistics for the 13 clusters of OR genes detected by our algorithm in human chromosome 11. Figure 2 compares the clusters predicted by WordCluster with the clusters currently annotated in the CLIC/HORDE database [19] in a selected region of chromosome 11. Our algorithm predicts a higher number of clusters, all of which are statistically significant.
Table 4. Clusters of OR genes in human chromosome 11*
Figure 2. Clusters of OR genes. A region of human chromosome 11 showing OR genes (green), the clusters annotated in the CLIC/HORDE database (blue) and the statistically significant clusters predicted by WordCluster (red). Our algorithm predicts more compact clusters compared to the CLIC/HORDE annotation. For example, in the first and third HORDE clusters pronounced gaps exist between the genes, which are detected by WordCluster but ignored by the CLIC/HORDE annotation. The figure was generated using the UCSC Genome Browser [8].
Conclusions
WordCluster generalizes the previous CpGcluster algorithm [7] to any word or genomic element in the genome, at the same time associating a statistical significance with the clusters found. It outperforms current methods relying on densities and sliding-window approaches or arbitrarily chosen distance thresholds. The implementation as a web server connected to a MySQL backend allows for colocalization studies with different gene regions, as well as for genome-wide enrichment/depletion analysis of functional terms (GO).
Availability and requirements
The WordCluster web server (http://bioinfo2.ugr.es/wordCluster/wordCluster.php) is freely available. No registration is needed, but every access is logged. For large jobs, a long-lived web link to the results is provided.
List of abbreviations used
k-mer: DNA word (oligonucleotide) with length k; SINEs: Short interspersed nuclear elements; TSS: Transcription Start Site; TFBS: Transcription Factor Binding Site; R13: promoter region [TSS-1500 bp; TSS+500 bp].
Competing interests
None declared
Authors' contributions
MH developed and implemented the algorithm and wrote the manuscript (with JLO); PC and PB carried out the theoretical analysis of word clustering and helped with the interpretation of statistical results; GB and AMA retrieved and organized the genome and methylation databases; and JLO developed the algorithm and wrote the manuscript (with MH). All the authors critically read and approved the final version.
Acknowledgements
The Spanish Government grants BIO200801353 to JLO, mobility PR20090285 to PC, Spanish Junta de Andalucía grants P07FQM3163 to PC and P06FQM1858 to PB are acknowledged. The Spanish 'Juan de la Cierva' grant to MH and Basque Country 'Programa de formación de investigadores del Departamento de Educación, Universidades e Investigación' grant to GB are also acknowledged.
References

1. Durand D, Sankoff D: Tests for gene clustering. J Comput Biol 2003, 10:453-482.
2. Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. Journal of molecular biology 1987, 196:261-282.
3. Makeev VJ, Lifanov AP, Nazina AG, Papatsenko DA: Distance preferences in the arrangement of binding motifs and hierarchical levels in organization of transcription regulatory information. Nucleic acids research 2003, 31:6016-6026.
4. Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B: Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 2004, 5:99.
5. Carpena P, Bernaola-Galván P, Hackenberg M, Coronado AV, Oliver JL: Level statistics of words: finding keywords in literary texts and DNA. Phys Rev E 2008, 79:035102-035104.
6. Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, et al.: Galaxy: a platform for interactive large-scale genome analysis. Genome Res 2005, 15:1451-1455.
7. Hackenberg M, Previti C, Luque-Escamilla PL, Carpena P, Martínez-Aroza J, Oliver JL: CpGcluster: A distance-based algorithm for CpG-island detection. BMC Bioinformatics 2006, 7:446.
8. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al.: The UCSC Genome Browser Database: 2008 update. Nucleic acids research 2008, 36:D773-779.
9. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 26:841-842.
10. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25:25-29.
11. Hackenberg M, Matthiesen R: Annotation-Modules: a tool for finding significant combinations of multisource annotations for gene lists. Bioinformatics (Oxford, England) 2008, 24:1386-1393.
12. Hackenberg M, Matthiesen R: Algorithms and methods for correlating experimental results with annotation databases. Methods in molecular biology (Clifton, NJ) 2009, 593:315-340.
13. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 2007, 35:D61-65.
14. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al.: Ensembl 2009. Nucleic acids research 2009, 37:D690-697.
15. Hackenberg M, Barturen G, Carpena P, Luque-Escamilla PL, Previti C, Oliver JL: Prediction of CpG-island function: CpG clustering vs. sliding-window methods. BMC Genomics 2010, 11:327.
16. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005, 15:1034-1050.
17. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al.: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009, 462:315-322.
18. Aloni R, Olender T, Lancet D: Ancient genomic architecture for mammalian olfactory receptor clusters. Genome biology 2006, 7:R88.
19. The HORDE Project [http://genome.weizmann.ac.il/horde/] [http://bioportal.weizmann.ac.il/HORDE]