Abstract
Chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabasesized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionallyrelevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The ensemble of domains we identify allows us to quantify the degree to which the domain structure is hierarchical as opposed to overlapping, and our analysis reveals a pronounced hierarchical structure in which larger stable domains tend to completely contain smaller domains. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone marks at the boundaries.
Keywords:
Alternative topological domains; Chromatin conformation capture; Dynamic programmingBackground
Chromatin interactions obtained from a variety of recent experimental techniques in chromosome conformation capture (3C) [1] have significantly advanced our understanding of the geometry of chromatin structure [2], its relation to the regulation of gene expression, nuclear organization, cancer translocations [3], and copy number alterations in cancer [4]. Recently, dense, contiguous regions of chromatin termed topological domains have been discovered in both mammals [5] and in fruit flies [6]. Topological domains have since been incorporated into many subsequent analyses [79] due to the fact that they are persistent across cell types, conserved across species, and serve as a skeleton for the placement of many functional elements of the genome [10,11].
3C experiments result in matrices of counts that represent the frequency of crosslinking between restriction fragments of DNA that are spatially near one another. The identification of domains in Dixon et al. [5] employed a Hidden Markov Model (HMM) on these interaction matrices to identify regions initiated by significant downstream chromatin interactions and terminated by a sequence of significant upstream interactions. A defining characteristic of the domains from their analysis is that higher frequency 3C interactions tend to occur within domains as opposed to across domains. This aspect of domains is also reflected in the blockdiagonal structure of 3C interaction matrices as shown in Figure 1. In this sense, domains can be interpreted as contiguous genomic regions that selfinteract frequently and are more spatially compact than their surrounding regions.
Figure 1. Interaction matrix for a portion of human chromosome 1 from a recent HiC experiment by Dixon et al. [5]. Each axis represents a location on the chromosome with a step of 40kbp. Densely interacting domains identified by the method of Dixon et al. are shown in red. Alternative domains are shown as dotted black lines on the upper triangular portion of the matrix. Visual inspection of the lower triangular portion suggests domains could be completely nested within another and highly overlapping when compared to Dixon et al.’s domains. This motivates the problem of identifying alternative domains across length scales.
However, the single collection of megabasesized domains may not be the only topologically and functionally relevant collection of domains. On closer inspection of the blockdiagonal matrix structure in Figure 1, it becomes clear that there are alternative contiguous regions of the chromosome that selfinteract frequently and are likely more spatially compact than their surrounding regions (dotted lines). Some of these regions appear to be completely nested within others, suggesting a hierarchy of compact regions along the chromosome, while others appear to overlap each other. These observations suggest that functionallyrelevant chromosomal domains may exist at multiple scales potentially contributing to a hierarchy of domains or a more complex relationship between domains.
We introduce a new algorithm to efficiently identify topological domains in 3C interaction matrices for a given domainlength scaling factor γ. Our formulation of this problem as a dynamic program allows for an efficient traversal of the solution space to obtain alternative optimal and nearoptimal domain sets. Our results suggest that there exist a handful of characteristic resolutions across which domains are similar. Based on this finding, we identify a consensus set of domains that persists across various resolutions. We find that domains discovered by our algorithm are dense and cover interactions of higher frequency than interdomain interactions. Additionally, we show that interdomain regions within the consensus domain set are highly enriched with insulator factor CTCF and histone modification marks. We analyze a set of domains from multiple optimal domain sets across scales and establish that the organization of domains is highly hierarchical, suggesting that the generated domains can be used as the basis for understanding the hierarchical organization of the genome and its role in gene regulation. We argue that our straightforward approach retains the essence of the more complex multiparameter HMM introduced in [5] while allowing for the flexibility to identify biologically relevant domain structures at various scales.
Problem definition
Given the resolution of the 3C experiment (say, 40kbp), the chromosome is broken into n evenly sized fragments. 3C contact maps record interactions between different sections of the chromosome in the form of a weighted adjacency matrix A where two fragments i and j interact with frequency A_{ij}.
Problem1 (Resolutionspecific domains).
Given a n×n weighted adjacency matrix A and a resolution parameter γ≥0, we wish to identify a set of domains D_{γ} where each domain is represented as an interval d_{i}= [ a_{i},b_{i}], 1≤a_{i}<b_{i}≤n such that no two d_{i} and d_{j} overlap for any i≠j. Additionally, each domain should have a larger interaction frequency within the domain than to its surrounding regions.
Specifically, we seek to identify a set of nonoverlapping domains D_{γ} that optimizes the following objective:
where D_{γ} chosen from the set of all possible domains, and q is a function that quantifies the quality of a domain [a_{i},b_{i}] at resolution γ. Here, the parameter γ is inversely related to the average domain size in D_{γ}: lower γ results in sets of larger domains and higher γ corresponds to sets of smaller domains. Since domains are required to contain consecutive fragments of the chromosome, this problem differs from the problem of clustering the graph of 3C interactions induced by A, since such a clustering may place noncontiguous fragments of the chromosome into a single cluster. In fact, this additional requirement allows for an efficient optimal algorithm.
Problem2 (Consensus domains across resolutions).
Given A and a set of resolutions Γ={γ_{1},γ_{2},…}, identify a set of nonoverlapping domains D_{c} that are most persistent across resolutions in Γ:
where D_{c} is the set of nonoverlapping persistent domains across resolutions, and p(a_{i},b_{i},Γ) is the persistence of domain [ a_{i},b_{i}] corresponding to how often it appears across resolutions.
Algorithms
Domain identification at a particular resolution
Since each row and corresponding column in a 3C interaction matrix encodes a genomic position on the chromosome, we can write the solution to objective (1) as a dynamic program:
where OPT_{1}(l) is the optimal solution for objective (1) for the submatrix defined by the first l positions on the chromosome (OPT_{1}(0)=0). The choice of k encodes the size of the domain immediately preceding location l. We define negativescoring domains as nondomains and, as such, only domains with q>0 in the max term in (3) are retained.
Our quality function q is:
where
is a scaled density of the subgraph induced by the interactions A_{gh} between genomic loci k and l. When γ=1, the scaled density is the weighted subgraph density [12] for the subgraph induced by the fragments between k and l, which is the uppertriangular portion of the submatrix defined by the domain in the interval [ k,l] divided by the scaled length (l−k)^{γ} of the domain. When γ=2, the scaled density is half the internal density of a graph cluster [13]. For larger values of γ, the length of a domain in the denominator is amplified, hence, smaller domains would produce larger objective values than bigger domains with similar interaction frequencies. Equation (4) is the zerocentered sum of (5). μ_{s}(l−k) is the mean value of (5) over all submatrices of length l−k along the diagonal of A, and can be precomputed for a given A. We disallow domains where there are fewer than 100 submatrices available to compute the mean. By doing this, we are only excluding domains of size larger than n−100 fragments, which in practice means that we are disallowing domains that are hundreds of megabases long. Values for the numerator in (5) are also precomputed using an efficient algorithm [14], resulting in an overall runtime of O(n^{2}) to compute OPT_{1}(n).
Enumerating multiple optimal and nearoptimal solutions
The set of domains found by the dynamic program in Equation 3 may not be the only set obtaining the maximum value of OPT_{1}(·). In fact, there may be multiple optimal solutions and solutions which are near optimal. The domain structures that appear in alternative optimal or near optimal solutions are of interest, especially if they are significantly different, since they represent a potentially diverse array of alternative domains that are only precluded from the initially computed optimal solution as a result of the arbitrary breaking of ties that takes place in the dynamic program. We wish to be able to account for such alternative solutions by enumerating them efficiently and in order of a decreasing solution score.
Since Equation 3 will allow ‘nondomains’ (i.e. intervals on the chromosome with q(k,l,γ)≤0) to be split arbitrarily without affecting the optimal score, we modified the procedure as shown in Equation 6 to explicitly disallow adjacent nondomains:
where the optimal score of l ending a domain is
and the quality function for the domain is
for l ∈ {0,1}. In Equation 6, maxk<lOPT_{D}(k−1) represents the optimal score at l where l ends a nondomain region. This solution to Problem 1 produces a set of domains with the same optimal score as Equation 3, but guarantees that alternative optimal and nearoptimal domain sets do not contain nondomains that are adjacent.
To efficiently identify alternative optimal and nearoptimal solutions, we use the fact that the dynamic program in Equation (6) can be conceptually represented as a directed acyclic graph where each and OPT_{D}(l) is connected by an edge to every other term it depends on: and {OPT_{D}(k)}_{k<l}. For each edge e=(k,l) in , the weight of e is q^{′}(k,l,γ). Thus, finding a set of domains with an optimal score is equivalent to finding a highestweight path in starting from the node representing . To find the topK solutions, we then find the K highest weight paths in using a standard procedure [15].
Obtaining a consensus set of persistent domains across resolutions
For objective (2), we use the procedure above to construct a set . is a set of overlapping intervals or domains, each with a quality score defined by its persistence p across resolutions. To extract a set of highly persistent, nonoverlapping domains from , we reduce problem 2 to the weighted interval scheduling problem [16], where competing requests to reserve a resource in time are resolved by finding the highestpriority set of nonconflicting requests. To find a consensus set of domains, we map a request associated with an interval of time to a domain and its corresponding interval on the chromosome. The priority of a request maps to a domain’s persistence p across length scales.
The algorithm to solve problem 2 is then:
where OPT_{2}(j) is the optimal nonoverlapping set of domains for the jth domain in a list of domains sorted by their endpoints (OPT_{2}(0)=0), and c(j) is the closest domain before j that does not overlap with j. The first and second terms in (9) correspond to either choosing or not choosing domain j respectively. We precompute a domain’s persistence p as:
Equation (10) is therefore a count of how often domain i appears across all resolutions in Γ for domain sets identified by the dynamic program at a single resolution. It may be desirable to treat multiple highly overlapping, nonequivalent domains as a single domain, however, we conservatively identify exact repetitions of a domain across resolutions since this setting serves as a lower bound on the persistence of the domain. If , then precomputing persistence takes O(mΓ) time, and c(j) is precomputed after sorting the intervals by their endpoints. The limiting factor when computing OPT_{2}(m) is the time to compute c(j), which is of order m logm. Thus, the overall algorithm runs in O(m logm+(n^{2}+m)Γ) time taking into account an additional O(n^{2}Γ) time for computing .
Results
We used chromatin conformation capture data from Dixon et al. [5] for human fibroblast and mouse embryonic cells. The 3C contact matrices were already aggregated at fragment size 40kb and were corrected for experimental bias according to [17]. We compared our multiscale domains and consensus sets against the domains generated by Dixon et al. for the corresponding cell type and species. For human fibroblast cells, we used CTCF binding sites from [18]. For mouse embryonic cell CTCF binding sites and chromatin modification marks, we used data by Shen et al. [19].
Ability to identify densely interacting domains across scales
Multiresolution domains successfully capture high frequency interactions and leave interactions of lower mean frequency outside of the domains. We compute the mean interaction frequency for all intra and interdomain interactions at various genomic lengths and plot the distribution of means for multiple resolutions (Figure 2(a)). The mean intradomain interaction frequency (blue) is consistently higher (up to two times) than the mean frequency for interactions that cross domains (red). Compared to the domains reported by Dixon et al., our domains tend to aggregate interactions of higher mean frequency, especially at larger γ. The distribution of mean intradomain frequencies for Dixon et al. is skewed more to the left than that of the multiscale domains (Figure 2(b)). This difference can be partially explained by the fact that multiscale domains on average are smaller in size (μ=0.2Mb, σ=1.2Mb) than domains reported by Dixon et al. (μ=1.2Mb, σ=0.9Mb).
Figure 2. Our approach identifies densely interacting domains across scales.(a) Our algorithm discovers domains with mean frequency value for inter and intradomain interactions (solid lines) at or better than that of Dixon et al. domains (dotted lines). Each solid line represents domains at different resolution γ in human fibroblast cells. (b) Multiscale domains identified in human fibroblast cells by our dynamic program tend to have higher mean frequency than those of Dixon et al. (distributions are plotted after outliers >μ+4σ were removed).
Domain persistence across scales
Domain sets across resolutions share significant similarities, even as the distribution of domains and their sizes begin to change (Figure 3). The patterns of similarity are particularly obvious if we plot the domains at various resolutions (Figure 4(a)): many domains identified by our algorithm persist at several resolutions and are aggregated into larger domains at smaller γ, suggesting a hierarchical domain structure. The stability of these domains across resolutions indicates that the underlying chromosomal structure is dense within these domains and that these domains interact with the rest of the chromosome at a much lower frequency.
Figure 3. Domain sizes and count across resolutions. The domain sizes increase and the domain count decreases as the resolution parameter drops. Above: plotted are maximum (red), average (blue), and minimum (green) domain size averaged over all chromosomes for the domains on human fibroblast cells (IMR90). The magenta line shows the average domain size for domains reported by Dixon et al. Below: the number of domains increases for higher values of resolution parameter. The magenta line displays domain count for Dixon et al.
Figure 4. Domain persistence across scales.(a) Domains identified by our algorithm (black) are smaller at higher resolutions and merge to form larger domains at γ close to 0. Visual inspection shows qualitative differences between consensus domains (red) and domains reported by Dixon et al. (green). Data shown for the first 4Mb of chromosome 1. (b) Variation of information for domains identified by our algorithm across different resolutions for chromosome 1 in human fibroblast cells.
A pairwise comparison of domain configurations displays regions of stability across multiple resolutions (Figure 4(b)). We use the variation of information (VI) [20], a metric for comparing two sets of clusters, to compute the distance between two sets of domains. To capture the similarities between two domain sets D and D^{′} and the interdomain regions induced by the domains, we construct new derivate sets C and C^{′} where C contains all domains d∈D as well as nondomain regions (C^{′} is computed similarly). To compute entropy , we define the probability of seeing each interval c_{i}= [ a_{i},b_{i}] in C as p_{i}=(b_{i}−a_{i})/L where L is the length of the chromosome. When computing the mutual information between two sets of intervals C and C^{′}, we define the joint probability p_{ij} to be [ a_{i},b_{i}]∩[ a_{j},b_{j}]/L.
We then compute variation of information on these two new sets: VI(C,C^{′})=H(C)+H(C^{′})−2I(C,C^{′}). Chromosome 1, for example, has three visually pronounced groups of resolutions within which domain sets tend to be more similar than across (γ= [0.000.20], [0.250.70], and [0.751.00] — see Figure 4(b)).
Comparison with the previously identified set of domains in Dixon et al
At higher resolutions, domains identified by our algorithm are smaller than those reported by Dixon et al. (Figure 3). As the resolution parameter decreases to 0.0, the average size of the domains increases. The composition of the domains we identify is different from that of Dixon et al. as illustrated in Figure 4(a) and captured by the variation of information in Figure 4(b).
We use the consensus domains algorithm to obtain a consensus set of domains D_{c} persistent across resolutions. We construct the set Γ by defining the range of our scale parameter to be [0,γ_{max}] and incrementing γ in steps of 0.05. In order to more directly compare with previous results, we set γ_{max}=0.5 for human and 0.25 for mouse since these are the scales at which the maximum domain sizes in Dixon et al.’s sets match the maximum domain sizes in our sets.
Our consensus domain set agrees with the Dixon et al. domains better than with a randomized set of domains adhering to the same domain and nondomain length distributions (Figure 5 and [21]). Comparing to a set of random domains also helps to verify that our observations are due to the observed sequence of domains and not the distribution of domain lengths. To shuffle Dixon’s domains, we record the length of every domain and nondomain region, and then shuffle these lengths to obtain a randomized order of domains and nondomains across the chromosome. The fact that variation of information is lower between consensus domains and domains reported by Dixon et al. demonstrates that, though the approaches find substantially different sets of topological domains, they still agree significantly more than one would expect by chance.
Figure 5. Comparison of Dixon et al.’s domain set with the multiscale consensus set for chromosomes 1–22 (xaxis). We used the variation of information (VI) (yaxis) to compute distances between domain sets for the multiscale consensus set vs. Dixon et al. (blue dots) and the multiscale consensus vs. randomly shuffled domains (red diamonds).
Enrichment of CTCF and histone modifications near boundaries
We assess the enrichment of transcription factor CTCF and histone modifications H3K4me3 and H3K27AC within the interdomain regions induced by the consensus domains. These enrichments provide evidence that the boundary regions between topological domains correlate with genomic regions that act as insulators and barriers, suggesting that the topological domains may play a role in controlling transcription in mammalian genomes [5].
Figure 6 illustrates the enrichment of insulator or barrierlike elements in domain boundaries in both the human fibroblast (IMR90) and mouse embryonic stem cell (mESC) lines. Specifically, we observe that the boundaries between consensus domains are significantly enriched for all of the transcription factors and histone marks we consider. In certain cases — specifically in the case of CTCF — we notice that the CTCF binding signals peak more sharply in the boundaries between the domains we discover than in the boundaries between the domains of Dixon et al.
Figure 6. Enrichment for chromatin marks and histone modifications in domain boundaries. Enrichment of CTCF binding (a) in IMR90 and (b) in mESC and histone modifications (c), (d) in mESC around domain boundaries for our consensus set of persistent domains (left, blue), and for those identified by Dixon et al. (right, blue). Green lines represent the presence of CTCF at the midpoint of the topological domains.
We also observe that, when compared with the domain boundaries predicted by Dixon et al., our boundaries more often contain insulator or barrierlike elements (see Table 1). Specifically, we normalize for the fact that we identify approximately twice as many domains as Dixon et al., and generally observe a twofold enrichment in the fraction of boundaries containing peaks for CTCF markers. This suggests that structural boundaries identified by our method are more closely tied to functional sites which serve as barriers to longrange regulation. We also observe a depletion of insulator CTCF elements within our domains when compared to the domains of Dixon et al. This observation is consistent with the assumption that transcriptional regulation is more active within spatially proximate domains since there are fewer elements blocking regulation within these domains. Table 1 also shows similar patterns for histone modifications which suggests that our domain boundaries are enriched for functional markers of gene regulation.
Table 1. Chromatin marks and histone modification enrichments within and between domains
Multiple optimal solutions across scales reveal the hierarchical organization of topological domains
It has been recently hypothesized that chromatin is packed into the nucleus in a hierarchical manner suggesting that smaller, spatially compact domains combine to form larger superdomains that may be functionally similar [2,3,6]. This hypothesis is partially motivated by the fact that the distribution of 3C interaction frequencies better matches a fractal globule model of chromatin organization than an essentially random equilibrium organization of chromatin in the nucleus [22] and by an initial exploration of the hierarchical organization of the Drosophilla genome [6]. By combining alternative optimal and nearoptimal domains across scales, we quantitatively determine the extent to which domains at different γ conform to a hierarchical structure empirically identifiable in Figures 4(a) and 7.
Figure 7. Domain sets at various resolutions. 10 best optimal and nearoptimal solutions for resolutions γ=0.5,0.35,0.15,0.10 for a portion of human fibroblast chromosome 20 (IMR90). Variations in the domain assignments within a single γ and across resolutions correspond with visually identifiable, hierarchical regions of dense HiC interactions. All histone mark tracks were obtained from IMR90 cells. Plotted with WashU EpiGenome Browser [23].
We determine the extent to which all identified optimal and nearoptimal topological domains are hierarchically organized by combining alternative optimal and nearoptimal domains and computing a score characterizing the hierarchy. Specifically, we combine all nearoptimal domains across all resolutions into a single set: where is the ith optimal solution at resolution γ and K total solutions are found at each resolution. We quantify the extent to which domains in this set are nested by determining the fraction of sufficiently different domain pairs {d_{i},d_{j}} where either d_{i} or d_{j} is completely contained in the other:
and contains all pairs of domains {d_{i},d_{j}} from domains in such that α=d_{i}Δd_{j}/d_{i}∪d_{j} — a fraction of genomic fragments different between two domains d_{i} and d_{j} in relation to the union of all fragments comprising the two domains — is greater than a userspecified value. For our tests, we define two domains to be different if more than 10% of their fragments differ (α=0.1). If no domain is contained fully in any other domain the score h(·)=0. If, for every pair of domains, one of the domains is fully contained in the other, the score attains its maximum value h(·)=1. We empirically observe that randomly generated domains result in h(·)≈0.5.
To determine whether the set of all identified domains we observe is significantly more hierarchical than expected by chance, we randomly shuffle domains while maintaining the same domain and nondomain length distributions as the sets of domains we find [21]. At each resolution, we identify the K=10 optimal and nearoptimal solutions for all chromosomes in human fibroblast cell line (IMR90) as well as mouse embryonic cells (mESC). The choice of K=10 is computationally beneficial given that even for such low K, the score for the next optimal solution drops off fast at lower γ, but for γ=0.5 the optimal score only changes by 0.02% (from 16774.7 to 16771.2) after 50000 solutions are considered. Alternatively, a weaker null hypothesis could be constructed that uses randomly shuffled HiC matrix. However, this approach does not control for the distribution of domain lengths — a previously established property of topological domains [5,6]. In addition, it has recently been shown that randomly shuffled HiC matrices lack a clear domain structure since they exhibit significantly depleted insulation scores [24]. This weaker null hypothesis is thus not appropriate for determining the significance of hierarchical domain structure. For both organisms, we find that h(·) for the identified set of domains is significantly larger than h(·) for the randomized domains (BenjaminiHochberg corrected P<0.001 over all chromosomes). The mean value of the identified set of domains is ≈0.95 as opposed to ≈0.70 for 1,000 randomized sets of domains sampled from each resolution. Computing h(·) on the combined set of domains is conservative since it is likely that domains from multiple optimal and nearoptimal solutions can overlap but may not be completely contained in one another within a length scale. This suggests that the multiple optimal and nearoptimal domains across scales exhibit a hierarchical structure and that the ensemble of domains can be used as the basis of a more detailed analysis of the hierarchical organization of these genomes.
Discussion and conclusions
In this paper, we introduce an algorithm to identify topological domains in chromatin using interaction matrices from recent highthroughput chromosome conformation capture experiments. Our algorithm produces domains that display much higher interaction frequencies within the domains than inbetween domains (Figure 2) and for which the boundaries between these domains exhibit substantial enrichment for several insulator and barrierlike elements (Figure 6). To identify these domains, we use a multiscale approach that finds domains at various size scales and generates multiple optimal and nearoptimal solutions.
We define a consensus set to be a set of domains that persist across multiple resolutions and give an efficient algorithm that finds such a set optimally.
Our method uses a score function that encodes the quality of putative domains in an intuitive manner based on their local density of interactions. Variations of the scoring function in (4), for example, by median centering rather than mean centering or by optimizing the homogeneity of interaction frequencies instead of total frequencies, can be explored to test the robustness of the enrichments described here.
Our method is particularly appealing in that it requires only a single userspecified parameter γ_{max}. For our experiments, the parameter γ_{max} was set based on the maximum domain sizes observed in Dixon et al.’s experiments so that we could easily compare our domains to theirs. This parameter can also be set intrinsically from properties of the HiC interaction matrices. For example, we observe similar enrichments in both human and mouse when we set γ_{max} to be the smallest γ∈Γ such that the median domain size is >80kbp (two consecutive HiC fragments at a resolution of 40kbp). This is a reasonable assumption since domains consisting of just one or two fragments do not capture higherorder spatial relationships (e.g. triad closure) and interaction frequencies between adjacent fragments are likely large by chance [22].
We compared the fraction of the genome covered by domains identified by Dixon et al. vs. the domains obtained from our method at various resolutions. Dixon et al.’s domains cover 85% of the genome while our sets tend to cover less of the genome (≈ 65% for a resolution that results in the same number of domains as those of Dixon et al.). The fact that our domain boundaries are more enriched for CTCF sites indicates that our smaller, more dense domains may be more desirable from the perspective of genome function. The dense, functionallyenriched domains discovered by our algorithm provide strong evidence that alternative chromatin domains exist and that a single length scale is insufficient to capture the hierarchical and overlapping domain structure visible in heat maps of 3C interaction matrices.
We provided the first quantitative analysis testing the hypothesis that the domain structure across scales is significantly hierarchically organized, suggesting that the domains we identify can be used as the basis for studying the hierarchical organization of genomes and how this structure impacts gene regulation. By incorporating multiple optimal and near optimal solutions into this analysis, we provide evidence that the observed hierarchical structure persists not only across scales but across a variety of plausible highscoring domain sets. However, multiple optimal solutions are not necessary to quantify the hierarchical structure of the domains since single optimal solutions across scales can already reveal a hierarchical structure. There are many more nearoptimal solutions at higher values of γ since the domain sizes tend to be smaller. For this special case, it would be desirable to develop a method that more concisely characterizes these larger solution spaces, and this is an interesting direction for future work. The quantitative evidence of the hierarchical structure of topological domains also motivates the development of novel methods for domain discovery that directly account for such hierarchy in the models they assume and the functions they optimize.
The method for discovering topological domains that we have introduced is practical for existing datasets. Our implementation is able to compute the consensus set of domains for the human fibroblast cell line and extract the consensus set in 24 minutes when run on a personal computer with 2.3GHz Intel Core i5 processor and 8Gb of RAM. Computing optimal and nearoptimal solutions adds only a small overhead to overall running time: when computing 20 top optimal and nearoptimal solutions per each γ setting (with γ 0.00.9 with a step of 0.05) the computation finishes in 25 minutes 34 seconds.
A preliminary version of this manuscript appeared in the 2013 Workshop on Algorithms for Bioinformatics [25].
Availability and requirements
A C++11 implementation of the algorithms and instructions for compilation and use are available at http://www.cs.cmu.edu/~ckingsf/software/armatus/ webcite.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
DF, RP, and GD contributed equally to the development of methods and manuscript. CK contributed to the manuscript. All authors reviewed the manuscript and approved it.
Acknowledgements
This work has been partially funded by National Science Foundation (CCF1256087, CCF1053918, and EF0849899) and National Institutes of Health (R21AI085376, R01HG007104). C.K. received support as an Alfred P. Sloan Research Fellow. D.F. is a predoctoral trainee supported by NIH T32 training grant T32 EB009403 as part of the HHMINIBIB Interfaces Initiative.
References

de Wit E, de Laat W: A decade of 3C technologies: insights into nuclear organization.
Genes Dev 2012, 26:1124. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gibcus JH, Dekker J: The hierarchy of the 3D genome.
Mol Cell 2013, 49(5):773782. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cavalli G, Misteli T: Functional implications of genome topology.
Nat Struct Mol Biol 2013, 20(3):290299. PubMed Abstract  Publisher Full Text

Fudenberg G, Getz G, Meyerson M, Mirny LA: High order chromatin architecture shapes the landscape of chromosomal alterations in cancer.
Nat Biotechnol 2011, 29(12):110913. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological domains in mammalian genomes identified by analysis of chromatin interactions.
Nature 2012, 485(7398):37680. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, Parrinello H, Tanay A, Cavalli G: Threedimensional folding and functional organization principles of the drosophila genome.
Cell 2012, 148(3):458472. PubMed Abstract  Publisher Full Text

Hou C, Li L, Qin ZS, Corces VG: Gene density, transcription, and insulators contribute to the partition of the Drosophila genome into physical domains.
Mol Cell 2012, 48(3):47184. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Kölbl AC, Weigl D, Mulaw M, Thormeyer T, Bohlander SK, Cremer T, Dietzel S: The radial nuclear positioning of genes correlates with features of megabasesized chromatin domains.
Chromosome Res 2012, 20(6):73552. PubMed Abstract  Publisher Full Text

Lin YC, Benner C, Mansson R, Heinz S, Miyazaki K, Miyazaki M, Chandra V, Bossen C, Glass CK, Murre C: Global changes in the nuclear positioning of genes and intra and interdomain genomic interactions that orchestrate B cell fate.
Nat Immunol 2012, 13(12):1196204. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bickmore WA, van Steensel B: Genome Architecture: domain organization of interphase chromosomes.
Cell 2013, 152(6):12701284. PubMed Abstract  Publisher Full Text

Tanay A, Cavalli G: Chromosomal domains: epigenetic contexts and functional implications of genomic compartmentalization.
Curr Opin Genet Dev 2013, 23(2):197203. PubMed Abstract  Publisher Full Text

Finding a maximum density subgraph.
Tech. Rep. 171, University of California, Berkeley, CA 1984

Schaeffer SE: Graph clustering.
Comput Sci Rev 2007, 1:2764. Publisher Full Text

Filippova D, Gadani A, Kingsford C: Coral: an integrated suite of visualizations for comparing clusterings.
BMC Bioinformatics 2012, 13:276. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Huang L, Chiang D: Better kbest parsing. In Proceedings of the Ninth International Workshop on Parsing Technology. Stroudsburg, PA, USA: Association for Computational Linguistics; 2005:5364.

Kleinberg J, Tardos E: Algorithm Design. Boston. MA: AddisonWesley; 2005.

Yaffe E, Tanay A: Probabilistic modeling of HiC contact maps eliminates systematic biases to characterize global chromosomal architecture.
Nat Genet 2011, 43(11):10591065. PubMed Abstract  Publisher Full Text

Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B: Analysis of the vertebrate insulator protein CTCFbinding sites in the human genome.
Cell 2007, 128(6):12311245. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shen Y, Yue F, McCleary DF, Ye Z, Edsall L, Kuan S, Wagner U, Dixon J, Lee L, Lobanenkov VV, Ren B: A map of the cisregulatory sequences in the mouse genome.
Nature 2012, 488:116120. PubMed Abstract  Publisher Full Text

Meilă M: Comparing clusterings by the variation of information.

Duggal G, Wang H, Kingsford C: Higherorder chromatin domains link eQTLs with the expression of faraway genes.

LiebermanAiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J: Comprehensive mapping of longrange interactions reveals folding principles of the human genome.
Science 2009, 326(5950):289293. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhou X, Lowdon RF, Li D, Lawson HA, Madden PA, Costello JF, Wang T: Exploring longrange genome interactions using the WashU Epigenome Browser.
Nat Methods 2013, 10(5):375376. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, Laue ED, Tanay A, Fraser P: Singlecell HiC reveals celltocell variability in chromosome structure.
Nature 2013, 502(7469):5964. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Filippova D, Patro R, Duggal G, Kingsford C: Multiscale Identification of Topological Domains in Chromatin. In Proceedings of 13th Workshop on Algorithms in Bioinformatics (WABI), Volume 8126. Heidelberg, Germany; 2013:3003012.