Abstract
Background
Chromosome structure is closely related to chromosome function, and Chromosome Conformation Capture (3C) is a widely used technique for exploring spatial properties of chromosomes. 3C interaction frequencies are usually associated with spatial distances. However, the raw data from 3C experiments is an aggregation of interactions from many cells, and the spatial distances of any given interaction are uncertain.
Results
We introduce a new method for filtering 3C interactions that selects subsets of interactions that obey metric constraints of various strictness. We demonstrate that, although the problem is computationally hard, near-optimal results are often attainable in practice using well-designed heuristics and approximation algorithms. Further, we show that, compared with a standard technique, this metric filtering approach leads to (a) subgraphs with higher statistical significance, (b) lower embedding error, (c) lower sensitivity to initial conditions of the embedding algorithm, and (d) structures with better agreement with light microscopy measurements. Our filtering scheme is applicable for a strict frequency-to-distance mapping and a more relaxed mapping from frequency to a range of distances.
Conclusions
Our filtering method for 3C data considers both metric consistency and statistical confidence simultaneously, resulting in lower-error embeddings that are biologically more plausible.
Keywords:
3C; Chromosome conformation capture; Metric violations; Triangle inequality; Graph embedding
Background
Chromosome conformation capture (3C) is an experimental technique designed to observe how the genome folds in a cell [1]. Measurements from 3C experiments have been used to construct three-dimensional models of chromosomes at a higher resolution than what is possible with light microscopy [2], and these models are correlated with long-range regulation [3], chromatin accessibility [4], as well as cancer-related genome alterations [5]. Since its introduction, the 3C technique has become widely adopted and has been applied to bacterial, yeast, fruit fly, and human genomes [3,6-13].
Measured interactions between genomic locations are aggregated into a chromosome conformation graph. The frequency of an interaction between a particular pair of genomic locations in the assayed population of cells can be converted to a distance, and this mapping allows the graph to be embedded in three dimensions. Before embedding, interactions in 3C graphs are usually filtered so that only interactions with unusually high frequencies given their genomic distance are kept. For example, contact matrices normalize the observed frequency of an interaction within a chromosome by the expected frequency within an entire genome (e.g. [4,14]) while others more explicitly model the distribution of interaction frequencies (e.g. [3,6]). In this sense, traditional statistical filtering methods retain high-confidence interactions.
However, because 3C measurements are aggregated over millions of cells, the distances associated with these high-confidence interactions are often metrically inconsistent. For example, among 2,257,241,015 triplets of measurements that form triangles in the yeast 3C data of Duan et al. [6], 679,480,886 (30%) do not satisfy the triangle inequality. These inconsistencies make it difficult to reason about conformational properties of the genome. Further, existing filtering procedures do not use relationships between the edges to, for example, discard high-confidence edges that are apparently inconsistent with many lower-confidence edges, or to include seemingly low-confidence edges that are nonetheless consistent with many others.
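The fraction of metrically inconsistent triplets is easy to audit directly. The following brute-force sketch (function and variable names are ours, not from the paper) counts how many triangles in a distance-annotated interaction graph fail the triangle inequality:

```python
from itertools import combinations

def count_violated_triangles(dist):
    """Count triangles whose longest side exceeds the sum of the other two.

    `dist` maps a frozenset({u, v}) to the measured spatial distance d(u, v).
    Only triplets for which all three pairwise measurements exist form triangles.
    """
    nodes = sorted({x for e in dist for x in e})
    total = violated = 0
    for u, v, w in combinations(nodes, 3):
        edges = [frozenset((u, v)), frozenset((v, w)), frozenset((u, w))]
        if not all(e in dist for e in edges):
            continue
        total += 1
        a, b, c = sorted(dist[e] for e in edges)
        if a + b < c:  # triangle inequality fails
            violated += 1
    return violated, total
```

On real 3C data one would enumerate only triangles that actually occur in the sparse graph rather than all vertex triplets, but the consistency test per triangle is the same.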
To address these shortcomings, we introduce the idea of metric filtering where we seek a high-confidence, metrically consistent subset of the 3C graph. We frame the procedure as a family of optimization problems where we want to find a subgraph of high total weight (confidence) such that the set of chosen edges satisfies metric constraints of various stringency. We show that this family of problems, like embeddability in [15], is NP-hard, and provide four algorithms for the approximate solution of the least and most stringent versions.
We apply the metric filtering algorithms to 4C measurements for budding yeast [6] and show that the heuristics are able to find near-optimal solutions to the variant of the problem where only triangles are consistent. Despite the additional metric constraints, the selected set of edges is often of higher total confidence than the data set considered in Duan et al. [6] that had an estimated 1% false discovery rate (FDR).
We show that embeddings based on these filterings have lower embedding error than those based on an existing filtering technique [6]. The structures also exhibit lower variation when different initial conditions are chosen for a previously proposed nonlinear optimization embedding technique. Finally, we provide anecdotal evidence that the structure resulting from the metrically filtered interactions is in better agreement with known biology than the structure derived using standard filtering techniques. The improved agreement is a result of the metric filtering being able to include longer-distance, but lower-confidence, interactions.
Problem definition
Problem 1 (Consistent-k-Paths)
Given an integer k≥2, and a graph G=(V,E), where each edge e∈E is associated with a non-negative length d(e) and a positive reward r(e), find a subset S⊆E of edges that maximizes ∑_{e∈S}r(e) and such that, for all e∈S and for any path p in G of k or fewer edges in S joining the endpoints of e, the following condition holds:

d(e) ≤ ∑_{e′∈p}d(e′).     (1)
In other words, we seek the highest total reward subgraph where the length of every chosen edge is no longer than any path of k or fewer hops joining the endpoints of that edge. If an edge satisfies condition (1) for a given k, we say it is k-consistent, or simply consistent if k is clear from the context. If the edge is not consistent, it is violated.
Consistent-k-Paths is a family of problems parameterized by k. The value of k allows the strictness of the metric condition to be varied. To get a sense of how stringent the filterings are relative to one another, we focus on the two extreme cases of k=2 and k=|V|−1. The strictest condition is k=|V|−1, when every alternative path must be at least as long as the direct edge connecting the endpoints of the path, while the most lenient condition is k=2, where consistency is only enforced for triangles. Because of their importance, we give names to these two special cases.
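For intuition, k-consistency of a single edge can be checked with a Bellman-Ford-style search limited to k relaxation rounds, which finds the shortest alternative path using at most k hops. This is an illustrative sketch, not the paper's implementation; names are ours:

```python
def k_consistent(edges, d, e, k):
    """Return True if edge e = (u, v) is k-consistent: every alternative
    path of k or fewer hops between its endpoints is at least as long as
    d[e]. `edges` is a set of undirected edges (tuples); `d` maps edges
    to lengths."""
    u, v = e
    INF = float("inf")
    # Bellman-Ford limited to k relaxation rounds, with edge e itself removed:
    # after round i, best[t] is the shortest path from u to t using <= i edges.
    best = {u: 0.0}
    for _ in range(k):
        nxt = dict(best)
        for (a, b) in edges:
            if {a, b} == {u, v}:
                continue  # exclude the direct edge being tested
            w = d[(a, b)]
            for s, t in ((a, b), (b, a)):
                if s in best and best[s] + w < nxt.get(t, INF):
                    nxt[t] = best[s] + w
        best = nxt
    return best.get(v, INF) >= d[e]
```

For k = |V|−1 this test reduces to comparing d(e) with the ordinary shortest-path distance between the endpoints of e.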
Definition 1 (ConsT)
ConsT is an instance of Consistent-k-Paths with k=2, i.e. every triangle must satisfy inequality (1).
Definition 2 (ConsP)
ConsP is an instance of Consistent-k-Paths with k=|V|−1, implying that all paths are consistent.
The ConsT formulation (and any formulation with k<|V|−1) is motivated by the fact that each measured distance is associated with some uncertainty that propagates when summing distances over longer paths. The ConsP property is stricter, requiring all paths to be consistent, and thus suffers from this propagation of errors as distances are summed over long paths.
NP-Hardness of Consistent-k-Paths
Theorem 1
Consistent-k-Paths is NP-hard for k>1.
Proof
Reduction from Independent Set: given an integer ℓ and a graph G=(V,E), construct a graph H=(V∪{u},E∪E′) as follows. Let d(e)=r(e)=3 for all e∈E. Create a new vertex u, and a new set of edges E′={{u,v}∣v∈V}. Set d(e)=r(e)=1 for all e∈E′. We show that G has an independent set of size ≥ℓ ⇔ H has a solution of total reward R≥3|E|+ℓ. Note that a violating path in H contains exactly 2 edges and, along with the violated edge, forms a triangle. This is because every edge in H has d≥1, so that every path containing 3 or more edges will have a total length d≥3. Such a path is at least as long as any edge in H, and hence can violate no edge. It follows that this reduction applies for all k≥2.
⇒ Let S⊆V be an independent set of size ≥ℓ. Choose all edges of E and the edges E_{S}={{u,w}∣w∈S}. The total reward of this set is R=3|E|+|S|≥3|E|+ℓ. Since all of the edges in E have d=3, all 2-hop paths formed by these edges have length d=6 and do not violate eq. (1). Further, since S is an independent set, no triangle involving u is selected. Therefore, the graph induced by the selected set of edges, E∪E_{S}, is consistent.
⇐ Assume E^{∗} is a solution to the Consistent-k-Paths problem with R≥3|E|+ℓ. First, note that no triangle {u,w},{w,v},{v,u} can be selected, since this would violate {v,w} because d(v,u)+d(u,w)<d(v,w). Due to the following argument, we may assume, without loss of generality, that all edges of G are chosen: suppose a pair of edges {u,v} and {u,w} was chosen and the edge {v,w} exists in E but was not chosen. Then we can remove {u,v} and add {v,w}. This is still a solution with reward ≥3|E|+ℓ, since the swap only increases the value of the solution. Repeated application of this will produce a solution of reward ≥3|E|+ℓ that includes all of E.
In the transformed solution, the edges of E contribute a reward of 3|E|. Further, to avoid violating edges, the endpoints of selected edges adjacent to u must form an independent set, and to achieve a total reward R≥3|E|+ℓ, there must have been an independent set of size ≥ℓ. □
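The instance transformation itself is mechanical. A sketch of the construction from Theorem 1 (vertex and function names are ours):

```python
def independent_set_to_consistent_k_paths(V, E):
    """Build the reduction instance H of Theorem 1 from an Independent Set
    instance G = (V, E): every original edge gets d = r = 3, and every new
    edge {u*, v} to the auxiliary vertex u* gets d = r = 1."""
    u_star = "u*"  # the auxiliary vertex added by the reduction
    d, r = {}, {}
    for e in E:
        d[frozenset(e)] = 3
        r[frozenset(e)] = 3
    for v in V:
        d[frozenset((u_star, v))] = 1
        r[frozenset((u_star, v))] = 1
    return u_star, d, r
```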
Corollary 1
Consistent-k-Paths restricted to r(e)=1 for all e∈E is NP-hard for k>1.
Proof
Apply the same reduction as above, except that r(e)=1 instead of 3 for all e∈E, and use a total reward threshold of R≥|E|+ℓ instead of 3|E|+ℓ. In the new formulation, E^{∗} might include two edges {u,w} and {u,v} without picking {v,w}∈E if it exists. However, any such solution can be transformed into one of the form above with equal cost by removing {u,w} and adding {v,w}. This neither changes the number of edges nor decreases the total reward of E^{∗}, and the proof can proceed as in Theorem 1. □
Algorithms
Since ConsT and ConsP are NP-hard, it is unlikely that there exist algorithms that solve these problems in polynomial time. Thus, we have developed several approximation algorithms and heuristics to tackle them in practice. We present five algorithms below; the first three apply to the ConsT problem while the latter two apply to the ConsP problem.
A set-cover-based algorithm
We formulate ConsT as a minimum-weight set cover problem by removing the lowest-weight set of edges that restores consistency, which therefore maximizes the weight of the remaining graph (the complement of the original problem). Let T be the set of violated triangles in G (where a triangle is violated if it does not obey the triangle inequality). For an edge {u,v} in E, let S_{uv} be the subset of triangles in T that contain {u,v}, and let C={S_{uv} for all {u,v}∈E}. Define the cost c(S_{uv})=r({u,v}). We then seek the smallest-weight collection of sets S_{uv} that covers all the violated triangles in T, a direct application of minimum-weight set cover. Removing the edges corresponding to each chosen subset S_{uv} will resolve all of the violated triangles. This problem can be approximated using either an LP relaxation or a greedy algorithm [16]. Note that, since each violated triangle belongs to at most 3 sets of the collection C, there is an algorithm that finds a solution to this SETCOVER instance with a cost no more than 3 times OPT [16]. For the experiments described here, we use the greedy algorithm. There exist exact algorithms for the related hitting set problem [17], but these are only efficient when the number of edges that need to be removed is small, which is not what we observe in the 3C data we analyze.
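A minimal version of the greedy weighted set-cover step might look as follows. This is a brute-force sketch for small graphs (names and data layout are ours, not the paper's code): it repeatedly removes the edge with the lowest reward per still-violated triangle covered until no violated triangle remains.

```python
from itertools import combinations

def setcover_filter(edges, d, r):
    """Greedy weighted set cover for ConsT. Edges are frozensets of two
    vertices; `d` and `r` map edges to lengths and rewards. Returns the
    kept edge set, which contains no violated triangle."""
    nodes = {x for e in edges for x in e}

    def tri_edges(t):
        return [frozenset(p) for p in combinations(t, 2)]

    # Enumerate all violated triangles in the input graph.
    violated = set()
    for t in combinations(sorted(nodes), 3):
        es = tri_edges(t)
        if all(e in edges for e in es):
            a, b, c = sorted(d[e] for e in es)
            if a + b < c:
                violated.add(t)

    kept = set(edges)
    while violated:
        # Cost-effectiveness: reward lost per violated triangle covered.
        def score(e):
            cov = sum(1 for t in violated if e in tri_edges(t))
            return r[e] / cov if cov else float("inf")
        e_best = min(kept, key=score)
        kept.discard(e_best)
        violated = {t for t in violated if e_best not in tri_edges(t)}
    return kept
```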
A hierarchical maximum cut approach
Another approach to solve ConsT uses a solution to the MAXCUT problem to find a maximum weight (i.e. maximum total reward) cutset E′ separating the vertices of G into V_{1} and V_{2}. Because E′ is bipartite, it will have no triangles, and thus, no violated triangles. The LOCALCUT algorithm [18] guarantees that E′ has at least 1/2 the total reward of G. Therefore, this algorithm is a 1/2-approximation to ConsT. We add all edges in E′ to the growing solution set E^{∗}. Then, for every pair of edges {u,v},{u,w} in E′, if there is an edge {v,w}∈E∖E′ that forms a violated triangle, we remove {v,w} from G, and we recursively apply this procedure to the two partitions induced by the maximum cut. Because the subgraphs induced by V_{1} and V_{2} only contain the set of edges that form non-violating triangles with the edges in E′, the constructed solution contains no triangle violations.
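The recursion can be sketched as follows, with a simple local-improvement cut standing in for the LOCALCUT algorithm of [18] (so this is illustrative only, and names are ours):

```python
import random

def hierarchical_maxcut_filter(nodes, edges, d, r, seed=0):
    """Recursive MAX-CUT heuristic for ConsT. Edges are frozensets of two
    vertices; `d` and `r` map edges to lengths and rewards."""
    rng = random.Random(seed)

    def local_maxcut(ns, es):
        # Flip vertices across the cut while the total cut reward improves.
        side = {v: rng.random() < 0.5 for v in ns}
        improved = True
        while improved:
            improved = False
            for v in ns:
                gain = 0.0
                for e in es:
                    if v in e:
                        a, b = tuple(e)
                        gain += r[e] if side[a] == side[b] else -r[e]
                if gain > 0:
                    side[v] = not side[v]
                    improved = True
        return side

    def solve(ns, es):
        if len(ns) < 2 or not es:
            return set()
        side = local_maxcut(ns, es)
        cut = {e for e in es if len({side[v] for v in e}) == 2}
        rest = es - cut
        # Remove non-cut edges that close a violated triangle with two cut edges.
        for e in set(rest):
            u, v = tuple(e)
            for w in ns:
                e1, e2 = frozenset((u, w)), frozenset((w, v))
                if e1 in cut and e2 in cut:
                    a, b, c = sorted((d[e], d[e1], d[e2]))
                    if a + b < c:
                        rest.discard(e)
                        break
        ns1 = {v for v in ns if side[v]}
        ns2 = ns - ns1
        return (cut
                | solve(ns1, {e for e in rest if e <= ns1})
                | solve(ns2, {e for e in rest if e <= ns2}))

    return solve(set(nodes), set(edges))
```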
Integer linear program
The ConsT problem can be expressed succinctly as an integer linear program (ILP), mirroring a standard ILP for set cover:

maximize ∑_{e∈E} r(e)·x_{e}     (2)
subject to x_{e₁}+x_{e₂}+x_{e₃} ≤ 2 for every triangle {e₁,e₂,e₃}∈T
x_{e} ∈ {0,1} for all e∈E     (3)

where T is the set of metrically inconsistent triangles in G. Of course, (2) is computationally difficult to solve. However, by relaxing the condition in eq. (3) to 0≤x_{e}≤1, we obtain a linear program which can be solved efficiently and whose objective provides an upper bound on the total confidence of the optimal metrically consistent solution.
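Assuming SciPy is available, the LP relaxation can be solved directly. This sketch (names ours) encodes one covering constraint per violated triangle, allowing at most two of its three edges to be kept:

```python
from scipy.optimize import linprog

def consT_lp_upper_bound(edges, rewards, violated_triangles):
    """Upper bound on the optimal ConsT objective via the LP relaxation:
    maximize total reward with 0 <= x_e <= 1 and x_e1 + x_e2 + x_e3 <= 2
    for every violated triangle (e1, e2, e3)."""
    if not violated_triangles:
        return sum(rewards[e] for e in edges)
    idx = {e: i for i, e in enumerate(edges)}
    c = [-rewards[e] for e in edges]  # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for tri in violated_triangles:
        row = [0.0] * len(edges)
        for e in tri:
            row[idx[e]] = 1.0
        A_ub.append(row)
        b_ub.append(2.0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * len(edges))
    return -res.fun
```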
Taking the union of shortest paths
Let P_{uv} be the set of edges in all shortest paths (according to d) going from node u to node v in G=(V,E). A feasible solution to ConsP is to take the edges in ⋃_{u,v∈V}P_{uv}. We call this the SPUNION heuristic. The intuition behind it is that, by definition, no edge that is part of some shortest path in G can be violated. Assume such an edge {u,v} was violated. Then, there must exist some path p between u and v with d(p)<d(u,v). However, this contradicts the fact that {u,v} belongs to some shortest path, because we could replace {u,v} with p and shorten that path.
Unfortunately, there may be an exponential number of shortest paths in G. However, by removing from E all edges that are not part of some shortest path, we can obtain the desired set of edges without explicitly enumerating all shortest paths. The SPUNION heuristic first computes, for every edge {u,v}∈E, the length of the shortest path between its endpoints, d(sp_{u,v}). Then, the solution is simply given by E^{∗}=E∖{{u,v}∈E∣d(sp_{u,v})<d(u,v)}.
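A compact realization of SPUNION using Floyd-Warshall all-pairs distances (an illustrative sketch for small graphs; names are ours):

```python
def sp_union_filter(nodes, edges, d):
    """SPUNION: keep exactly the edges that lie on some shortest path,
    i.e. drop every edge strictly longer than the shortest path between
    its endpoints. Edges are frozensets; `d` maps edges to lengths."""
    INF = float("inf")
    nodes = list(nodes)
    dist = {(a, b): (0.0 if a == b else INF) for a in nodes for b in nodes}
    for e in edges:
        u, v = tuple(e)
        dist[(u, v)] = dist[(v, u)] = min(dist[(u, v)], d[e])
    # Floyd-Warshall all-pairs shortest paths.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                    dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    # An edge survives iff no strictly shorter path joins its endpoints.
    return {e for e in edges if dist[tuple(e)] >= d[e]}
```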
Maximum spanning tree heuristic
The final heuristic, MSTADD, first constructs a maximum-reward spanning tree T=(V,E_{T}) on G=(V,E) and adds its edges to E^{∗}. This can be computed using any standard maximum-weight spanning tree algorithm. By construction, T has a high total reward. Since it is a tree, it contains no cycles, and hence no violations. We sort the remaining edges E∖E_{T} by their reward and, iterating through them in descending order, add them to E^{∗} if they do not violate any shortest paths in the growing E^{∗}.
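MSTADD can be sketched with Kruskal's algorithm on descending rewards followed by a consistency-checked greedy pass. This is illustrative only; the names and the brute-force Floyd-Warshall consistency check are ours:

```python
def mst_add_filter(nodes, edges, d, r):
    """MSTADD: maximum-reward spanning forest, then greedily add remaining
    edges by descending reward whenever the result stays ConsP-consistent
    (no kept edge longer than any path joining its endpoints)."""
    # Kruskal on descending reward, with union-find for cycle detection.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    chosen, rest = set(), []
    for e in sorted(edges, key=lambda e: -r[e]):
        u, v = tuple(e)
        if find(u) != find(v):
            parent[find(u)] = find(v)
            chosen.add(e)
        else:
            rest.append(e)  # stays sorted by descending reward

    def consistent(es):
        # Check no edge exceeds its shortest-path distance (Floyd-Warshall).
        INF = float("inf")
        ns = list(nodes)
        dist = {(a, b): (0.0 if a == b else INF) for a in ns for b in ns}
        for f in es:
            a, b = tuple(f)
            dist[(a, b)] = dist[(b, a)] = min(dist[(a, b)], d[f])
        for k in ns:
            for i in ns:
                for j in ns:
                    if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                        dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
        return all(dist[tuple(f)] >= d[f] for f in es)

    for e in rest:
        if consistent(chosen | {e}):
            chosen.add(e)
    return chosen
```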
Metric filtering with uncertain distances
In ConsT and ConsP, we assume that every observed frequency f(e) maps to a single spatial distance. However, it is more plausible that an observed frequency maps to a range of distances. In this case, we suppose that there is a range of distances [l(e),u(e)] where l(e) and u(e) represent the lower and upper bounds of all valid distances to which the frequency f(e) of edge e can map.
Given such a range for every edge e, the new condition for metric consistency (previously inequality (1)) becomes:

l(e) ≤ ∑_{e′∈p}u(e′).     (4)

This allows distances assigned to the edges to be “flexible,” and they can be extended or contracted within the range’s bounds to satisfy the new metric consistency condition (4). An edge e is violated only if, for some path of k or fewer hops connecting the endpoints of e, the sum of the upper bounds of the distances for the edges e′ in this path is less than the lower bound of the distance range for e.
All of our algorithms can be modified in a straightforward way to use this relaxed definition of metric consistency. Now, the SETCOVER algorithm need only cover triangles that are violated according to (4). In the MAXCUT algorithm, we need only remove edges from the left and right bipartition that form triangles involving cut edges that are violated under the new definition. When computing the set of shortest paths (SPUNION) and the maximum-reward spanning tree (MSTADD), the shortest path lengths are now computed using u(e′) and, to test whether any edge e is violated, these lengths are compared to l(e).
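For a single triangle, the relaxed violation test reduces to comparing one lower bound against a sum of upper bounds, as in this sketch (names ours):

```python
def relaxed_triangle_violated(l, u, e, e1, e2):
    """Condition (4) specialized to a triangle: edge e is violated by the
    two-hop path (e1, e2) only if even stretching the path edges to their
    upper bounds cannot reach e's lower bound. `l` and `u` map edge keys
    to the lower and upper ends of their distance ranges."""
    return u[e1] + u[e2] < l[e]
```

With l(e) = u(e) = d(e), i.e. ranges of width zero, this reduces exactly to the strict triangle-inequality test of condition (1).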
Computational results
Weights for 3C interactions in budding yeast
We use the measurements from Duan et al. [6], who used a 3C variant called 4C to assay interactions for the entire S. cerevisiae genome during interphase with two experiments based on the HindIII MseI and MspI restriction enzymes. In total, Duan et al. measured 4,097,539 interactions across 4,193 genome fragments (nodes). Each of these interactions e is associated with a frequency f(e) — the number of times it was observed.
Duan et al. process these raw frequency counts to derive several other measures for each interaction. A spatial distance d(e), which we use as the edge length in condition (1), is assigned to every interaction using a frequency-to-distance mapping based on the observed inverse relationship between genomic separation and frequency for intrachromosomal interactions. Such a distance mapping is common to most approaches that seek embeddings of 3C data [3,6,10,19]. Because the distance mapping is based on intrachromosomal interactions, we have more confidence in the spatial distances d(e) for interactions within a chromosome. Therefore, in most of the experiments below, we consider each chromosome individually. Table 1 gives the sizes of the graphs for each chromosome of yeast.
Table 1. Sizes of yeast chromosome conformation graphs
Duan et al. also compute a p-value for every e. Using these p-values, they further derive a “q-value” that accounts for multiple hypothesis testing caused by the large number of edges sampled. See the “Computational methods” of the supplementary material of Duan et al. for a description of how the q-values are computed. We reproduced their p-value and q-value computations and use r(e)=1−q(e) as the reward for including an edge in the solution in condition (1). The value 1−q(e) is a measure of confidence: high values indicate low p-values, which indicate that interactions occur with a frequency that one would not expect by chance.
The input to the filtering procedures is thus the graph G=(V,E) where V is a set of restriction fragments and E is the set of interactions. The distance on an edge e is d(e) and the reward on an edge is the confidence r(e). The goal of metric filtering is to find a subgraph with high confidence (i.e. generally low q-values) with no metric violations as defined by condition (1).
Ability of the algorithms to find high-weight subgraphs
The heuristics of the Algorithms section were tested on each of the 16 yeast chromosomes, and Figure 1 summarizes the algorithms’ ability to find high-confidence solutions. In 15 out of 16 cases, MAXCUT finds a ConsT subgraph with the largest total confidence (Figure 1, green triangles). However, in all 16 chromosomes, the SETCOVER method finds a graph of nearly the same quality, indicating that this method is competitive in terms of its ability to optimize the objective function.
Figure 1. Metric filtering performance on various chromosomes of yeast. The total confidence that each algorithm recovers, normalized by the total number of edges, is plotted. Higher values are better. The objective value of the linear program (blue diamonds) gives an upper bound for both the ConsT and ConsP solutions. We also compare our algorithms to the Duan et al. filtering method at FDR 1% (black circles), which is their largest and highest-confidence filtered interaction set. (A bug in the SPUNION script in the conference version of this paper [20] erroneously led to plotting the sum of edge q-values instead of the sum of confidence. The plots have been updated to correctly plot total confidence with little qualitative difference).
The ConsT subgraphs have similar—and usually higher—total confidence than FDR 1% while eliminating all violated triangles (compare black circles with green triangles and red squares in Figure 1). The set of interactions at FDR 100q′% is the set of edges originally considered by Duan et al. with q<q′. Both SETCOVER and MAXCUT achieve total confidence that is higher than the Duan et al. FDR 1% filtering for all but the smallest chromosomes (1, 3, and 6). Even in those cases, the SETCOVER and MAXCUT solutions are no more than 25% away from the FDR 1% total confidence.
Due to the NP-hardness of the problems, optimal solutions for ConsT and ConsP are difficult to obtain. However, the ConsT problem can be expressed as an integer linear program (2). While this ILP is also difficult to solve, its linear relaxation is solvable in practice and provides a provable upper bound on the value of the optimal solution, shown as blue diamonds in Figure 1. This bound reveals that the SETCOVER and MAXCUT approaches find solutions that are close to optimal. Experiments on all chromosomes achieve total confidence values that are at least 70% of the linear program upper bound, and four cases achieve total confidence of around 90% of the upper bound. Since the LP overestimates the optimal value, it is likely that the heuristics provide solutions that are much closer than 70% of the true optimal solution.
The algorithms for ConsP (MSTADD and SPUNION) find graphs with far fewer edges and far lower total confidence than any of the solutions for ConsT (Figure 1), and they sacrifice a significant proportion of the total confidence to obtain a completely metric subgraph. This is a strong indication of how much more strict the ConsP condition is compared with ConsT. In addition, the SPUNION algorithm performed worse than MSTADD. The severe condition required by ConsP is likely too strict for the noisy 3C data, and ConsT provides a more reasonable tradeoff between avoiding metric violations and keeping a useful fraction of the interactions.
Metric filtering produces lower-error embeddings
The ConsT and ConsP filterings both result in lower-error embeddings than their associated confidence-ranked filterings when embedded using a nonlinear optimization technique. To control for the size of the filterings, we compare a metric filtering with m interactions to an associated set of the m highest-confidence interactions (CRANK). The embedding attempts to place nodes to minimize the sum-squared error between the original distance d(e) and the embedded distance o(e). The SETCOVER filtering of chromosome 1 resulted in a mean sum-squared error of 0.97 across 10 embeddings while CRANK resulted in an error of 1.58. Similarly, MSTADD had an average error of 0.067 while CRANK produced an error of 0.39. The improved performance may be due to the fact that metric violations result in distance contradictions that cannot be resolved by the optimization procedure.
To confirm the hypothesis that metric violations cause increased errors in the embeddings, we systematically reintroduced violated triangles using the following procedure. We choose a triangle {u,v,w} at random. If d(u,v)<d(u,w), then we set d(u,v)=αd(v,w)−d(u,w). Otherwise, we set d(u,w)=αd(v,w)−d(u,v), for some choice of 0<α<1. As the percentage of violated triangles increased, the embedding error increased as well, with average errors of 1.2, 1.4, 1.6, 2.0, 2.3, and 2.8 for 10, 20, 30, 40, 50, and 60% violated triangles, respectively, with α=0.9. Metric filtering therefore has the desirable property of removing the embedding error that results from the existence of violated triangles.
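Our reading of this perturbation, as a sketch (names ours; note that for some triangles the assignment could produce a non-positive length, which we assume the random sampling avoids):

```python
def reintroduce_violation(d, u, v, w, alpha=0.9):
    """Make triangle {u, v, w} violate edge {v, w}: shrink the smaller of
    d(u, v) and d(u, w) so the two-hop path through u sums to
    alpha * d(v, w) < d(v, w). `d` maps frozenset edges to lengths and is
    modified in place."""
    uv, uw, vw = frozenset((u, v)), frozenset((u, w)), frozenset((v, w))
    if d[uv] < d[uw]:
        d[uv] = alpha * d[vw] - d[uw]
    else:
        d[uw] = alpha * d[vw] - d[uv]
    return d
```

After the assignment, d(u,v)+d(u,w)=α·d(v,w)<d(v,w), so the triangle violates edge {v,w} by construction.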
Metric filtering produces low-variance, more biologically plausible embeddings
We embedded the various sets of filtered constraints for the chromosomes using an established nonlinear optimization technique [6,9] that incorporates chromatin packing constraints consistent with known biology in yeast. We obtained ensembles of structures by providing random initial conditions for our implementation of this optimization, a technique previously used to study conformational differences between cancer and healthy genomes [3].
Ensembles of 10 embeddings for SETCOVER on chromosome 1 are shown in Figure 2a, and the ensembles for a non-metric filtering with an equal number of edges (obtained by taking the corresponding number of edges with the highest confidence) are in Figure 2b. We focus on observations for chromosome 1, but have observed similar trends for other chromosomes. For each filtering, the embeddings are aligned to each other using a maximum likelihood superpositioning technique [21].
Figure 2. Filtered graphs embedded in 3D. Superposition of 10 embeddings for both ConsT and CRANK filterings. (a) SETCOVER. (b) CRANK of the same size as the SETCOVER. (c) SETCOVER after removing ≈20% of the lowest-confidence edges.
Lower-variance embeddings
Both ConsT (Figure 2) and ConsP filterings produce ensembles of embeddings that are more homogeneous than those from the associated CRANK sets, as indicated by the superposition of structures in Figure 2. We can quantify the variance of an ensemble by computing the sum of the branch lengths of a minimum spanning tree of a complete graph where the nodes represent the embedded structures and the edge weights are the RMSD between the alignments of pairs of structures. The minimum spanning tree on this graph represents a parsimonious way to describe the variability among embedded structures. The MST-based variabilities of the SETCOVER and MSTADD embeddings of chromosome 1 are 0.17 and 0.0093, respectively, while the MST variabilities of the associated CRANK embeddings are much larger, at 0.26 and 0.32, respectively.
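The MST-based variability score can be computed with Prim's algorithm on the pairwise RMSD matrix, as in this sketch (names ours):

```python
def mst_variability(rmsd):
    """Ensemble variability: total branch length of a minimum spanning
    tree over the complete graph of structures, with pairwise RMSD as
    edge weights. `rmsd[i][j]` is the RMSD between structures i and j."""
    n = len(rmsd)
    in_tree = {0}
    total = 0.0
    # Prim's algorithm: repeatedly attach the closest structure not yet
    # in the tree.
    while len(in_tree) < n:
        w, j = min((rmsd[i][j], j)
                   for i in in_tree
                   for j in range(n) if j not in in_tree)
        total += w
        in_tree.add(j)
    return total
```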
Low variance among the embeddings of metric subgraphs indicates that selecting edges for their metric consistency allows fewer highly different solutions to be found. This is desirable, because we do not want embeddings to be sensitive to the initial conditions of the optimization procedure. Further, because they were taken from a population of cells, the 3C measurements are in fact taken from an ensemble of structures. The large variance among CRANK edges may reflect this fact. In contrast, the metric filtering appears to be selecting subsets of constraints that could plausibly represent a single structure. Hence, metric filtering may be one way to partially deconvolve the populationaveraged measurements.
Biologically plausible embeddings
The ConsT embeddings result in telomere distances that match known microscopy distances better than the associated CRANK set. A recent experiment [22] establishes that the distance between the telomeres of chromosome 1 in budding yeast is often about 1 μm. The embeddings of CRANK in Figure 2b have an average distance of 0.45 μm while the embeddings of SETCOVER (Figure 2a) have an average distance of 0.96 μm, which is a much better match to the experimentally observed value.
Despite having edge sets of the same or larger total confidence, the metric filtering produces very different structures than the CRANK filtering. However, removing the 71 lowest-confidence edges from the ConsT embedding does result in a structure similar to the CRANK filterings (Figure 2c). Thus, it seems these lower-confidence, but metrically consistent, interactions are crucial to obtaining the more distended structures that are more consistent with microscopy experiments.
Analysis of the types of edges kept by metric filtering
For all chromosomes, both ConsP and ConsT keep more low-confidence edges than the CRANK filtering (ConsT shown in Figure 3a). Although, in general, more low-confidence edges are kept, the ConsT filtering of chromosome 1 preserves overall higher-distance interactions than the associated CRANK filtering, with a mean distance of 0.55 compared with the CRANK filtering's mean distance of 0.25 (all distances in μm). Of these interactions, the high-confidence ones in the ConsT filtering (i.e. those above 0.8) have a mean distance of 0.31 while the CRANK filtering has a mean distance of 0.25. This is due to the fact that the interactions in the CRANK filtering are concentrated in a small region of chromosome 1, while the SETCOVER filtering distributes the interactions across the entire chromosome: of the interactions in the CRANK set but not in the ConsT filtering, 76 out of 131 lie between positions 75881 and 130646 of chromosome 1, while the densest region of similar size in the SETCOVER filtering has only 26 interactions. The preservation of larger-distance, higher-confidence interactions in the ConsT filtering is likely what results in the expanded structure where telomere distances are more in line with microscopy experiments. For the ConsP embedding, however, the large disparity in the mean distance of interactions (CRANK: 0.165, MSTADD: 1.69) is due mostly to low-confidence edges. This creates an undesirable structure that contains very little useful information about long-range interactions. This is another indication that the strictness of the ConsP filtering may be too severe compared with the more relaxed and biologically plausible ConsT approach. The inclusion of long-range interactions resulting from lower-confidence edges represents some of the most interesting and desired information obtained from 3C experiments.
CRANK necessarily ignores many of these long-range constraints, while the metric filtering allows the inclusion of both metrically consistent and higher-confidence constraints.
Figure 3. Histograms of genomic distances and interaction confidences for CRANK and ConsT. Distances and confidences for both filtering methods were computed on the intrachromosomal interactions. (a) The ConsT filtering kept more low-confidence edges than CRANK (“FDR Filter” in the legend). (b) The ConsT filtering kept interactions with larger genomic distances than CRANK.
In addition, the ConsT method generally keeps interactions with larger genomic distances than CRANK. The average genomic distance of CRANK is 243.5 kilobases while the average genomic distance of SETCOVER is 285.0 kilobases (Figure 3b). Surprisingly, while ConsP keeps more low-frequency interactions, these tend to be at shorter genomic distances.
Various heuristics result in very different sets of edges
Although the MAXCUT and SETCOVER algorithms aim to optimize the same objective and find subgraphs of approximately the same total weight, they result in very different edge sets (Figure 4). Further, their intersections with the most confident edges are also different: of the 219,483 edges returned by MAXCUT, 53% are among the top 219,483 most confident edges, while 60% of the 86,866 edges returned by SETCOVER are among the most confident edges (Table 2). The differing number of edges in solutions with similar total confidence also indicates that the edges in the MAXCUT solution are of lower average confidence.
The structure of the graphs returned by MAXCUT is also very different from that of those returned by SETCOVER. The MAXCUT solution has very few triangles. For example, on chromosome 1, MAXCUT retains only 27 of the original 10,091 triangles while SETCOVER keeps 495. This difference is somewhat intuitive, since SETCOVER explicitly tries to discard few triangles while MAXCUT explicitly looks for triangle-free (i.e. bipartite) subgraphs. For instances with fewer, higher-weight edges and many triangles, SETCOVER should be preferred. This is likely the scenario that is most applicable to 3C chromosome embedding. Because optimal solutions cannot be found for large instances, it is unclear at this point whether the large variation in the returned edge sets is due to the objective function admitting many solutions or whether, if optimal solutions could be found, they would all be similar.
The two algorithms designed for ConsP also result in very different graphs, but this is primarily because the MSTADD algorithm is far more successful at finding a good solution than the SPUNION approach. The two algorithms had similar intersections with the most confident edges: 46% of MSTADD's edges were among the equally many most-confident edges, while for SPUNION the fraction was 41%.
Performance under uncertain distances
In the experiments conducted here, we set l(e)=(1−ρ)d(e) and u(e)=(1+ρ)d(e), 0≤ρ≤1. Here, d(e) denotes the distance corresponding to f(e) using the original frequencytodistance mapping. The parameter ρ determines the size of the distance range to consider. When ρ is set to 0, no flexibility is allowed and condition (4) reduces to condition (1); when ρ is set to 1, the lower bound for the distance of an edge e goes to 0, allowing all edges to satisfy condition (4). Therefore, with ρ=1 no edges will be removed by the filtering. We systematically sample ρ in the range of 0 to 1 in increments of 0.01.
Using the distance ranges [l(e),u(e)], we compute the total confidence included in the filtered graph as a function of the slack factor ρ (Figure 5). The total confidence for chromosome 1 gradually increases with ρ, and there is no ρ at which the total confidence of the output graph jumps suddenly. If some choice of ρ had led to significantly more total confidence, filtering with ranges defined by that ρ would make sense given independent evidence that the computed distances are uncertain by that factor. Here, however, a small nonzero ρ does not substantially increase the confidence of the resulting graph, indicating that few inconsistent edges are on the cusp of metric consistency.
Figure 5. Total and normalized confidences of included edges vs. ρ. Total confidences (top) and normalized confidences (bottom) of included edges for different metric filtering algorithms on the yeast chromosome 1 dataset. Normalized confidence is the total confidence divided by the number of edges in the output graph.
The linearly increasing total confidence observed in Figure 5 is largely explained by the distribution of violation magnitudes for edges included when ρ=0 (Figure 6). In any given error bin, a large fraction of the violating edges are already corrected at ρ=0, providing more evidence that the algorithms generally perform well with a strict distance mapping. We also observe that the edges corrected by the metric filtering are not associated with any particular magnitude of error: even though most of the violating edges have low error, they are not preferentially removed by the filtering. Because of this, the histogram of errors that remain to be resolved is more uniform in shape than the original distribution of errors, and we would therefore expect, as observed, a gradual increase in total confidence as ρ increases.
Figure 6. Distribution of edge violation magnitudes. Histogram of d(u,w)−[d(u,v)+d(v,w)] for all edges {u,w} in violating triangles in the original graph (total bar height) and the number of these edges that are included in the filtered graph when ρ=0 using the SETCOVER filtering (yellow portion). The blue portion of the bar represents the number of edges that are filtered out when ρ=0.
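The quantity binned in Figure 6, the violation magnitude d(u,w)−[d(u,v)+d(v,w)] of the longest edge in a violating triangle, can be computed with a sketch like the following (a hypothetical helper, not the authors' implementation; toy distances for illustration):

```python
# Sketch: violation magnitudes over all triangles of a distance-labeled graph.
from itertools import combinations

def violation_magnitudes(dist):
    """dist: dict mapping frozenset({u, v}) -> distance for every edge.
    Returns the positive gap d_long - (d_short1 + d_short2) for each
    violating triangle."""
    nodes = set().union(*dist)
    mags = []
    for u, v, w in combinations(sorted(nodes), 3):
        e = [frozenset({u, v}), frozenset({v, w}), frozenset({u, w})]
        if not all(x in dist for x in e):
            continue  # these three nodes do not form a triangle
        d = sorted(dist[x] for x in e)
        gap = d[2] - (d[0] + d[1])  # positive gap means a violation
        if gap > 0:
            mags.append(gap)
    return mags

dist = {frozenset({"a", "b"}): 1.0,
        frozenset({"b", "c"}): 1.0,
        frozenset({"a", "c"}): 3.0}
mags = violation_magnitudes(dist)  # one violating triangle, gap 1.0
```

A histogram of these gaps, split by whether each edge survives the filtering, reproduces the comparison described above.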
Of the two ConsP algorithms, MSTADD obtains higher total confidence across values of ρ, while SPUNION obtains higher average confidence on the edges in the output graph. In other words, SPUNION includes higher-quality edges, and MSTADD includes more edges. A ρ of 0.34 is needed before MSTADD achieves the total confidence that SETCOVER attains with ρ=0. This provides more evidence that the ConsP condition is too strict.
Interestingly, although MAXCUT and SETCOVER perform similarly for low values of ρ, SETCOVER obtains the highest total (and normalized) confidence under most values of ρ. The MAXCUT method fails to include more edges when the metric consistency condition is relaxed because the cut it finds in the first pass is not affected by the choice of ρ and this cut includes most of the edges it will eventually keep. This resistance to the flexibility provided by (4) is another reason that MAXCUT is less favorable than SETCOVER.
Practical running times of the algorithms
When applied to the largest yeast chromosome (chromosome 4), the SETCOVER and MAXCUT implementations take 21 seconds and 21.5 minutes, respectively, on an Opteron 8431 processor. The SPUNION and MSTADD methods take 2.25 minutes and under 2 days, respectively. The current implementation of MSTADD recomputes the shortest paths after every edge addition; this could be substantially sped up with a dynamic shortest-paths method. The SETCOVER implementation is fast enough to be run on the entire yeast genome, including interchromosomal interactions, within 5 hours. In this case, the ConsT filtering results in a significantly different edge set than the CRANK embedding (the intersection with CRANK contains only 350,960 of 657,177 edges). The SETCOVER solution also retains a relatively high average confidence (0.88) compared with the average confidence of the CRANK edges (0.97).
Conclusions
We have provided evidence that a filtering scheme for 3C data that uses both statistical confidence and metric consistency as criteria produces sets of interactions that are more embeddable, and creates more consistent and more biologically plausible estimations of the 3D structures of chromosomes. We show that such filtering is in general NP-hard, but by comparing to LP-based upper bounds, we empirically demonstrate that both a set cover approach and a hierarchical maximum cut algorithm produce nearly optimal solutions that avoid violated triangles. Finally, we demonstrate that the methods can be extended in a straightforward way to account for ranges of allowable distances.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
GD, RP, ES, HW and DF implemented the algorithms and performed the experiments. All authors helped design the algorithms and experiments, and helped write the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was partially supported by the National Science Foundation [CCF-1053918, EF-0849899, and IIS-0812111], the National Institutes of Health [1R21AI085376], and a University of Maryland Institute for Advanced Studies New Frontiers Award. C.K. received support as an Alfred P. Sloan Research Fellow.
References

1. Dekker J: Capturing chromosome conformation. Science 2002, 295(5558):1306-1311.
2. Marti-Renom MA, Mirny LA: Bridging the resolution gap in structural modeling of 3D genome organization. PLoS Comput Biol 2011, 7(7):e1002125.
3. Baù D: The three-dimensional folding of the α-globin gene domain reveals formation of chromatin globules. Nat Struct Mol Biol 2010, 18:107-114.
4. Lieberman-Aiden E: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326(5950):289-293.
5. Fudenberg G, Getz G, Meyerson M, Mirny L: High order chromatin architecture shapes the landscape of chromosomal alterations in cancer. Nat Biotechnol 2011, 29:1109-1113.
6. Duan Z: A three-dimensional model of the yeast genome. Nature 2010, 465(7296):363-367.
7. Kalhor R: Genome architectures revealed by tethered chromosome conformation capture and population-based modeling.
8. Sexton T: Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 2012, 148(3):458-472.
9. Tanizawa H: Mapping of long-range associations throughout the fission yeast genome reveals global genome organization linked to transcriptional regulation. Nucleic Acids Res 2010, 38(22):8164-8177.
10. Umbarger MA: The three-dimensional architecture of a bacterial genome and its alteration by genetic perturbation. Mol Cell 2011, 44(2):252-264.
11. Sanyal A, Lajoie BR, Jain G, Dekker J: The long-range interaction landscape of gene promoters. Nature 2012, 489:109-113.
12. McCord R: Correlated alterations in genome organization, histone methylation, and DNA-lamina interactions in Hutchinson-Gilford progeria syndrome. Genome Res 2012, in press.
13. Dixon J: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012, 485(7398):376-380.
14. Yaffe E, Tanay A: Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nat Genet 2011, 43(11):1059-1065.
15. Saxe J: Embeddability of weighted graphs in k-space is strongly NP-hard. In 17th Allerton Conference in Communications, Control and Computing. Monticello: Carnegie-Mellon University, Dept. of Computer Science; 1979:480-489.
16. Hochbaum DS (Ed): Approximation Algorithms for NP-hard Problems. Boston: PWS Publishing Co.; 1997.
17. Niedermeier R, Rossmanith P: An efficient fixed-parameter algorithm for 3-Hitting Set. J Discrete Algorithms 2003, 1:89-102.
18. Gomes C, Williams R: Approximation algorithms. In Search Methodologies: Introductory Tutorials in Optimization and Decision Support Techniques. Edited by Burke EK, Kendall G. New York: Springer; 2005.
19. Rousseau M: Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling.
20. Duggal G, Patro R, Sefer E, Wang H, Filippova D, Khuller S, Kingsford C: Resolving spatial inconsistencies in chromosome conformation data. In Proceedings of the 12th International Conference on Algorithms in Bioinformatics (WABI'12). Berlin, Heidelberg: Springer-Verlag; 2012:288-300.
21. Theobald DL, Wuttke DS: THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics 2006, 22(17):2171-2172.
22. Therizols P: Chromosome arm length and nuclear constraints determine the dynamic relationship of yeast subtelomeres. Proc Natl Acad Sci USA 2010, 107(5):2025-2030.