Abstract
In this paper, we study the problem of constructing perfect phylogenies for threestate characters. Our work builds on two recent results. The first result states that for threestate characters, the local condition of examining all subsets of three characters is sufficient to determine the global property of admitting a perfect phylogeny. The second result applies tools from minimal triangulation theory to the partition intersection graph to determine if a perfect phylogeny exists. Despite the wealth of combinatorial tools and algorithms stemming from the chordal graph and minimal triangulation literature, it is unclear how to use such approaches to efficiently construct a perfect phylogeny for threestate characters when the data admits one. We utilize structural properties of both the partition intersection graph and the original data in order to achieve a competitive time bound.
Keywords:
Perfect phylogeny; Chordal graph; Minimal triangulation; Minimal separatorBackground
In this paper, we study the problem of constructing phylogenies, or evolutionary trees, to describe ancestral relationships between a set of observed taxa. Each taxon is represented by a sequence and the evolutionary tree provides an explanation of branching patterns of mutation events transforming one sequence into another.
We will focus on the widely studied infinite sites model from population genetics, in which the mutation of any character can occur at most once in the phylogeny. Without recombination, the phylogeny is a tree called a perfect phylogeny. The problem of determining if a set of binary sequences fits the infinite sites model without recombination corresponds to determining if the data can be derived on a perfect phylogeny. A generalization of the infinite sites model is the infinite alleles model, in which any character can mutate multiple times but each mutation of the character must lead to a distinct allele (state). Again, without recombination, the phylogeny is tree, called a multistate perfect phylogeny. Correspondingly, the problem of determining if multistate data fits the infinitealleles model without recombination corresponds to determining if the data can be derived on a multistate perfect phylogeny.
Dress and Steel [1] and Kannan and Warnow [2] both give algorithms that construct perfect phylogenies for threestate characters when one exists. The goal of this work is to extend the results in [3] using the minimal separators of the partition intersection graph to create a three state construction algorithm that is competitive with Dress and Steel’s algorithm.
Notation and prior results
The input to our problem is a set of n taxa defined over a set of mcharacters. We denote the states of character χ^{i} by for 0 ≤ j ≤ r − 1. A species is any sequence s = s_{1}, s_{2}, …, s_{m }with for i = 1, 2, …, m. The ∗ denotes a missing value. χ^{i} can also be considered as a function mapping species to character states, writing χ^{i}(s) = s_{i}. In this paper, every taxon is a species without missing values ( is also called a set of full characters in the literature). We will consider the set of taxa as an n × m matrix M, where each row corresponds to a taxon and each column corresponds to a character (or site).
The perfect phylogeny problem is to determine whether the taxa defined by a matrix M can be displayed on a tree T such that
1. each taxon of M labels exactly one node in T,
2. each leaf in T is labeled by a taxon of M,
3. each node of T is labeled by a species,
4. for every character χ^{i} and for every state of character χ^{i}, the set of all nodes in T labeled by species whose state of character χ^{i} is forms a connected subtree of T.
Any tree satisfying conditions 1  4 is called a perfect phylogeny for M. Any character satisfying condition 4 is said to be compatible with T. The general perfect phylogeny problem (with no constraints on r, n, and m) is NPcomplete [4,5]. However, the perfect phylogeny problem becomes polynomially solvable (in n and m) when r is fixed. For r = 2, this follows from the Splits Equivalence Theorem [6,7]. For r = 3, Dress and Steel gave an O(nm^{2}) algorithm [1] and for r = 3 or 4, Kannan and Warnow gave an O(n^{2}m) algorithm [2]. For any fixed r, Agarwala and FernándezBaca gave an O(2^{3r}(nm^{3} + m^{4})) algorithm [8], which was improved to O(2^{2r}nm^{2}) by Kannan and Warnow [9].
Definition 2.1
[7,10]For a set of input taxa M, the partition intersection graph G(M) is obtained by associating a vertex for each character state and an edge between two verticesandif there exists a taxon s withand.
Note that by definition, there are no edges in the partition intersection graph between states of the same character. It will be useful to consider the partition intersection graph G(χ^{i}, χ^{j}, χ^{k}) of the submatrix of M defined by the three characters χ^{i}, χ^{j}, χ^{k}.
Definition 2.2
A graph H is chordal, or triangulated, if there are no induced chordless cycles of length four or greater in H.
See [11] and [12] for further details on chordal graphs.
Consider coloring the vertices of the partition intersection graph G(M) by colors 1, 2, …, m as follows. For each character χ^{i}, assign color i to the vertices . A pair of distinct vertices u,v of G(M) with the same color is called a monochromatic pair. A proper triangulation of the partition intersection graph G(M) is a chordal supergraph of G(M) such that every edge has endpoints with different colors. In [10], Buneman established the following fundamental connection between the perfect phylogeny problem and triangulations of the corresponding partition intersection graph.
Theorem 2.3
[7,10]A set of taxa M admits a perfect phylogeny if and only if the corresponding partition intersection graph G(M) has a proper triangulation.
A triangulation of a graph G is minimal if it does not have a proper subgraph that is also a triangulation of G. Theorem 2.3 can be restated in terms of proper minimal triangulations of G(M) because removing edges from a proper triangulation will preserve the coloring of the graph. If G(M) has a proper triangulation H, then a perfect phylogeny for M can be constructed from a clique tree of H. is a clique tree for a graph G if
1. the nodes of are in bijection with the maximal cliques of G,
2. for each vertex v of G, the maximal cliques containing v form a connected subtree of .
That is, given a clique tree for a proper triangulation H of G(M), we label each node by its corresponding maximal clique. Because H is properly colored, this maximal clique includes at most one state per character and therefore defines a species. Each taxon t defines a clique K_{t} of size m in G(M), and because H is a triangulation of G(M), K_{t} is a clique in H as well. Furthermore, H is a proper triangulation, so K_{t} is a maximal clique of H. For a clique tree , we label the node corresponding to K_{t} by t to obtain a perfect phylogeny for M. Conversely, if M has a perfect phylogeny T, then the species in T define a set of additional edges to obtain a proper triangulation for G(M). This is due to the following characterization of chordal graphs by the intersections of subtrees of a tree.
Theorem 2.4
[10,13]G is a chordal graph if and only if there is a tree T such that each vertex u of G induces a subtree T_{u} of T and uv is an edge of G if and only if subtrees T_{u} and T_{v} share at least one node.
In particular, if a pair of character states appear in the same species of a perfect phylogeny for M but not in any input taxon of M, this pair defines a fill edge to add to obtain a proper triangulation of the partition intersection graph. This fill edge preserves the proper coloring because intersecting subtrees from the same character would contradict conditions 3 and 4 of the perfect phylogeny definition.
To illustrate some of these notions, consider the example in Figure 1. The species with sequence 2100 defines a fill edge which is not an edge of G(M) (this is the only such fill edge). Nevertheless G(M) itself is chordal, and adding this fill edge would result in a proper triangulation that is not minimal.
Figure 1. Partition intersection graphs and perfect phylogenies. A 3state matrix M, partition intersection graph G(M), and perfect phylogeny T. There are no species with missing values in T.
In recent work, it is shown that there is a complete description of minimal obstruction sets for threestate characters analogous to a wellknown result on obstruction sets for binary characters (the four gamete condition) [3]. These results allow us to expand upon recent work of Gusfield [14] which uses properties of triangulations and minimal separators of partition intersection graphs to solve several problems related to multistate perfect phylogenies.
An (a,b)separator of a graph G is a set of vertices whose removal from G separates a and b. A minimal (a,b)separator is an (a,b)separator such that no proper subset is an (a,b)separator, and a minimal separator is a separator that is a minimal (a,b)separator for some pair of vertices a and b. For a set of vertices X, let GX be the induced subgraph of G after removing vertices X. If S and S^{′} are two minimal separators of G, we say S is parallel to S^{′} if there is a single connected component C of G − S^{′} such that S ⊆ C ∪ S^{′} (otherwise S and S^{′ }cross). A pair of vertices a and b cross S if S is an (a,b)separator. The neighborhood of a set of vertices X is N(X) = {v ∈ G − X : (u, v) ∈ E(G) for some u ∈ X}. A component C of GS is full if the neighborhood N(C) is equal to S. The following characterization of minimal separators is critical to our arguments.
Lemma 2.5
[15]Let S be a subset of vertices of graph G. Then S is a minimal separator of G if and only if GS has two or more full components.
In a colored graph, a legal separator is a separator such that no two vertices have the same color. Let Δ_{G} denote the minimal separators of graph G. For S ∈ Δ_{G}, we saturateS by adding edges between every pair of vertices in S to create a clique. For Q ⊆ Δ_{G}, G_{Q} denotes the graph obtained by saturating every S ∈ Q. The following theorem shows the connection between minimal triangulations and collections of parallel minimal separators of a graph.
Theorem 2.6
(Minimal Triangulation Theorem [1618]). Suppose Q ⊆ Δ_{G} is a maximal set of pairwise parallel minimal separators of G. Then G_{Q} is a minimal triangulation of G and . Conversely, if H is a minimal triangulation of G, then Δ_{H} is a maximal pairwise parallel set of minimal separators of G.
The following are necessary and sufficient conditions for the existence of a perfect phylogeny for data over arbitrary number of states. We refer the reader to [14] for the proofs.
Theorem 2.7
(Theorem 2 (MSP) [14]). For input M over r states (r ≥ 2), there is a perfect phylogeny for M if and only if there is a set Q of pairwise parallel legal minimal separators in G(M) such that every illegal minimal separator in G(M) is crossed by at least one separator in Q.
Theorem 2.8
(Theorem 3 (MSPN) [14]). For input M over r states (r ≥ 2), there is a perfect phylogeny for M if and only if there is a set Q of pairwise parallel legal minimal separators in partition intersection graph G(M) such that every monochromatic pair of nodes in G(M) is separated by some separator in Q.
For the special case of input M with characters over three states (r = 3), the partition intersection graph satisfies additional structure and the following theorems give necessary and sufficient conditions for the existence of a perfect phylogeny for M[3].
Theorem 2.9
[3]Given an input set M with at most three states per character (r ≤ 3), M admits a perfect phylogeny if and only if every subset of three characters of M admits a perfect phylogeny.
Furthermore, there is an explicit description of all minimal obstruction sets to the existence of a perfect phylogeny.
Theorem 2.10
[3]For input M over 3state characters, there exists a perfect phylogeny for M if and only if both of the following conditions hold:
1. for every pair of columns of M, the partition intersection graph induced by the columns is acyclic and
2. for every triple of columns of M, the partition intersection graphs induced by the columns does not contain any of the graphs shown in Figure 2 up to relabeling of the character states.
Figure 2. Minimal obstruction sets. Minimal obstruction sets for threestate characters up to relabeling. The boxes highlight the input entries that are identical for three of the obstruction sets.
This complete characterization of minimal obstruction sets allows us to simplify Theorem 2.8 in the case r = 3.
Theorem 2.11
[3]For input M on at most three states per character (r ≤ 3), there is a threestate perfect phylogeny for M if and only if the partition intersection graph for every pair of characters is acyclic and every monochromatic pair of vertices in G(M) is separated by a legal minimal separator.
Theorem 2.11 shows that the requirement of Theorem MSPN that the legal minimal separators in Q be pairwise parallel can be removed for the case of input data over threestate characters. The condition in Theorem 2.11 that the input is over three state characters is necessary, as there are examples showing that the theorem does not extend to data with fourstate characters.
All of the legal minimal separators for threestate input can be found in O(nm^{2}) time and the algorithm to check if each monochromatic pair is separated by a legal minimal separator can be performed during the algorithm for generating the legal minimal separators (see Section “Proper triangulation algorithm”). Therefore, the 3state perfect phylogeny decision problem can be solved in O(nm^{2}) time using minimal separators. However, it is not clear how minimal separators can be used to solve the construction problem in a similar time bound. In [14], Gusfield used the minimal separator approach and integer linear programming methods to solve both the decision and construction problem for kstate perfect phylogeny. Since integer linear programming methods in general do not have polynomial time bounds, this naturally leads to the following question: is there an O(nm^{2}) algorithm for the construction problem for 3state perfect phylogeny using the separator approach? In this paper, we answer in the affirmative, and show that any algorithm which explicitly computes the partition intersection graph has a time bound of at least O(nm + m^{2}).
We first study the structure of separators in the partition intersection graph for 3state input with the goal of answering this question. We first state two lemmas from [3].
Lemma 2.12
(Lemma 3.4 [3]). Let M be a set of input taxa with at most three states per character, and consider any three characters χ^{i}, χ^{j}, χ^{k}in M. If the partition intersection graph G(χ^{i}, χ^{j}, χ^{k}) is properly triangulatable, then the only possible chordless cycles in G(χ^{i}, χ^{j}, χ^{k}) are chordless 4cycles, with two colors appearing once and the remaining color appearing twice.
Lemma 2.12 implies that if a subset of three characters χ^{i}, χ^{j}, χ^{k} in M is properly triangulatable, then there is a unique set of edges F(χ^{i}, χ^{j}, χ^{k}) that must be added to triangulate the chordless cycles in G(χ^{i}, χ^{j}, χ^{k}). Construct a new graph G^{′}(M) on the same vertices as G(M) with edge set . G^{′}(M) is the partition intersection graph G(M) together with additional edges to properly triangulate all chordless cycles in each G(χ^{i}, χ^{j}, χ^{k}) for 1 ≤ i < j < k ≤ m (note these are the chordless 4cycles of G(M) on three colors). In G^{′}(M), edges from the partition intersection graph G(M) are called Eedges and edges that have been added as triangulation edges for some triple of columns are called Fedges.
Lemma 2.13
(Lemmas 4.2, 4.3, 4.7 [3]) Let M be a set of input taxa with at most three states per character, and suppose G(M) is properly triangulatable. Then G^{′}(M) cannot contain a chordless cycle with one or more Fedges. If C is a chordless cycle in G^{′}(M) with only Eedges, then C has length exactly four with four distinct colors.
Structure of separators
In this section, our goal is to study the relationship between minimal separators in G(M) and G^{′}(M) when M is a set of taxa over 3state characters. Our ultimate goal is to show that it suffices to consider only the legal minimal separators of G(M) while disregarding the illegal minimal separators. We first prove the following theorem on the separator structure of G^{′}(M).
Theorem 3.1
Let M be a set of taxa over 3state characters. M allows a perfect phylogeny if and only if G^{′}(M) (the partition intersection graph G(M) together with Fedges) does not contain any illegal minimal separators.
Proof
Suppose M allows a perfect phylogeny and suppose there is an illegal minimal separator S in G^{′}(M) with a monochromatic pair of vertices u and v. By Lemma 2.5, there exist two full components C, D of G  S, and by definition of a full component, there are paths connecting u and v in both C∪{u,v} and D∪{u,v}. Consider the shortest such paths P_{C} and P_{D} respectively (note that there are no chords within P_{C} and no chords within P_{D}). Since C and D are components separated by S, there are no edges between C and D. Also, u and v are not adjacent in G^{′}(M) since u and v have the same color and G^{′}(M) contains no illegal edges. This implies the union of P_{C} and P_{D} creates a chordless cycle. By Lemma 2.13, G^{′}(M) cannot contain any chordless cycles of length five or greater or chordless cycles with Fedges, so the union of the paths P_{C} and P_{D} must be a four cycle C and in particular, must be a cycle u → x → v → x^{′} → u, where u and v have the same color. C is a chordless four cycle in G(M) on at most three colors, which cannot occur since we have triangulated all such cycles by Fedges. This contradiction implies S cannot be an illegal minimal separator.
Now, suppose G^{′}(M) does not contain any illegal minimal separators. By Theorem 2.7, graph G^{′}(M) has a proper triangulation and since G(M) is a subgraph of G^{′}(M), G(M) also has a proper triangulation. It follows that M has a perfect phylogeny. □
This suggests that analyzing the minimal separators of G^{′}(M) suffices for 3state construction. However, the algorithm for enumerating the minimal separators of G(M) necessary for proper triangulations in O(nm^{2}) time uses M (rather than G(M)), and it is not clear if it is possible to extend this approach to enumerate the necessary minimal separators of G^{′}(M). In order to use techniques in [14], the the goal of our next two results will be to describe the relationship between the minimal separators of G^{′}(M) and the legal minimal separators of G(M) when M has a perfect phylogeny.
Lemma 3.2
Let M be a set of taxa over 3state characters allowing a perfect phylogeny. Then H is a proper minimal triangulation of G(M) if and only if H is a minimal triangulation of G^{′}(M).
Proof
Suppose H is a proper minimal triangulation of G(M). Each Fedge of G^{′}(M) comes from a chordless cycle of length four on three colors (see Lemma 2.12), so this edge must appear in any proper triangulation of G(M). Hence the Fedges must be edges of H, so G^{′}(M) ⊆ H and H is a proper triangulation of G^{′}(M). If H is not minimal with respect to G^{′}(M), there exists H^{′} such that G^{′}(M) ⊆ H^{′} ⊂ H and thus G(M) ⊆ H^{′} ⊂ H, contradicting the minimality of H with respect to G(M). Thus H is a minimal triangulation of G^{′}(M).
Conversely, suppose M allows a perfect phylogeny and H is a minimal triangulation of G^{′}(M). By Theorem 2.6, H = G^{′}(M)_{Q} for a set Q of maximal pairwise parallel minimal separators of G^{′}(M), and these minimal separators must be legal by Theorem 3.1. Every edge in H not in G(M) is either an Fedge of G^{′}(M) or a fill edge defined by Q, and in both cases it must be a legal fill edge. Therefore H is a proper triangulation of G(M). If there is some proper triangulation H^{′} of G(M) where G(M) ⊆ H^{′} ⊆ H then the Fedges of G^{′}(M) must be edges of H^{′}, otherwise H^{′} has a chordless four cycle. Thus H^{′} is a proper triangulation of G^{′}(M), and because H is a proper minimal triangulation of G^{′}(M) it must be that H^{′} = H. Therefore H is also a proper minimal triangulation of G(M). □
Let denote the set of legal minimal separators of G(M).
Theorem 3.3
Suppose M is a set of taxa on 3state characters that allows a perfect phylogeny. Then the legal minimal separators of G(M) are exactly the minimal separators of G^{′}(M) (i.e.,).
Proof
Assume M has a perfect phylogeny. Consider a minimal separator S of G^{′}(M), and suppose Q is a set of maximal pairwise parallel minimal separators of G^{′}(M) with S ∈ Q. Let H = G^{′}(M)_{Q}. H is a minimal triangulation of G^{′}(M) by Theorem 2.6, and H is a proper minimal triangulation of G(M) by Lemma 3.2. By Theorem 2.6, Q is precisely the set of minimal separators of H. Furthermore, because H is also a minimal triangulation of G(M), the same theorem states that Q is a subset of the minimal separators of G(M). Therefore S ∈ Δ_{G(M)}, so . Each minimal separator of G^{′}(M) is legal by Theorem 3.1. Hence .
Conversely, let . First we show that if no Fedge f of G^{′}(M) crosses S (i.e. f = xy where S separates x and y), then S is a minimal separator of G^{′}(M). Let C be a connected component of G(M) − S. C is still connected in G^{′}(M), and because no Fedge of G^{′}(M) crosses S, . Hence C is a connected component of G^{′}(M) − S. Further, we have only added edges to obtain G^{′}(M), so . Therefore if C is a full component of G(M) − S we have , and it is also a full component of G^{′}(M)−S. By Lemma 2.5, S is a minimal separator of G^{′}(M).
Now consider a minimal separator S^{′} of G(M). If an Fedge f = xy crosses S^{′}, there is a four cycle x → u → y → v → x in G(M) with monochromatic pair u,v, and further, u,v ∈ S^{′}. Hence S^{′} is illegal, and any legal minimal separator of G(M) is not crossed by any Fedge. From our previous argument, this implies . Therefore . □
The second half of the proof of Theorem 3.3 proves the following.
Corollary 3.4
Suppose M is a set of taxa on 3state characters that allows a perfect phylogeny. Ifthen C is a connected component of G(M) − S if and only if C is a connected component of G^{′}(M) − S.
We now prove the main result of this section.
Theorem 3.5
Suppose M is a set of taxa on 3state characters. Then M has a perfect phylogeny if and only if any maximal pairwise parallel set of legal minimal separators Q of G(M) induces a proper minimal triangulation G(M)_{Q }of G(M).
Proof
First, suppose that M has a perfect phylogeny, and let Q be a maximal pairwise parallel set of legal minimal separators of G(M). We show that G(M)_{Q} is a proper triangulation of G(M). By Theorem 3.3, Q is a maximal set of minimal separators of G^{′}(M), and they are pairwise parallel because the connected components of each minimal separator in Q are the same in G(M) and G^{′}(M) (Corollary 3.4). Hence H = G^{′}(M)_{Q} is a minimal triangulation of G^{′}(M) with minimal separator set Q (Theorem 2.6), and by Lemma 3.2, H is a proper minimal triangulation of G(M). Because Δ_{H} = Q, Theorem 2.6 implies Q is a maximal pairwise parallel set of minimal separators of G(M) and therefore H = G(M)_{Q}. Thus H = G(M)_{Q} is a proper minimal triangulation of G(M).
For the converse, pick any maximal pairwise parallel set of legal minimal separators Q of G(M) that induces a proper minimal triangulation G(M)_{Q} of G(M). Then M has a perfect phylogeny by Theorem 2.3. □
Proper triangulation algorithm
In this section, we build on techniques developed in [14] to generate the minimal separators of G^{′}(M) and their parallel relations in O(nm^{2}) time. This will allow us to use a greedy approach to pick a maximal pairwise parallel set of legal minimal separators. These minimal separators will then define a set of fill edges for a proper minimal triangulation, and a perfect phylogeny will be constructed in the form of a clique tree using Maximum Cardinality Search (MCS).
Lemma 4.1
[14]Let Q be a set of maximal pairwise parallel legal minimal separators of a partition intersection graph G(M). Then for each S ∈ Q, S < m.
Define . We first state our algorithm and then analyze the running time of each step.
Algorithm: proper triangulation for 3state characters
1. Stop if there is a pair of characters whose partition intersection graph contains a cycle.
2. Compute using proper clusters.
3. Stop if there is a monochromatic pair not separated by any legal minimal separator.
4. Compute the crossing relations for .
5. Greedily construct a maximal pairwise parallel subset Q of ; stop if Q has more than 2n − 3 minimal separators.
6. Add edges to G(M) to make each S ∈ Q a clique. Call this graph G_{Q}.
7. Use MCS to construct a clique tree for G_{Q}.
We proceed with a series of lemmas that will be used in Theorem 4.11 to show that each step is O(nm^{2}). The following simple observation is important for many of our time bounds.
Observation 4.2
Let M be a set of taxa whose characters have at most three states. Then G(M) has O(m) vertices (one vertex per state of each character) and O(m^{2}) edges.
Step 2 of the algorithm uses concepts from [2,8,9,14], which we detail here for completeness. A proper cluster is a bipartition of the taxa (i.e. the taxa are split into two disjoint nonempty sets) such that each character shares at most one state across the bipartition, and at least one character is not shared across this bipartition [8,9]. There are O(m) proper clusters when r is fixed. In particular, suppose χ is not shared across the bipartition of a proper cluster. Then the proper cluster also creates a bipartition of χ’s character states (see Figure 3). Hence, we can compute the set of proper clusters by exhaustively checking, for each character, if some bipartition of its states split the taxa into a proper cluster (there are O(2^{r}) ways to split each character).
Figure 3. Minimal separators and proper clusters. In this figure, the bipartition abcdef gives rise to the proper clusters ab and cdef. The shared character states form a legal minimal separator S in G(M). G(M) − S has three connected components, of which two are full (components C_{1} and C_{2}). The Spartition gives rise to the bipartition because t(C_{1}) = {a, b} and t(C_{2}) ∪ t(C_{3}) = {c, d, e, f}. is a clique tree for G(M) (in this case, G(M) happens to be chordal). T is obtained from by resolving the nodes labeled b, c, f. Note that S is represented in on edge bc because . For a clique tree of a chordal graph, every minimal separator of the chordal graph behaves this way [11,21]. In this sense, legal minimal separators are analagous to splitting vectors.
Proper clusters generate the minimal separators in as follows [14]. For a connected component C of G(M) − S, let t(C) be the set of taxa with characterstate for at least one . We will refer to the set of t(C) determined by the connected components of G(M) − S as the Spartition of the taxa. Recall S has at most m − 1 vertices by Lemma 4.1, so every taxon must have a characterstate that is not a vertex of S. Hence no taxon can have all of its characterstates as vertices of S. Additionally, each taxon defines a clique, so it cannot have vertices in more than one connected component of G(M) − S (this would define an edge between connected components). By Lemma 2.5, G(M) − S has two or more full components C_{1} and C_{2}. Place t(C_{1}) and t(C_{2}) in separate parts of the bipartition, then for the remaining connected components C of G(M) − S add t(C) to either part. This defines a bipartition where the shared character states (known as the splitting vector [9]) are exactly the vertices of S. To see this, suppose a characterstate is a vertex of S. Because C_{1} is a full component, there is a vertex adjacent to . Because these vertices are adjacent, and appear in the same row of M, which in turn is a taxon t_{1} of t(C_{1}). Similarly, there exists t_{2} ∈ t(C_{2}) such that χ_{i}(t_{2}) = j, so is shared in the bipartition. See Figure 3 for an illustration of these concepts. This implies that . The following two lemmas are special cases of those found in [14].
Lemma 4.3
[14]For any set of taxa M on 3state characters, can be computed in O(nm^{2}) time. Further, .
Proof
Our previous discussion proves that has at most O(m) minimal separators, so we focus on the running time. Let g be a proper cluster with splitting vector x and let S_{x} be the vertices of G(M) appearing as characterstates in x. Define the equivalence relation g/x by the transitive closure of the relation tRt^{′} if and only if there is a character χ^{i} where χ^{i}(t) = χ^{i}(t^{′}) = j and is not a shared character state in x; calculating g/x takes O(nm) time [9]. Given an equivalence class t of g / x, the vertices are a connected component of G(M) − S_{x}, and every connected component can be described in this way. For a connected component C of G(M) − S_{x}, the size of its neighborhood can be calculated using the t(C) rows of M (i.e. for t ∈ t(C), count the character states of t also in x, being careful not to overcount). if and only if there are distinct equivalence classes t and t^{′} that share all character states in x. For each equivalence class, we examine each taxon once, so this requires a single pass through every row of M and can be done in O(nm) time per proper cluster, so step 2 takes O(nm^{2}) time. □
In the proof of Lemma 4.3, we showed how to compute the Spartition of the taxa for in O(nm) time. It is now easy to calculate the connected components of G(M) − S: if t(C) is part of the Spartition, then C is obtained by listing the characterstates that appear in at least one t ∈ t(C) but not in S. This proves the following.
Lemma 4.4
[14]Let M be a set of 3state taxa and. There is an O(nm) algorithm that calculates the connected components of G(M) − S and determines which of these connected components is full.
Before discussing the running time required to compute crossing relations, we first state two structural lemmas on minimal separators; the second follows from a lemma in [19].
Lemma 4.5
[18]Let S and S^{′ }be nonparallel minimal separators. Then for each full component C of G − S^{′}, S has a vertex in C.
Lemma 4.6
(Lemma 3.10, [19]). Let S and S^{′ }be two minimal separators of a graph G. Then S and S^{′ }are parallel if and only if there exists a full component C_{S }of G − S and a connected componentof G − S^{′ }such that.
Because of the slight change from Lemma 3.10 in [19] and for completeness, we give a proof of Lemma 4.6.
Proof
Suppose S and S^{′} are parallel. Since S is a minimal separator, there are at least two full components in G − S and because S^{′} is parallel to S, there is a full component C_{1} of G − S that does not intersect S^{′}. C_{1} is connected in G − S^{′}, so there is a connected component C of G − S^{′} containing C_{1}.
Now, suppose there are C_{S} and satisfying the conditions of the lemma. Then , implying that S and S^{′}are parallel. □
Lemma 4.7
There is an O(nm^{2}) algorithm to calculate the crossing relations of.
Proof
Let . We begin by showing that S and S^{′} are parallel if and only if there is a full component C of G(M) − S and connected component C^{′} of G(M) − S^{′} such that t(C) ⊆ t(C^{′}) (i.e. t(C) is contained in a single part of the S^{′}−partition). Suppose S and S^{′} are parallel. From Lemma 4.6, there are connected components C of G(M) − S and C^{′} of G(M) − S^{′} such that C ⊆ C^{′} and consequently t(C) ⊆ t(C^{′}).
Conversely, assume that S and S^{′}are not parallel. Let C_{1} be a full component of G(M) − S and C_{2} be a full component of G(M) − S^{′}. By Lemma 4.5, there is a vertex , and because C_{2} is full, there is a u ∈ C_{2} ∩ N(v). The taxa form an edge clique cover for G(M), so there is a taxon t having both character states corresponding to u and v. Note v ∈ C_{1} so t ∈ t(C_{1}) and u ∈ C_{2} so t ∈ t(C_{2}). S^{′} has at least two full components, and repeating this argument yields another full component of G(M) − S^{′} such that . Thus t(C_{1}) shares at least one taxon with at least two parts of the S^{′}−partition, so t(C_{1}) is not contained within any single part of the S^{′}−partition. This proves our characterization of parallel minimal separators of .
It suffices to check for each full component C of G(M) − S and connected component C^{′} of G(M) − S^{′} if t(C) ⊆ t(C^{′}). There are O(m^{2}) pairs of legal minimal separators, and this check takes O(n) time (O(nm^{2}) time overall) when the Spartition has been calculated for each . □
It is critical for our time bound that any proper minimal triangulation of G(M) have O(n) minimal separators because this impacts the computation of edges contained in the proper minimal triangulation. A result bounding the number of minimal separators in an earlier version of this paper (Lemma 7 in [20]) was incorrect, as demonstrated in Figure 4. We present a corrected bound for the number of minimal separators in the following Lemma.
Figure 4. Bounding minimal separators of proper minimal triangulations. An example of the bound in Lemma 4.8. H has five minimal separators, obtained from every pair of vertices from the set except . There are four taxa, so Lemma 4.8 gives an upper bound of five minimal separators. Therefore, the bound in Lemma 4.8 is tight for n = 4 taxa.
Lemma 4.8.
Suppose that H is a proper minimal triangulation of G(M). Then H has at most 2n − 3 minimal separators.
Proof
Let be a clique tree of H. Recall that the nodes of are in bijection with the maximal cliques of H. To make this correspondence explicit, for each node x of we will write K_{x} to mean the maximal clique of H that corresponds to x. A classic result in chordal graph theory says that if S ∈ Δ_{H}, there is an edge xy of such that S = K_{x} ∩ K_{y}[11,21]. Therefore the number of minimal separators in H is at most the number of edges of .
First, consider any leaf a of . We claim that K_{a} contains a vertex of G that is not in any other maximal clique of G (this fact is well known in the chordal graph literature [22], but we prove it here for completeness). Suppose a^{′}is the neighbor of a in . By maximality, so there is a vertex v of H that is contained in K_{a} but not contained in . If v is contained in a maximal clique of G that is not K_{a}, then the second property of clique trees implies that as well. Hence v is only contained in K_{a}, proving the claim. Further, v is some characterstate , and there is a taxon t of M such that χ^{i}(t) = j. Taxon t can only label a because no other node of corresponds to a maximal clique that contains . Thus for each leaf of there is a unique taxon that labels it.
To complete the proof, we show a similar result for internal nodes of with degree two. Let z be such a node with neighbors z_{1} and z_{2}. If z contains a vertex that is only contained in z^{′}s maximal clique K_{z}, our previous argument shows that z can be labeled by a unique taxon. Suppose this is not the case. Let for i = 1, 2. It must be that K_{z} = S_{1} ∪ S_{2} because we are considering the case when K_{z} does not contain a unique vertex. Further, we cannot have S_{1} ⊆ S_{2} since otherwise would not be maximal. Similarly, S_{2} ⫅ S_{1}. Pick u_{1} ∈ S_{1} − S_{2} and u_{2} ∈ S_{2} − S_{1}, noting that and . We argue that K_{z} is the only maximal clique containing both u_{1} and u_{2}. This is because if any other maximal clique K contains both vertices, then either or is on the path from K_{z} to K in (K has degree two) and by the second property of clique trees, this maximal clique also contains both vertices. Further, because each S ∈ Δ_{H} is of the form S = K_{x} ∩ K_{y} for an edge xy of , there is no minimal separator of H containing both u_{1} and u_{2}. By Theorem 2.6, is an edge of G(M) (i.e. it is not a fill edge) because H is a minimal triangulation of G(M), so all fill edges come from saturating each S ∈ Δ_{H}. Therefore there is a taxon t^{′} of M such that and . As in the unique vertex case, z is the unique node with label t^{′}.
Therefore any node of with degree at most two is labeled by a unique taxon, implying there are at most n such nodes. Any tree containing at most n leaves and internal nodes of degree two has at most 2n − 3 edges. Hence has at most 2n − 3 edges, and in turn H has at most 2n − 3 minimal separators, proving the bound. □
Remark
The proof of Lemma 4.8 requires minimality of the triangulation, but it does not require that M lacks missing values or that the number of states for each character is bounded.
This Lemma along with the fact that each has fewer than m vertices gives the following result.
Lemma 4.9
Suppose that H is a proper minimal triangulation of G(M) obtained by saturating a maximal pairwise parallel legal set of minimal separators Q. Then H has O(n) minimal separators, O(m) vertices, and O(m^{2}) edges. Furthermore, H can be calculated in O(nm^{2}) time.
Proof
The minimal separator bound follows from Lemma 4.8, and the vertex and edges bounds follow from Observation 4.2 and the fact that H and G(M) have the same vertex set. In order to calculate H, we must calculate the fill edge set E(H) − E(G(M)). Recall that, by Theorem 2.6, the fill edges of H are obtained by saturating each minimal separator in Q. Each S ∈ Q has fewer than m vertices by Lemma 4.1 and Q = O(n) by Lemma 4.8. It is straightforward to check for each S ∈ Q and each pair u, v ∈ S if uv defines a fill edge with an amortized running time of O(nm^{2}). □
In [23], Tarjan and Yannakakis developed Maximum Cardinality Search (MCS), which recognizes chordal graphs in linear time. Blair and Peyton [11] showed how MCS can be used to construct a clique tree for a chordal graph while retaining the linear time bound.
Lemma 4.10
[11]Let G be a chordal graph. Then Maximum Cardinality Search (MCS) can be implemented to produce a clique treeof G with running time O(V(G) + E(G)).
Combining these lemmas show that our minimal separator algorithm for constructing perfect phylogenies for r = 3 is competitive with the algorithm of Dress and Steel [1], giving our main result.
Theorem 4.11
The algorithm Proper Triangulation for 3State Characters runs in O(nm^{2}) time.
Proof
The first step can be implemented in O(m^{2}) time as follows. Each pair of characters has a partition intersection graph with at most six vertices, and it is straightforward to check for cycles. There are O(m^{2}) such pairs of characters. Lemma 4.3 states that step two takes O(nm^{2}) time. For the third and fourth step, we first compute the connected components of G(M) − S for each . Lemmas 4.3 and 4.4 tell us there are O(m) computations that require O(nm) time, so computing all the sets of connected components takes O(nm^{2}) time. There are O(m) monochromatic pairs (three pairs per character), and for each monochromatic pair we check the connected components of each and ensure at least one of these minimal separators is a separator. Hence step three takes O(m^{2}) time. Lemma 4.7 shows that step four has a running time of O(nm^{2}). Step five runs in O(nm) time due to the bounds in Lemmas 4.3 and 4.8. That is, after picking a minimal separator S to be in Q, there are O(m) minimal separators that can cross S and we repeat this process O(n) times to construct Q. Constructing G_{Q} was shown to take O(nm^{2}) time in Lemma 4.9. Lemma 4.9 shows that O(V(G_{Q}) + E(G_{Q})) = O(m^{2}) so using MCS in step 7 takes O(m^{2}) time. Hence each step and the preprocessing for each step takes at most O(nm^{2}) time, so the algorithm takes at most O(nm^{2}) time. □
Large partition intersection graphs
Ideally, one would like to find an O(n^{2}m) or O(nm) algorithm for 3state perfect phylogeny (i.e., m is squarefree). In this section, we will construct a family of 3state matrices M that have a perfect phylogeny and Λ(m^{2}) edges in G(M). This discourages attempts to improve our time bound using an approach that explicitly computes the partition intersection graph.
Any 3state character compatible with a perfect phylogeny can be obtained from choosing any two edges of the phylogeny, removing them, and using the three resulting subtrees to define each taxon’s state for that character. 2state characters are obtained in a similar manner, removing a single edge instead of two edges. Therefore, if a 3state matrix M with distinct columns (up to relabeling) has a perfect phylogeny, .
Consider the tree T with taxa t_{1}, t_{2}, …, t_{n} as depicted in Figure 5, and suppose i < j. We construct the character χ^{(i,j)} using the partition {t_{1}, t_{2}, …, t_{i}}, {t_{i + 1}, t_{i + 2}, …, t_{j}}, {t_{j + 1}, t_{j + 2}, …, t_{n}} as in Figure 5. Each set in the partition is called the cell 0, cell 1, and cell 2 of χ^{(i,j)}, respectively. That is, χ^{(i,j)}(t_{1}) = 0, χ^{(i,j)}(t_{i + 1}) = 1, χ^{(i,j)}(t_{j + 1}) = 2, and so on. Let M^{∗} be the matrix whose columns are the characters χ^{(i,j)} for 1 ≤ i < j < n. T is clearly a perfect phylogeny for M^{∗}, and . Next, we show that G(M^{∗}) has Λ(m^{2}) edges.
Figure 5. Characters of a perfect phylogeny with a large partition intersection graph. A 3state character created using “intervals” of taxa from a fully resolved tree T. The 0^{th} piece of χ^{(i,j)} is the interval .
Observation 5.1
Let χ^{(i,j) }andbe distinct characters of M^{∗}. Thenis an edge of G(M^{∗}) iff cell k of χ^{(i,j) }and the cell k^{′}ofhave a nonempty intersection (i.e. share a taxon).
For example, the cell 1 of χ^{(3,5)} and cell 1 χ^{(4,6)} share taxon t_{5} so is an edge in G(M^{∗}). In contrast, cell 0 of χ^{(3,5)} and cell 1 of χ^{(4,6)} do not share any taxa, so is not an edge in G(M^{∗}). Consider the characters χ^{(i,j)} and for distinct i, i^{′}, j, and j^{′}. There are at least pairs of these characters, and each such pair provides at least one edge to G(M^{∗}) because both cell 0 of χ^{(i,j)} and cell 0 of share t_{1}. Therefore G(M^{∗}) has at least o(n^{4}) = o(m^{2}) edges. There are at most edges in any partition intersection graph, so G(M^{∗}) has Λ(m^{2}) edges, and reading each entry of M to compute G(M) requires at least nm time. Hence any construction algorithm that explicitly computes the partition intersection graph requires at least O(nm + m^{2}) time.
Conclusions
We have demonstrated how to use the minimal separator approach introduced in [14] to construct a perfect phylogeny for 3state data in O(nm^{2}) time. We also constructed a 3state matrix M with a perfect phylogeny that has Λ(m^{2}) edges. Thus, any explicit analysis of the edges of G(M) or of a proper triangulation of G(M) is inadequate to speed up our approach. Faster proper triangulation algorithms should use M for computation instead of G(M) aided with theoretical results about G(M). Constructing tree representations in order to minimally triangulate a graph without explicitly computing the fill edges was studied in [19] in order to achieve a faster time bound, and it would be interesting to see if these ideas can be extended to find a faster construction algorithm for 3state perfect phylogeny.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The authors wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
This research was partially supported by NSF grants CCF0515378, IIS0803564, and CCF1017580. We thank an anonymous referee for careful reading and valuable comments.
References

Kannan S, Warnow T: Inferring evolutionary history from DNA sequences.
SIAM J Comput 1994, 23:713737. Publisher Full Text

Lam F, Gusfield D, Sridhar S: Generalizing the four gamete condition and splits equivalence theorem: perfect phylogeny on three state characters.
SIAM J Discrete Math 2011, 25:11441175. Publisher Full Text

Bodlaender H, Fellows M, Warnow T: Two strikes against perfect phylogeny.
Automata, Languages and Programming, 19th International Colloquium, ICALP 1992, Vienna, Austria, July 1317, 1992, Proceedings. LNCS 1992, 623:273283.

Steel MA: The complexity of reconstructing trees from qualitative characters and subtrees.
J Classification 1992, 9:91116. Publisher Full Text

Gusfield D: Efficient algorithms for inferring evolutionary trees.
Networks 1991, 21:1928. Publisher Full Text

Semple C, Steel M: Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications. New York: Oxford University Press; 2003.

Agarwala R, FernandezBaca D: A polynomialtime algorithm for the perfect phylogeny problem when the number of character states is fixed.
SIAM J Comput 1994, 23:12161224. Publisher Full Text

Kannan S, Warnow T: A fast algorithm for the computation and enumeration of perfect phylogenies.
SIAM J Comput 1997, 26:17491763. Publisher Full Text

Buneman P: A characterisation of rigid circuit graphs.
Discrete Math 1974, 9:205212. Publisher Full Text

Blair J, Peyton B: An introduction to chordal graphs and clique trees.
IMA Volumes in Mathematics and its Applications 1994, 56:127.

Dirac GA: On rigid circuit graphs.
Abh Math Sem Univ Hamburg 1961, 25:7176. Publisher Full Text

Gavril F: The intersection graphs of subtrees in trees are exactly the chordal graphs.
J Comb Theory, Ser B 1974, 16:4756. Publisher Full Text

Gusfield D: The multistate perfect phylogeny problem with missing and removable data.
J Comput Biol 2010, 17:383399. PubMed Abstract  Publisher Full Text

Golumbic M: Algorithmic graph theory and perfect graphs. Volume 57 of Annals of Discrete Mathematics. Essex, UK: Elsevier Science Publishers; 2004.

Kloks T, Kratsch D, Spinrad J: On treewidth and minimum fillin of asteroidal triplefree graphs.
Theor Comput Sci 1997, 175(2):309335. Publisher Full Text

Parra A, Scheffler P: How to use the minimal separators of a graph for its chordal triangulation.
Automata, Languages and Programming, 22nd International Colloquium, ICALP 1995, Szeged, Hungary, July 1014, 1995, Proceedings. LNCS 1995, 944:123134.

Parra A, Scheffler P: Characterizations and algorithmic applications of chordal graph embeddings.
Discret Appl Math 1997, 79:171188. Publisher Full Text

Berry A, Bordat JP, Heggernes P, Simonet G, Villanger Y: A widerange algorithm for minimal triangulation from an arbitrary ordering.
J Algorithms 2006, 58:3366. Publisher Full Text

Gysel R, Lam F, Gusfield D: Constructing perfect phylogenies and proper triangulations for threestate characters.
Algorithms in Bioinformatics, 11th International Workshop, WABI 2011, Saarbrücken, Germany, September 57, 2011, Proceedings. LNCS 2011, 6833:104115.

Ho C, Lee R: Counting clique trees and computing perfect elimination schemes in parallel.
Inf Process Lett 1989, 31:6168. Publisher Full Text

Blair J, Peyton B: On finding minimumdiameter clique trees.

Tarjan R, Yannakakis M: Simple lineartime algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs.
SIAM J Comput 1984, 13:566579. Publisher Full Text