Abstract
Background
Reconciliation methods compare gene trees and species trees to recover evolutionary events such as duplications, transfers and losses explaining the history and composition of genomes. It is wellknown that gene trees inferred from molecular sequences can be partly erroneous due to incorrect sequence alignments as well as phylogenetic reconstruction artifacts such as long branch attraction. In practice, this leads reconciliation methods to overestimate the number of evolutionary events. Several methods have been proposed to circumvent this problem, by collapsing the unsupported edges and then resolving the obtained multifurcating nodes, or by directly rearranging the binary gene trees. Yet these methods have been defined for models of evolution accounting only for duplications and losses, i.e. can not be applied to handle prokaryotic gene families.
Results
We propose a reconciliation method accounting for gene duplications, losses and horizontal transfers, that specifically takes into account the uncertainties in gene trees by rearranging their weakly supported edges. Rearrangements are performed on edges having a low confidence value, and are accepted whenever they improve the reconciliation cost. We prove useful properties on the dynamic programming matrix used to compute reconciliations, which allows to speedup the tree space exploration when rearrangements are generated by Nearest Neighbor Interchanges (NNI) edit operations. Experiments on synthetic data show that gene trees modified by such NNI rearrangements are closer to the correct simulated trees and lead to better event predictions on average. Experiments on real data demonstrate that the proposed method leads to a decrease in the reconciliation cost and the number of inferred events. Finally on a dataset of 30 k gene families, this reconciliation method shows a ranking of prokaryotic phyla by transfer rates identical to that proposed by a different approach dedicated to transfer detection [BMCBIOINF 11:324, 2010, PNAS 109(13):4962–4967, 2012].
Conclusions
Prokaryotic gene trees can now be reconciled with their species phylogeny while accounting for the uncertainty of the gene tree. More accurate and more precise reconciliations are obtained with respect to previous parsimony algorithms not accounting for such uncertainties [LNCS 6398:93–108, 2010, BIOINF 28(12): i283–i291, 2012].
A software implementing the method is freely available at http://www.atgcmontpellier.fr/Mowgli/ webcite.
Keywords:
Evolution; Reconciliation; Gene Tree Correction; Method; Software; Duplication; Transfer; Loss; Nearest Neighbor InterchangeBackground
A phylogenetic tree or phylogeny is a tree depicting evolutionary relationships among biological entities that are believed to have a common ancestor. A gene family is a group of genes descending from a common ancestor, that retains similar sequences and often similar functions [1]. A species tree depicts the evolutionary history of a group of species, whereas a gene tree depicts the evolutionary history of a gene family. Gene trees often differ from the species tree due to familyspecific evolutionary events such as gene duplications, gene losses and horizontal gene transfers. By comparing a gene tree with the species tree, reconciliation methods try to recover those major evolutionary events. Reconciliation is indeed the process of constructing a mapping between a gene tree and a species tree to explain their differences and similitudes with evolutionary events such as speciation (), duplication (), loss (), and horizontal gene transfer () events. Reconciliations are most often inferred on the basis of a parsimony criterion: a cost is given to each event type, the total cost of a reconciliation is the sum of the costs of the individual events it uses, and a reconciliation of minimum total cost is sought for. This computational problem is often called Most Parsimonious Reconciliation, or MPR in short, and many works have been devoted to it recently [28].
The first proposed models focused on parsimonious reconciliations involving only duplications and losses (the DL model) [911] or only horizontal transfers and losses [12]. Probabilistic methods have also been developed for the DL model, such as that of Arvestad et al. [13] (see Doyon et al. [14] for a review). Most recent works using a parsimony approach have been devoted to models incorporating duplications, losses and transfers all together (the DTL model) [2,4,5,8], which is necessary to handle prokaryotes. When accounting for transfer events, the history proposed by a reconciliation is consistent if, for any transfer, the donor and receiver species coexist. Ensuring such a time consistency is difficult and leads to an NPhard problem in the general case [7,15] which cannot be solved by just examining couples of species tree edges. However, in the case divergence dates are available for nodes of the species tree, the problem becomes amenable [2,16]. The difficulty to handle transfers has led to a split within proposed DTL methods, namely those that ensure timeconsistency [2,16] and those that do not [3,4,7]. The fastest parsimony algorithms for the later category runs in O(mn logn) where m and n are the sizes of the gene and species trees respectively [3], while the fastest timeconsistent algorithm runs in O(mn^{2}) [2]. Probabilistic methods also have been extended recently to the DTL model. Inspired by the work of Tofigh [17], Szőllösi et al. recently proposed a timeconsistent procedure to estimate the species tree by reconciliations from a set of gene trees [18].
A major problem, when applying reconciliation methods, is that parts of the gene trees can be incorrect. This leads reconciliation methods to overestimate (), (), () and () events [19,20]. Errors within a binary gene tree can be due to sequence alignment problems, phylogenetic reconstruction artifacts (e.g. long branch attraction) or a lack of phylogenetic signal (especially for genes encoded by short sequences). Such phenomena are wellknown in phylogenetics and several support measures, such as bootstrap values or bayesian posterior probabilities, have been proposed to detect unreliable edges in a gene tree. Up to now, very few works have tackled the reconciliation problem in the presence of unsupported edges, and most of them consider only the DL model [19,2126]. Durand et al. proposed an exponential exact algorithm to find the best rearrangement of a gene tree while preserving its strongly supported edges [19]. Another approach is to collapse unsupported edges, thereby creating nodes with more than two children (i.e., polytomies), and then to rely on a generalization of the least common ancestor mapping (LCA) to avoid the need for examining all possible binary rearrangements of the polytomies [2123,26]. In this way, Chang et al. and Lafond et al. proposed polynomial time algorithms to solve the MPR problem for a binary species tree and a nonbinary gene tree [22,26]. When both the species tree and the gene tree are nonbinary, Berglund et al. proved that finding a refinement of the gene tree using less than a given number of duplications is an NPcomplete problem [21]. They also proposed a heuristic approach to refine the gene tree by first minimizing duplications and then losses. Zheng et al. showed that minimizing together duplication and loss costs is NPhard for reconciling a nonbinary species tree with a binary gene tree [25]. For this specific case, Vernot et al. proposed a fixed parameter tractable (FPT) algorithm whose complexity is exponential only in the maximum degree of nodes [23]. More recently, Stolzer et al. extended this FPT algorithm by allowing transfers [27].
Overall, several works relied on tree edit operations to deal with uncertainties in the gene trees. Durand et al. used Nearest Neighbor Interchange (NNI) edit operations to rearrange the local topology of the gene trees in the regions of low supports [19]. Górecki and Eulenstein proposed an efficient algorithm to do a similar task and at the same time root the gene trees, while restraining their search to trees that are at most k NNI moves away from the original gene trees [28]. Chaudhary et al. investigated Subtree Prune and Regraft (SPR) and Tree Bisection and Reconstruction (TBR) edit operations to search for the gene tree rearrangement that minimizes the number of duplications, regardless of losses [24].
It seems hard to have an exact polynomial time algorithm for the MPR problem under the DTL model even when the polytomies are present only in the gene tree or in the species tree. Following the works cited above to deal with uncertainties in the gene trees, we propose a heuristic method relying on NNI edit operations to search for a gene tree rearrangement that preserves strongly supported edges and minimizes the cost of reconciliation to a fixed binary species tree, but in the context of the more complex DTL model. The resulting dynamic program, called MowgliNNI, is a generalization of Mowgli[2], a program initially developed for fixed binary gene trees.
Experiments on simulated data show that MowgliNNI provides a gene tree that is closer to the true evolutionary history of the gene family, and leads to more accurate , and predictions. Experiments on real data show a significant decrease in the number of predicted events and an increased precision, that is a decrease in the number of equally most parsimonious reconciliations. We conducted a large scale experiment where 30 k prokaryotic gene families covering several phyla were reconciled using MowgliNNI. These phyla were then ordered according to their inferred transfer rate. We obtained the same phyla ordering as the one obtained using Prunier, a method dedicated to transfer prediction [29,30], and our reconciliation based approach has the advantage of providing extra information: explicit donor and receiver branches for transfers, prediction and localization of duplications and losses.
Methods
Basic notations
Trees considered in this paper are rooted and labeled at their leaves only, each leaf being labeled with the name of a studied species. Given a tree T, its node set, edge set, leaf node set and root are resp. denoted V(T), E(T), L(T) and r(T). The label of a leaf u of T is denoted by and the set of labels of leaves of T is denoted by . When a node u has two children, they are denoted u_{1} and u_{2}.
Given two nodes u and v of T, we write u ≤ _{T}v (resp. u < _{T}v) if and only if v is on the unique path from u to r(T) (resp. and u ≠ v); if neither u < _{T}v nor v < _{T}u then u and v are said to be incomparable. As we consider rooted trees T only, we adopt the convention that an edge denoted (v,u) means that u < _{T}v. For a node u of T, T_{u} denotes the subtree of T rooted at u, p(u) the parent node of u, while (u_{p},u) is the parent edge of u. A tree T^{′} is a refinement of a tree T if T can be obtained from T^{′} by collapsing some edges in T^{′}, i.e. by merging the two extremitites of these edges [31].
A species tree is a rooted binary tree depicting the evolutionary relationships of ancestral (internal nodes) species leading to a set of extant (leaf) species. A species tree S is considered here to be dated, that is associated to a time function such that if y < _{S}x then θ (y) < θ (x). Such times are usually estimated on the basis of molecular sequences [32] and fossil records. Note that to ensure the time consistency of inferred transfers, absolute dates are not required, the important information being the ordering of the nodes of S induced by the dating.
Given a dated binary species tree S, the reconciliation model we rely on considers a variant S^{′} of S called a subdivision (as done also in [2,6,17]), associated to a time function . More precisely, for each node x ∈ V(S) ∖ L(S) and each edge (y_{p},y) ∈ E(S) s.t. θ_{S}(y_{p}) > θ_{S}(x) > θ_{S}(y), an artificial node w is inserted along the edge (y_{p},y), with (see Figure 1). Note that the height of S^{′} nodes (i.e. the number of their ancestors) is a valid time function that preserves the same partial order on nodes as and that the restriction of this time function to V(S)⊆V(S^{′}) preserves the partial order induced by θ_{S}.
Figure 1. Example of subdivision of a species tree. The subdivision S^{′} is obtained from the species tree S by splitting some of its edges thanks to additional artificial nodes (∘), i.e. nodes with a single child. These nodes are added on edges at the precise date a node appears somewhere else in S. For instance here, the artificial node s_{1} is placed at the same date as the node s_{2} of S, while s_{4} and s_{5} are placed at the same date as s_{3}.
A gene tree is a rooted binary tree depicting the evolutionary history of a gene family, that lead to a set of homologous sequences observed in current organisms. Each leaf of the gene tree has a unique label, corresponding to specific extant sequences of the gene. Indeed, several leaves of a gene tree can be associated to a same species due to duplication and transfert events. We denote by s(u) the species associated to leaf u ∈ V(G).
A gene tree Gwith supports is a gene tree whose internal edges each have a support value. Let wk_{t}(G) ⊆ E(G) be the set of edges having a support value weaker than threshold t and let str_{t}(G) be E(G)  wk_{t}(G), that is the edges having a support equal or stronger than t.
Reconciliation model
Reconciling a gene tree G with a species tree S means building a mapping α that associates each gene g ∈ V(G) to a sequence of nodes in V(S), namely the species in which the sequence g evolved. This evolution is submitted to different kinds of biological events such as speciation, duplication and transfer. The following definition presents a discrete models of this evolution.
Definition 1 (Reconciliation model). Consider a gene tree G, a species tree S with a time function θ_{S}, and its subdivision S^{′} with a time function θ_{S′}.
Let α be a function that maps each node u of G onto an ordered sequence of nodes of S^{′}, denoted α(u). For u ∈ V(G), let ℓ denote the length of α(u) and let α_{i}(u) be its i^{th} element (where 1 ≤ i ≤ ℓ). α is said to be a reconciliation between G and S^{′} if and only if exactly one of the following atomic events occurs for each couple of nodes u of G and α_{i}(u) of S^{′} (where α_{i}(u) is denoted x):
• x is the last vertex α_{ℓ}(u) and exactly one of the cases below is true.
1. u∈L(G), x ∈ L(S^{′}) and . ( event)
2. x is not artificial and {α_{1}(u_{1}),α_{1}(u_{2})} = {x_{1},x_{2}}. ( event)
3. α_{1}(u_{1})=α_{1}(u_{2}) = x. ( event)
4. α_{1}(u_{1}) = x, and α_{1}(u_{2}) = x^{′} is such that x^{′} ≠ x and . ( event)
• otherwise, one of the cases below is true.
5. x is an artificial vertex and α_{i+1}(u) is its only
child. ( event)
6. x is not artificial and α_{i+1}(u) ∈ {x_{1},x_{2}}. ( event)
7. α_{i+1}(u) = x^{′} is such that x^{′} ≠ x and . ( event)
The combinatorial events mentioned above (, , , , , , ) are those defined in [2]. See Figure 2 for an illustration of these events and Figure 3 for an example of reconciliation according to this model.
Figure 2. The events of the reconciliation DTL model (Definition 1). Each possible event is displayed for a node u of G and a node x of the subdivided species tree S^{′} on which u is mapped. Note that a same node u can be mapped to several nodes. As a result of the mapping of its nodes, the gene tree G, extended here with losses induced by the mapping (‡), is embedded in S^{′} (here dashed lines represent edges of G, and plain lines those of S^{′}, grayed rectangular zones represent nodes of S^{′}).
Figure 3. Example of a DTL reconciliation. (a) A gene tree G, represented with lost copies of the gene (∘), and a subdivided species tree S^{′}. (b) A reconciliation α between G and S^{′}. The reconciliation maps each node of G onto a sequence of nodes in S^{′}, inducing evolutionary events. For instance, nodes w,d_{1} and u from G are mapped as follows: α(w) = [y], where α_{1}(w) = y is an event; α(d_{1}) = [x^{′},x,D], where α_{1}(d_{1}) = x^{′} is a event, α_{2}(d_{1}) = x is an event, and α_{3}(d_{1}) = D is a event; α(u) = [y^{′},x,C], where α_{1}(u) = y^{′}, α_{2}(u) = x, and α_{3}(u) = C are respectively a , an , and a event.
Note that among these events, and are in fact a combination of two independent biological events. However, the fact that a loss is always taken into account jointly with another event allows to obtain a recursive algorithm and is done without loss of generality, i.e. does not reduce the power of the model [2].
Given a gene tree G and species tree S, there is an infinite number of possible reconciliations. Discrete evolutionary models compare them by counting the number of events they respectively induce. As different types of event can have different expectancies (e.g. are thought to be more frequent than and ), reconciliation models allow for a specific cost to be given to each kind of event. The cost of a reconciliation is then the sum of the costs of the individual events it induces. In that setting, the parsimony approach is then to prefer a reconciliation of lower cost. This is formalized in the following definition.
Definition 2. Let us consider a gene tree G, a subdivision S^{′} of a species tree, and a reconciliation α between trees G and S^{′}. The cost of α is defined as
where δ, τ, and λ respectively denote the cost of , , and events, while d, t, and l denote the number of the corresponding events in α. Moreover, a event is atomic and costs (τ + λ), while a event just costs λ. Indeed, speciation events are most of the time considered as having a null cost, but the model easily accommodates for nonnull costs if necessary.
The optimal reconciliation cost is
over all reconciliations α between G and S^{′}.
Note that several distinct alternative reconciliations can have an optimal reconciliation cost.
Lemma 1 (Consecutive events)
Consider a gene tree G, the subdivision S^{′} of a species tree, and a reconciliation α of optimal cost C(G,S^{′}) = c(α). For any node u of G, if α_{i}(u) corresponds to a event, then α_{i + 1}(u) does not.
This results from the observation that two in a row can be replaced by single , leading to a reconciliation of lesser cost.
Finding a most parsimonious reconciliation
To find one of the most parsimonious reconciliations between a gene G and a species tree S we will rely on the dynamic programming algorithm of Doyon et al. [2] that computes the optimal reconciliation cost, C(G,S^{′}) on G and the subdivision S^{′} of S. This algorithm successively examines the nodes u of G and their possible mapping on nodes x of S^{′} (or equivalently on edges ending at such nodes). A node u of G can be mapped on such a vertex x according to different scenarios, each postulating a different event at node u among those of Definition 1. The optimal cost for mapping u at x is defined according to the scenario of minimal cost. For running time optimization reasons, the scenario involving a event, whose cost is denoted , is computed after the other possible scenarios, denoting the minimum cost that can be achieved among the latter. This decomposition is possible since a event is always followed by a , , , , , or event (see Lemma 1). As a result, the best receiver for a event of node u with donor branch x can be computed from the costs over all vertices y other than x such that . The cost are themselves computed from values but for children u_{i} of u (see below). These intricate notions are formally detailed in Definition 3
Definition 3 (Reconciliation cost matrix).
Consider a gene tree G and the subdivision S^{′} of a species tree S with a time function θ_{S′}. Let denote the cost matrix recursively defined as follows for a node u of G and a vertex x of S^{′}: and , where the costs for all events x are defined below
• , if u ∈ L(G), x ∈ L(S^{′}) and .
• c(u_{1},x_{2}) + c(u_{2},x_{1})}, if u ∉ L(G) and .
• + τ, with u ∉ L(G) and z (resp. y) denoting a vertex that minimizes c(u_{2},z) (resp. c(u_{1},y)) over all vertices such that .
• , where y denotes a vertex that minimizes over all vertices x^{′} ∈ V(S^{′}) ∖ {x} such that .
If the above constraints for an event on node u and vertex x are not respected, the corresponding cost is set to ∞.
The value c(u,x) is the optimal cost when mapping gene node u to node x in S^{′}. The optimal cost for reconciling G with S^{′}, denoted C(G,S^{′}), is then .
The algorithm of Doyon et al. [2], called Mowgli, fills the dynamic programming cost matrix by two embedded loops: one loop visits all species nodes of S^{′} in time order (e.g. according to the partial order, while the other loop visits nodes of the gene tree G in postorder. Due to an optimization in precomputing the best receiver edge for transfer events of nodes u at a given time, this algorithm has O(S^{2}.G) time complexity.
The problem considered in this paper is the following:
MOST PARSIMONIOUS RECONCILIATION GENE TREE (MPRGT)
INPUT:
• A dated species tree S with a time function θ_{S}
• a gene tree G with supports on its edges and whose leaves are associated to leaves of S
• costs δ, τ, resp. λ for , , resp. and
• a threshold t.
OUTPUT: a gene tree G^{′} such that both and , and such that C(G^{∗},S^{′}) is minimum among all such trees.
Algorithm
We describe here a heuristic for the MPRGT problem that relies on a hillclimbing strategy to seek a (rooted) gene tree G of minimum reconciliation cost (see Definition 3) using NNI edit operations [33].
Performing an NNI operation around an internal edge (w,v) means swapping the position of one of the two subtrees connected to v with that of the subtree connected to the sibling of v. Given an initial gene tree G and an edge of G, two “alternative” trees can be obtained from G by performing an NNI operation (see Figure 4). The hillclimbing proceeds as follows: (1) select a weak edge of G; (2) compute the reconciliation cost for the two alternative gene trees obtained by NNI on that edge; (3) if none of these trees decreases the reconciliation cost, then try another weak edge; if none of the weak edges allows to progress, then G is a local minimum and the hill climbing stops; (4) otherwise one of the alternative gene trees leads to a decrease in reconciliation cost, and the above process continues with the alternative tree of minimum reconciliation cost. MowgliNNI outputs the final binary rearrangement along with its most parsimonious reconciliation. In the worst cases, MowgliNNI examines all unreliable edges and does not find any better binary rearrangement of the given gene tree G since the topology G is already (locally) optimal. The whole scheme of MowgliNNI is described in Figure 5.
Figure 4. A gene tree G with a weak edge(w,v) selected for an NNI.v is connected to two subtrees G_{c} and G_{d}, while w is connected to v and to the subtree G_{b}. Performing an NNI operation around (w,v) means interchanging subtree G_{b} with either G_{c} or G_{d}, leading to trees and respectively.
Figure 5. Algorithmic scheme of MowgliNNI^{n} (nonoptimized version of the method).
Consider now the time complexity of MowgliNNI. Identifying the weak edges is done in O(G) and generating the two alternative gene trees for a NNI operation is done in constant time. Hence, the complexity bottleneck of MowgliNNI is the number of times (denoted N) the Θ(S^{2} · G)Mowgli algorithm is called. Overall, the time complexity of MowgliNNI is Θ(S^{2} · G · N). The next section describes how we can avoid recomputing large parts of the cost matrix, and hence greatly reduce the running time of MowgliNNI.
Combinatorial optimization
We now present results that take advantage of the way the dynamic programming matrix is computed (Definition 3) to avoid recomputing from scratch the cost matrix associated to a gene tree G^{′} obtained by an NNI edit operation from a gene tree G. Consider the gene tree G of Figure 4, the NNI operation applied on edge (w,v) that swaps the two subtrees G_{b} and G_{c}, and the resulting gene tree denoted G^{′}. We can observe that despite the global architecture of G and G^{′} differs, the local architectures of subtrees , remain unchanged. Hence, any cost that differs between the matrices C(G,S^{′}) and C(G^{′},S^{′}) (see Definition 3) is located in a column (i.e. node of the gene tree) associated to an ancestor of v (including v itself). For each of those nodes, there are two cases: (i) the node belongs to the NNI edge and its two children have subtree that have been modified (e.g. nodes w and v); (ii) the node is a strict ancestor of the NNI edge (w,v) and has exactly one child with a subtree that has been modified (e.g. g_{k},…,g_{0}).
Lemma 2 below indicates which columns of the cost matrix don’t need to be recomputed.
Lemma 2. Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w,v) of G, and the gene tree G^{′} obtained from G by an NNI operation on (w,v). For each node z of G that is not ancestor of v in G and for each vertex x of S^{′}, then c(z,x) = c^{′}(z,x) holds.
This observation results from the fact that the dynamic algorithm of Mowgli computes the value of a cell (z,x) in the cost matrix using cells storing values either for the same node z or for its children (see formulas of Definition 3). Hence the value of a cell (z,x) directly or indirectly depends only on values for cells corresponding to z and its descendants. Going from gene tree G to G^{′} by an NNI operation, precisely changes the descendant relationships of v and its ancestors, i.e. all other nodes z have the same descendants in both G and G^{′} (see Figure 4), hence c(z,x) = c^{′}(z,x) holds for all these nodes.
Unfortunately, there is no extension of Lemma 2 to ensure that when an edge has already been unsuccessfully tried for an NNI, it is useless to reconsider it later, even if it is a descendant in G of the edge leading to the last successful NNI.
Theorem 1. Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w,v) of G, a gene tree G^{′} obtained by an NNI operation on (w,v), and any strict ancestor u of w in G where the unique child of u that is an ancestor of w is u_{1} w.l.o.g. (i.e. w ≤ u_{1} in both G and G’). If c(u_{1},x) ≤ c^{′}(u_{1},x) holds for all x ∈ V(S^{′}), then c(u,x) ≤ c^{′}(u,x) holds for all x ∈ V(S^{′}), and as a corollary C(G,S^{′}) ≤ C(G^{′},S^{′}).
The proof of Theorem 1 is described in Appendix. This theorem leads to the optimized algorithm of MowgliNNI, formally stated in Algorithm 1 as an integrated procedure run after Mowgli. The later computes a dynamic programming matrix c :V(G) → V(S^{′}) that MowgliNNI then partly recomputes given a rearrangement performed on the gene tree G. For each rearrangement, the matrix recomputed by MowgliNNI, denoted c^{′} :V(G^{′}) → V(S^{′}), is obtained in worst case time O(S^{′} · h(G)), where h(G) is the height of G (i.e. the number of its ancestors)
Algorithm 1 MowgliNNI(G,c): seeking a gene treeG^{′} of minimum reconciliation cost, starting from a gene treeG and the precomputed matrix reconciliation cost, whereS^{′} is the subdivided species tree.
Theorem 2. MowgliNNI has worst case running time O(S^{2} · G + S^{2} · h(G)·N)
Indeed the steps of Algorithm 1 can be described as follows: initializing the reconciliation matrix for the initial gene tree is done in O(S^{2} · G) time; then updating the matrix for each of the N NNIs now only costs O(S^{′} · h(G)) = O(S^{2}·h(G)).
In MowgliNNI’s naïve implementation each rearrangement requires to recompute the cost associated to each and every node of the gene tree. In contrast, in the optimized version, an NNI around edge (w,v) is examined after updating only those costs associated to ancestral nodes of w. This has no impact on the worst case complexity (when the gene tree is a caterpillar h(G) is in O(G)) but significantly reduces the running times in practice since in most cases the number of nodes in G is much larger than their average height. For some random tree models the average height of a node in an nleaf tree is indeed proportional to log(n) [34].
Results and discussion
Experiments on simulated datasets
Simulated gene trees and evolutionary histories
A phylogeny of 37 proteobacteria was used as a reference species tree (denoted S) [8]. Along this tree, we simulated the evolutionary history (denoted R_{True}) of 1000 gene families (G_{True}), each containing from 10 to 100 genes, according to a birth and death process [35]. Birth events can be one of three kinds of evolutionary events, i.e. speciation, duplication, and horizontal gene transfer. During the simulation process along the species tree, a speciation occurs every time a gene lineage reaches an internal node of the species tree, leading to a split in two gene lineages. A birth event happening strictly between two nodes of the species tree can only correspond to a gene duplication or a horizontal gene transfer event. A birth is decided to be duplication or a transfer according to the input rates of these events.
The death of a gene lineage corresponds to a loss event, which happens according to an input loss rate. The species tree was scaled to the height of 500 million years (Mya). The speciation rate is determined by the topology and the height of the species tree. Each of the 1000 gene families was generated with different event rates, the loss rate being randomly chosen in the range [0.0010–0.0018] events/gene per million year. The ratio between the sum of duplication and transfer rates and the loss rate was randomly chosen in the range [0.5  1.1], the duplication rate being [70%  100%] of the mentionned sum. This birth and death process first output a complete gene tree G^{o}, then the “true” gene tree G_{True} was obtained from G^{o} by pruning extinct subtrees. The “true” evolutionary events to be recovered by the reconciliation programs are those appearing in G_{True}. We denote R_{True} the history composed by these events. We only considered gene families containing at most ten duplication and transfer events in their true evolution. In particular for the transfer events, this constraint allowed us to limit the number of cases where the true evolution contains a sequence of consecutive transfers where nontransferred genes are lost (i.e. a sequence of events). Such a piece of history can hardly be recovered by reconciliation methods as it left no trace at all in the gene tree.
Starting from G_{True}, a further step of the simulation protocol allows to obtain estimates of both this gene tree and the events composing its history (see Figure 6).
Figure 6. The simulation protocol to obtain inferred gene trees from sequences derived from a true gene tree. The module “Deviation from ultrametricity” is taken from the program of Galtier that converts the edge lengths of ultrametric trees from the time unit into substitution numbers [36]. G_{ML} denotes the initial gene tree inferred by Maximum Likelihood from the simulated sequences; Rec_{ML}, resp. Rec_{R}, denotes the reconciliation between G_{ML} and the reference species tree predicted by Mowgli[2], resp. RangerDTLD [3]; G_{NNI} is the alternative gene tree found by MowgliNNI and Rec_{NNI} is the reconciliation between G_{NNI} and the species tree.
The length of the edges in G_{True} were first converted from duration to number of substitutions per site by a module taken from [36]. Next, we simulated the evolution of 1500  3000 bp DNA sequences along this tree under the GTR model, thanks to the SeqGen program [37]. The alignment, composed of one sequence per extant gene, was then given as input to RAxML [38] to obtain a maximum likelihood estimate of the gene tree, denoted G_{ML} (also called initial gene tree below). Mowgli[2] and RangerDTLD [3] were then used to infer a most parsimonious evolutionary history R_{ML}, resp. R_{R}, between this initial gene tree and the reference species tree S. Separately, MowgliNNI was used to search for an alternative gene tree topology (G_{NNI}) of lower reconciliation cost, along with one of its most parsimonious evolutionary history (R_{NNI}). The elementary cost considered for each event kind (with being, or) was computed as follows:
where stands for the number of events of the corresponding kind in R_{True}.
Measuring the accuracy
First, we estimated the improvement in the accuracy of the gene tree’s topology, as measured by the RobinsonFoulds (RF) distance [39] between the true gene tree (G_{True}) and the inferred gene tree (G_{ML}). As a second measurement of the accuracy of inferred reconciliations we compared the positions of, and events predicted by MowgliNNI and Mowgli with those present in the true history. This is achieved by studying the proportion of true positive (TP), false positive (FP) and false negative (FN) separately for duplications, transfers and losses [2]. True negatives (TN) were not studied as their number is considerably large (if even finite) and hard to determine. An event of R_{True} is declared as correctly predicted when it concerns the right part of the gene tree (node or edge) placed in the correct branch or node of the species tree (see [2] for more details). Incidentally, both the receiver and the donor edge of the species tree have to be correctly indicated for a predicted transfer event to be declared as correct.
MowgliNNI provides more accurate inferences
We explored the ability of MowgliNNI to improve the set of G_{ML} trees using six different bootstrap values as threshold for defining weak edges, i.e. 20, 40, 60, 80, 90, and 95. The G_{ML} trees were inferred from relatively long sequences, they thus contained a large proportion of high bootstrap values, e.g. more than 63% edges had a bootstrap value ≥ 80. Though this left only a moderate number of edges in each gene tree to be considered by MowgliNNI for rearrangement, the method was still able to improve their quality (see below).
Mowgli and RangerDTLD showed a similar accuracy in inferring duplications and transfers (Figure 7), though RangerDTLD proposed reconciliations with higher costs in 13% of the cases. As both methods reconciled the same trees and used the same elementary costs for the events, one might wonder why they did not always obtain the same reconciliation costs. This results from different factors among which the most important is that RangerDTLD relies on a less general reconciliation model than Mowgli (e.g. not ensuring time consistency and not allowing gene loss in the donor branch after a transfer), but which on the other hand allows it to run at greater speed. As Mowgli and RangerDTLD performed similarly, in the following we just report results obtained with Mowgli.
Figure 7. Impact of MowgliNNI on the proportion of True Positive events. The accuracy of Mowgli, RangerDTLD and MowgliNNI (threshold=80) in inferring duplications and transfers, where TP^{DT} (resp. FP^{DT}, FN^{DT}) denotes the true positive (resp. false positive, false negative) of duplications and transfers predicted.
MowgliNNI progressively reduced the number of predicted duplications, transfers and losses as the threshold increased. At threshold 0 (where MowgliNNI = Mowgli, 5510 duplications, 2494 transfers and 12190 losses were predicted on the whole dataset; going to threshold 80, these numbers dropped to 4602 duplications, 1676 transfers and 8133 losses, i.e. values that are much closer to the 4535 duplications and 8260 losses contained in the true reconciliations.
Figure 8(a) shows that, no matter the threshold value, the false positive (FP) of MowgliNNI are always less than that of Mowgli both in terms of RF distance and evolutionary events (duplications, transfers and losses). This means that the G_{NNI} trees are closer to the true ones than the initial G_{ML} trees inferred from sequences only. Similarly, the evolutionary events inferred from the G_{NNI} trees are more accurate. As the threshold increases from 0 to 80, R_{NNI} contains less and less FP events, hence widening the gap in accuracy between Mowgli and MowgliNNI. As increasing the threshold results in reconsidering a larger number of G_{ML} edges for NNI operations, this means that the species tree examined through the reconciliation lens is a good guide tree for correcting wrong edges of the gene trees.
Figure 8. Impact of MowgliNNI on the proportion of False Positive events.(a) Average false positive error reduction achieved by NNI trees (FP_{NNI}) w.r.t. that of the initial gene trees (FP_{ML}). The positive value indicate that the NNI tree has on average less errors than the initial tree. (b) Average FP of the NNI trees – note that values at threshold 0 correspond to the FP of Mowgli.
The average number of false positive events of the R_{NNI} reconciliations decreases as the threshold increases (Figure 8(b)). However, as in Doyon et al. [2], the average number of FP transfers is quite high compared to that of duplications and losses. This can be explained by several reasons. First, a transfer is judged incorrect as soon as (i) it does not depart or end in the same edges of the species tree as the corresponding true transfer, or (ii) it does not concern the same edge in the gene tree. Overall, there is an additional constraint w.r.t. duplications and loss events, leading on average to more incorrect events. This point is all the more sensitive that several most parsimonious reconciliations (MPR) are obtained in a number of cases, while we just accounted for one of them for each gene family. Hence, event error rates we report are pessimistic, and especially for transfers due to the stringent conditions for judging a transfer as correct. Note that the multiplicity of MPRs does not affect the RF error terms. Last, incorrect gene trees lead to incorrect event inferences, but the latter are very sensitive to only small errors in gene trees. The event FP error grows almost exponentially when the RF distance between the initial and the true tree increases from 0 to 10% (Figure 9). Figures 8 and 10 show that transfers are more influenced by this factor, as a result of more stringent conditions for being correct.
Figure 9. Impact of gene tree errors on reconciliation accuracy. The false positive (FP^{DTL}) error rate of the events predicted by reconciliation methods grows exponentially with respect to the Robinson Foulds distance between the initial and true tree.
Figure 10. Impact of MowgliNNI on the proportion of False Negative events.(a) The false negative reduction of NNI gene trees (FN_{NNI}) in comparison to the initial gene trees (FN_{ML}). While reducing the number of wrong events predicted, MowgliNNI mostly does not remove the events that have been correctly predicted. (b) FN of the NNI trees. FN and FP of Robinson Foulds distance are the same since the true, initial and NNI gene trees are binary.
Influence of the sequence length parameter
MowgliNNI achieved a higher improvement over Mowgli on the subset of 327 G_{ML} families inferred from short sequences (length 1500–2000 bp) than on the subset of 336 families inferred from long sequences (length 2500–3000 bp), see Table 1. For instance, at threshold 80, MowgliNNI was able to proposed a modified gene tree (G_{NNI}) for up to 83%, resp. 92%, of the families containing weak edges when G_{ML} was inferred from long, resp. short, sequences. Similar results were observed for the quality of modified gene trees(G_{NNI}) in term of RF distance to G_{True} and for the quality of reconciliations in term of event distance between inferred and true reconciliation. The fact that higher improvements are obtained for shorter sequences was confirmed through the simulation of 1000 G_{ML} families inferred from much shorter sequences (400 bp), where still a higher improvements where obtained (see Table 2).
Table 1. Quality of the gene trees (G_{NNI}) and reconciliations (R_{NNI}) inferred by MowgliNNI depending on the length of the sequences used to obtainG_{ML }trees and on the threshold indicating weak edges
Table 2. Quality of the gene trees (G_{NNI}) and reconciliations (R_{NNI}) inferred by MowgliNNI on very short sequences
Robustness of reconciliations to imprecision in the event costs
In order to measure the dependance of MowgliNNI on the precise costs used for each kind of event, we ran the method with the threshold t = 80 on G_{ML} trees with costs varying up to 10%, 20%, then 50% w.r.t. those used derived from R_{True} using Formula (1). The paired ttest for RF distances shows that the G_{NNI} trees obtained with the new noisy costs are not significantly different from those obtained with the former costs (pvalue=0.1747, 0.1758, 0.1144 respectively). The accuracy of inferred events also does not change much. Transfers have the highest variation with 4.2% (resp. 3.1%) increase in FP (resp. FN) when the event costs vary up to 50% (Table 3). Thus, MowgliNNI is quite robust to changes in the event costs.
Table 3. The robustness of MowgliNNI to changes in event costs with respect to the initial ones computed by Formula (1) (Column 1)
Room for future improvement
To measure how much of the achievable improvement over the G_{ML} trees was realized by MowgliNNI, we studied the distribution of reconciliation costs of all possible gene trees for several cases involving a computationally manageable number of species. The shape of the distribution together with the relative position of the costs obtained for G_{True}, G_{ML} and G_{NNI} within those distributions gives information on how much improvement could be achieved in the future by more sophisticated methods (e.g., relying on SPR moves). We report here on two cases representative of our observations: two true gene trees of 8 taxa were generated from the species tree S of 37 proteobacteria according to the protocol described in Figure 6. Their two associated histories A and B were used as starting points to obtain both sequence alignments and reconciliations costs (according to Equation 1). This time, 50 sequence alignments were generated from each of the two gene trees. A maximum likelihood tree was obtained from each of the 100 alignments, with bootstrap supports associated to its edges. These trees were then submitted for improvement to MowgliNNI, applying a threshold 50 to specify weak edges, and relying on event costs corresponding to histories A and B respectively. All reconciliations were performed with respect to the species tree S.
Figure 11 shows the distributions of reconciliation costs C(S,G) obtained for all possible binary trees G having the same leaves as and respectively. The first observation is that though the same species tree was used in both cases, these distributions vary significantly in range and shape depending on the gene tree leaf set. In the case of history B, the number of trees with reconciliation costs falling in a given range varies sporadically, whereas the number of trees in a given range almost follows a normal (or beta) distribution for history A.
Figure 11. Distributions of reconciliation costsC(S,G) over all possible binary gene treesG for two sets of 8 taxa from the phylogenyS of 37 proteobacteria, obtained by generating two simulated histories A and B alongS. For each distribution, we indicate the position of C(G_{True},S), the reconciliation cost obtained by Mowgli for the true gene tree of the corresponding history. The plot also indicates the average cost C(S,G_{ML}) obtained for reconciliations from maximum likelihood trees and the average cost C(S,G_{NNI}) obtained for reconciliations of MowgliNNI trees obtained from the maximum likelihood trees.
History A involved 2 duplications, no transfer and 7 losses and was correctly recovered by the parsimonious reconciliation of Mowgli from. However, History B (involving 2 duplications, 1 transfer and 5 losses) was incorrectly recovered from, the achieved cost being less than the 5.75 cost for the real history. Though the real cost is in the left part of the distribution, it is not the minimum point of the distribution, showing that parsimony can sometimes be misleading when followed to its extreme.
Nevertheless, in both cases, the true gene tree is among the ones having the minimum reconciliation costs: it is precisely the one leading to the minimum cost for history A and among the nine best trees for history B. On these examples (and other cases not shown), parsimony can be considered as a very good guide towards the correct gene tree, even if the reconciliation from this correct tree can underestimate the number of real events (as discussed above).
For both histories A and B, MowgliNNI proposed a gene tree G_{NNI} whose reconciliation cost was on average closer from that of the true gene tree – and from the real cost – than the cost obtained from the maximum likelihood tree.
Conclusion on simulated datasets
In summary, MowgliNNI successfully uses the reconciliation cost as additional information to resolve the uncertain parts of gene trees inferred from sequences only. Though the gene tree resolutions are partly guided by reconciliations with the species tree, they are not attracted away from the true gene trees, but are closer to them than the initial gene trees. As a result, MowgliNNI infers gene events more accurately, which is of prior importance to distinguish orthologs from paralogs and xenologs [14].
Experiments on real data
As species tree S, we chose a phylogeny covering 336 genomes of Bacteria and Archaea recently inferred by Abby et al. [30].
Then, a dataset of 29,709 homologous gene families spanning these taxa was collected from the HOGENOM database (release 04) [40]. Each such family contains from 3 to 312 taxa. The gene tree of each family from this dataset was reconciled with the species tree by Mowgli and MowgliNNI using costs τ = 3, δ = 3.5, resp. λ = 1 for transfers, duplications, resp. losses. These costs were estimated on the basis of several bacteria phyla by a maximum likelihood method [18,41]. A threshold of 50% for branch support values was used to indicate to MowgliNNI the weak edges in the gene trees.
A decrease in the number of inferred events and reconciliation costs
MowgliNNI allowed to change the gene tree, hence to lower the reconciliation cost, in 24% of the ≈30,000 families. This gain is nonnegligible and has a real importance as changing the gene tree topology has an important impact on the inferred events (as already shown on simulated datasets and discussed below). In turn, these inferred events may serve to predict the function of new sequences on the basis of their orthology relationships with annotated sequences, orthology following from the chosen reconciliation. Among previous reconciliation studies that allowed to modify the gene trees, BerglundSonnhammer et al. report that 10% of their families were improved [21] when allowing rearrangements on weak edges under the DL model, while Chaudhary et al. improved all their gene trees in a pure D model when rearranging gene trees with Subtree Prune and Regraft (SPR) operations [24]. Note that the heterogeneity of models and datasets used in these studies limit the comparison of their results, but we cite them for completeness.
For gene families with a lower reconciliation cost (24% of all families), we counted the number of events of each kind () inferred by Mowgli and MowgliNNI. As a rule, MowgliNNI led to a decrease in the number of events in inferred evolutionary histories. In particular, the number of transfers is reduced in 88.3% of these gene families, the number of losses being reduced in 59.9%, while the number of duplications is almost the same (decrease in 5.2% of the families). These results obtained in the DTL model echo those of Durand et al. reporting that in the DL model gene tree rearrangements substantially reduce the number of events needed to explain the data [19]. The differences in reductions we observed among the kind of events can be explained by the costs – estimated from [18,41] – that we used for the events (τ = 3,δ = 3.5,λ = 1). Given those costs, it is usually more parsimonious to explain the conflicts between a gene and the species tree by a combination of and rather than a combination of, and. Thus, when MowgliNNI infers a gene tree closer to the species tree, it mostly removes the need for artificial transfers (and losses to a lesser extent), while not altering that much the number of duplications.
To give a precise example of how MowgliNNI proposes a more parsimonious reconciliation on a modified gene tree than Mowgli on the initial gene tree, Figure 12 details the particular case of family HBG040981, (a putative tocopherol cyclase). In this case, MowgliNNI proposes a gene history resorting to less events by rearranging an edge with a very small support. The gene tree modified by MowgliNNI leads to a reconciliation having one transfer and one loss less than the reconciliation performed from the initial gene tree. On the whole, the reconciliation cost goes from 9.5 down to 5.5.
Figure 12. Example of reconciliation withMowgli (top) and MowgliNNI (bottom).A and D are the gene trees. The NNI rearrangement around the green bold edge in A exchanging the two subtrees in green circles results in D. B and E are the reconciled gene trees showing the duplications (blue squares), losses (red crosses) and the transferred subtree (purple). C, resp. F, is the species tree with the events inferred by Mowgli, resp. MowgliNNI, mapped onto the appropriate edges (a purple arrow shows the origin and destination of the transfer). NOSTO1  Nostoc sp. PCC 7120; ANVAR1  Anabaena variabilis ATCC 29413; TRERY1  Trichodesmium erythraeum IMS101; SYNEC4  Synechococcus sp. JA2–3B’a(2–13); SYNEC5  Synechococcus sp. JA3–3Ab; SYNEY1  Synechocystis sp. PCC 6803; GLVIO1  Gloeobacter violaceus PCC 7421; THELO1  Thermosynechococcus elongatus BP1; PRMAR1  Prochlorococcus marinus str. MIT 9312.
A decrease in the number of equally most parsimonious reconciliations
In addition to reductions in number of events and hence reconciliation cost, the modified gene trees proposed by MowgliNNI usually reduced the number of alternative MPRs, i.e. equally most parsimonious histories. On a random sample of two dozens modified gene trees, the number of MPRs is reduced in 63% of the cases (by a factor of 18 in the best case), and increased in 21% (by a factor 3 at worst). This echoes similar observations done by other authors.
The improvement in running time due to the optimized version of MowgliNNI
We measured the running time of Mowgli and that of both the naive and optimized versions of MowgliNNI (see Methods), respectively called MowgliNNI^{n} and MowgliNNI. From the ≈30k families of our dataset, we extracted a random sample of 100 families uniformly spanning from 10 to 80 taxa, and respecting the previously observed proportion of improving / nonimproving cases in the reconciliation cost (i.e., 24% and 76% resp.). This sample was used to run the three programs. Figure 13 reports the ratios of MowgliNNI^{n} and MowgliNNI running times over those of Mowgli, with respect to the number of weak edges within the input gene tree. Indeed, for each gene tree, the number of tested rearrangements during the optimization search depends on the number of weak edges, which hence strongly impacts MowgliNNI^{n} and MowgliNNI running times. Note that the number of weak edges often represents between 10% and 20% of the gene tree edges, but can go up to 70%.
Figure 13. Compared running times of reconciliation methods. Two sets of values are plotted: each blue dot, resp. red triangle, corresponds to the ratio between the running time of MowgliNNI^{n}, resp. MowgliNNI and that of Mowgli on a same gene tree, depending on the number of weak edges. Regression lines are provided for both dot sets.
Figure 13 shows that MowgliNNI is 20 (resp. 50 and 80) times faster than MowgliNNI^{n}, when facing 1–20 (resp. 20–40 and 40–60) weak edges. This shows that the combinatorial optimization proposed in the Methods section is crucial in practice.
Now, when compared to Mowgli, the rearrangements tried by MowgliNNI^{n} on weak edges to obtain a better gene tree are done at the price of a relatively small computation time overcost. We also indicate the regression line of MowgliNNI running times with respect to those of Mowgli, plotted against the number of weak edges. Its slope is only 0.01186, meaning that MowgliNNI (the optimized version) is able to take into account the gene tree uncertainties with just a slight increase in the running time.
Transfers in prokaryotic phyla
On our whole dataset of 29,709 homologous gene families, we particularly studied transfers in 5 bacterial and 1 archaeal phyla: Proteobacteria (169 genomes), Actinobacteria (31 genomes), Cyanobacteria (14 genomes), Chlamydiae (7 genomes), Spirochaetes (7 genomes) and Crenarchaeota (10 genomes). We compared our results obtained with Mowgli and MowgliNNI to those of Abby et al. [30] obtained with the Prunier method [29] that infers transfers in monocopy gene families on another basis than reconciliation.
In order to compare our results to the Abby et al. study, we extracted particular families from HOGENOM v4. For each of the 6 phyla of interest, we collected the list of families having at most one copy of the gene for the genomes of this phylum and separated them into two groups: families having one copy of the gene for each genome of the phylum, socalled “universal families”, and families having a copy of the gene for at least 7 genomes of the phylum, socalled “nonuniversal families”.
For each phylum, we computed the number of intraphylum transfers inferred by reconciliations of Mowgli and MowgliNNI for families of the two groups (universal and nonuniversal). As the number of families we found in several groups among the various phyla varied slightly from those reported by Abby et al. [30] we summarized the findings of both studies in terms of transfer rates, expressed in number of transfers per million year and per family.
Figures 14 and 15 display transfer rates of universal, resp. nonuniversal families, for the studied phyla ordered by increasing transfer rate. We note that the obtained order on the phyla depends on the profile of the studied families, as in Abby et al. [30]. For both the universal and the nonuniversal families, the transfer rates obtained through Mowgli and MowgliNNI follow the same trend as that prediced by the Prunier method, i.e. the phyla are ordered in the same way, from the Spirochaetes up to the Actinobacteria in the case of universal families and from the Proteobacteria up to the Crenarchaeota in the case of nonuniversal families.
Figure 14. Impact of MowgliNNI on transfer rate inferred on a real dataset of “universal families”. Comparison of Mowgli, MowgliNNI and Prunier [29] on the basis of their transfer rate per million year per gene family, for prokaryotic phyla having monocopy universal families (i.e. families having one copy of the gene for each genome of the considered phylum). No monocopy universal family was found for Proteobacteria.
Figure 15. Impact of MowgliNNI on transfer rate inferred on a real dataset of “nonuniversal families”. Comparison of Mowgli, MowgliNNI and Prunier [29] on the basis of their transfer rate per million year per gene family, for prokaryotic phyla having monocopy nonuniversal families (i.e. families having at most one copy of the gene for the genomes of a considered phylum, and covering at least 7 genomes of this phylum). No monocopy nonuniversal families was found for the Chlamydiae and Spirochaetes phyla.
Finally, as expected, MowgliNNI reduced the number of inferred transfers compared to Mowgli, leading to transfer rates closer to that inferred by Prunier.
Conclusion
We introduce the MowgliNNI heuristic method relying on NNI rearrangements of the uncertain parts of the gene trees to solve a parsimony optimization problem for reconciliations accounting for duplications (), losses () and transfers (). We show experimental evidence that reconciliations computed under the parsimony criterion can efficiently correct erroneous parts of gene trees inferred from sequence data. On simulated data, MowgliNNI often proposes a new gene tree topology that is closer to the correct one and that also leads to better, and predictions. Moreover, the number of events and the number of most parsimonious reconciliations predicted by MowgliNNI are significantly lower than those obtained without questioning the gene tree topology. This is confirmed on real data. A critical point for parsimony methods is the choice of respective costs for the considered evolutionary events. We show here that MowgliNNI’s performance is only slightly altered when changing the costs given to the individual events (, and), that is, the method is robust to cost misspecification.
Appendix
Proof of Theorem 1
Theorem 1. Consider a gene tree G, the subdivision S^{′} of a species tree S, an edge (w, v) of G, a gene tree G^{′} obtained by an NNI operation on (w,v), and any strict ancestor u of w in G where the unique child of u that is an ancestor of w is u_{1} w.l.o.g. (i.e. w ≤ u_{1} in both G and G’). If c(u_{1} ,x) ≤ ;c^{′} (u_{1},x) holds for all x ∈ V(S^{′}), then c (u,x) ≤ c^{′} (u,x) holds for all x ∈ V(S^{′}), and as a corollary C(G,S^{′}) ≤ C(G^{′},S^{′}).
Proof. First remark that an NNI operation performed around the edge (w,v) of G to obtain a modified tree G^{′} does not alter the order of the nodes above v, which are then considered below indifferently of the tree they belong.
The proof is done with a recurrence over increasing time t ∈ {0,1,…,h(r(S^{′}))} for the subset of nodes V_{t}(S^{′}) ⊂ V(S^{′}). Recall that, in S^{′} the height of a node u (denoted h(u)) is a valid time function (see Figure 1) and that u_{1} is the child of u that is an ancestor of w (whereas u_{2} is incomparable with w). □
Base case
For time t = 0, the possible events for the internal node u and any leaf x ∈ V_{0}(S^{′}) are,, and (see the reconciliation model of Definition 1).
For each event, (resp.) depends on the costs c(u_{i},y) (resp. c^{′}(u_{i},y)) over all children u_{i} ∈ {u_{1},u_{2}} and vertices y ∈ V_{t}(S^{′}) (see Definition 3). Since u_{2} (resp. u_{1}) is incomparable to (resp. an ancestor of) w, Lemma 2 implies that c(u_{2},y) = c^{′} (u_{2},y) and the assumption states that c(u_{1},y) ≤ c^{′}(u_{1},y).
That all costs used in the equation of are not lower than the corresponding costs in that of leads to properties listed in the following remark.
Remark 1. The following properties hold for all internal nodes u ∈ V(G)∖L(G).
1. For all events and leaves x∈V_{0}(S^{′}), holds.
2. For all leaves leaves x ∈ V_{0}(S^{′}),.
3. , since/// are impossible events at height 0.
For a event of node u on a leaf x ∈ V_{0}(S^{′}), we have the following:
Hence, we have the following result:
Remark 2. For all internal nodes u ∈ V(G) ∖ L(G) and leaves x ∈ V_{0}(S^{′}), holds.
And we obtain the following:
Therefore, c(u,x) ≤ c^{′}(u,x) holds for each leaf x ∈ V_{0}(S^{′}).
Inductive step
For a height 0 ≤ t < h(S), we now suppose that the expected property c(u,y) ≤ c^{′}(u,y) holds for all vertices y ∈ V_{t}(S^{′}) and prove that it still holds for any vertex x∈V_{t+1}(S).,,,,, and are the possible events for node u and vertex x. Following exactly the same arguments as in the base case, Remark 1 ( and) and Remark 2 () still hold for the current time (t+1).
The dependencies of the corresponding cost for,, and events are as follows: depends on the costs c(u_{i},x_{i}) for u_{i} ∈ {u_{1},u_{2}} and x_{i} ∈ {x_{1},x_{2}}, with x_{i} ∈ V_{t}(S^{′}); on c(u,x_{1}), with x_{1} ∈ V_{t}(S^{′}); and on c(u,x_{i}) for x_{i} ∈ {x_{1},x_{2}}, with x_{i} ∈ V_{t}(S^{′}). The same dependencies apply for,, and. Recall that u_{2} (resp. u_{1}) is incomparable to (resp. an ancestor of) w and that Lemma 2 (resp. the assumption) implies that c(u_{2},x_{i}) = c^{′}(u_{2},x_{i}) (resp. c(u_{1},x_{i}) ≤ c^{′}(u_{1},x_{i})) for each x_{i} ∈ {x_{1},x_{2}}. Moreover, the inductive hypothesis states that c(u,x_{i}) ≤ c^{′}(u,x_{i}) holds for each child x_{i} of x since x_{i} ∈ V_{t}(S^{′}). For each event, that all costs used in the equation of are not lower than the corresponding costs used in the equation of leads to the following result:
For all events, internal nodes u ∈ V(G)∖L(G) and internal vertices x ∈ V_{t+1}(S^{′}), holds.
We obtain the following:
Therefore, c(u,x) ≤ c^{′}(u,x) holds for each vertex x ∈ V_{t+1}(S^{′}), and thus for all vertices of S^{′}.
As a corollary, the same inequality holds between the root nodes r of G and G^{′}, since w ≤ r. Then C(G,S^{′}) ≤ C(G^{′},S^{′}).
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
THN, VR, JPD and VB designed the algorithm and the simulation procedure. THN implemented the program and conducted the experiments on the simulated datasets. SP, AAC and VB carried out the experiments on the real dataset. THN, VR, JPD, AAC and VB wrote the paper. All authors read and approved the final version of this manuscript.
Acknowledgements
We thank Gergely J. Szöllősi for providing event costs of the real dataset. This work was funded by the french Agence Nationale de la RechercheInvestissements d’avenir / Bioinformatique (ANR10BINF01–02, Ancestrome), Programme 6ème Extinction (ANR09PEXT000 PhyloSpace) and by the Institut de Biologie Computationnelle.
References

Dayhoff MO: The origin and evolution of protein superfamilies.
Fed Proc 1976, 35(10):21322138. PubMed Abstract

Doyon JP, Scornavacca C, Gorbunov KY, Szöllösi G, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers.

Bansal MS, Alm EJ, Kellis M: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer, and loss.
In Bioinformatics 2012, 28(12):i283i291. Publisher Full Text

Hallett M, Lagergren J, Tofigh A: Simultaneous identification of duplications and lateral transfers. In RECOMB ’04. Edited by Bourne PE, Gusfield D. New York: ACM; 2004:347356.

Górecki P: Reconciliation problems for duplication, loss and horizontal gene transfer. In RECOMB. Edited by Bourne PE, Gusfield D. New York, NY, USA: ACM; 2004:316325.

Conow C, Fielder D, Ovadia Y, LibeskindHadas R: Jane: a new tool for the cophylogeny reconstruction problem.
Algorithms Mol Biol 2010, 5:16. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tofigh A, Hallett M, Lagergren J: Simultaneous identification of duplications and lateral gene transfers.

David LA, Alm EJ: Rapid evolutionary innovation during an archaean genetic expansion.
Nature 2011, 469(7328):9396. PubMed Abstract  Publisher Full Text

Goodman M, Czelusniak J, Moore GW, Romero Herrera A, Matsuda G: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences.
Syst Zool 1979, 28:132163. Publisher Full Text

Page RD: Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny.
Mol Phylogenet Evol 2000, 14:89106. PubMed Abstract  Publisher Full Text

Nakhleh L, Warnow T, Linder CR: Reconstructing reticulate evolution in species: theory and practice. In Proceedings of the Eighth Annual International Conference on Resaerch in Computational Molecular Biology. RECOMB ’04. New York: ACM; 2004:337346.

Arvestad L, Lagergren J, Sennblad B: The gene evolution model and computing its associated probabilities.

Doyon JP, Ranwez V, Daubin V, Berry V: Models, algorithms and programs for phylogeny reconciliation.
Brief Bioinformatics 2011, 12(5):392400. PubMed Abstract  Publisher Full Text

Ovadia Y, Fielder D, Conow C, LibeskindHadas R: The cophylogeny reconstruction problem is NPcomplete.
Comp J Biol 2011, 18(1):5965. Publisher Full Text

LibeskindHadas R, Charleston MA: On the computational complexity of the reticulate cophylogeny reconstruction problem.

Tofigh A: Using Trees to Capture Reticulate Evolution, Lateral Gene Transfers and Cancer Progression. PhD thesis, Royal, KTH,. Sweden: Institute of Technology; 2009.

Szőllösi GJ, Daubin V: Modeling gene family evolution and reconciling phylogenetic discord.
Methods Mol Biol 2012, 856:2951. PubMed Abstract  Publisher Full Text

Durand D, Halldorsson BV, Vernot B: A hybrid micromacroevolutionary approach to gene tree reconstruction.
Comput J Biol 2006, 13(2):320335. Publisher Full Text

Hahn MW: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution.
Genome Biol 2007, 8(7):R141. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

BerglundSonnhammer AC, Steffansson P, Betts MJ, Liberles DA: Optimal gene trees from sequences and species trees using a soft interpretation of parsimony.
Mol J Evol 2006, 63(2):240250. Publisher Full Text

Chang W, Eulenstein O: Reconciling gene tree with apparent polytomies.

Vernot B, Stolzer M, Goldman A, Durand D: Reconciliation with nonbinary species trees.
Comput J Biol 2008, 15:9811006. Publisher Full Text

Chaudhary R, Burleigh JG, Eulenstein O: Algorithms for rapid error correction for the gene duplication problem. In Proceedings of the 7th International Conference on Bioinformatics Research and Applications. ISBRA’11. Berlin, Heidelberg: SpringerVerlag; 2011:227239.

Zheng Y, Wu T, Zhang L: Reconciliation of gene and species trees With Polytomies.
ArXiv 2012, 1201.3995v2.
[qbio.PE]

Lafond M, Krister Swenson M, ElMabrouk N: An optimal reconciliation algorithm for gene trees with polytomies. In WABI 2012, LNBI 7534. Edited by Tang J, Raphael B, Raphael B , Tang J . Berlin Heidelberg: SpringerVerlag; 2012:106122.

Stolzer M, Lai H, Xu M, Sathaye D, Vernot B, Durand D: Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees.
Bioinformatics 2012, 28:409415. Publisher Full Text

Górecki P, Eulenstein O: Algorithms: simultaneous errorcorrection and rooting for gene tree reconciliation and the gene duplication problem.
Bioinformatics, BMC, 2012, 13(Suppl 10):S14. BioMed Central Full Text

Abby S, Tannier E, Gouy M, Daubin V: Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests.
Bioinformatics, BMC 2010, 11:324. BioMed Central Full Text

Abby S, Tannier E, Gouy M, Daubin V: Lateral gene transfer as a support for the tree of life.
PNAS 2012, 109(13):49624967. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Semple C, Steel MA: Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. New York, USA: Oxford University Press; 2003.

Sanderson MJ: inferring absolute rates of evolution and divergence times in the absence of a molecular clock.
Bioinformatics 2003, 19:301302. PubMed Abstract  Publisher Full Text

Felsenstein J: Inferring Phylogenies. Sunderland: Sinauer Associates; 2004.

Knuth DE: The Art of Computer Programming. Redwood City: AddisonWesley Longman Publishing Co., Inc.; 1998.

Kendall DG: On the generalized birthanddeath process.
Ann Math Stat 1948, 19:115. Publisher Full Text

Galtier N: A model of horizontal gene transfer and the bacterial phylogeny problem.
Syst Biol 2007, 56:633642. PubMed Abstract  Publisher Full Text

Rambaut A, Grass NC: Seqgen: an application for the monte carlo simulation of dna sequence evolution along phylogenetic trees.
Bioinformatics 1997, 13(3):235238. Publisher Full Text

Stamatakis A: Raxmlvihpc: maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models.
Bioinformatics 2006, 22(21):26882690. PubMed Abstract  Publisher Full Text

Robinson DF, Foulds LR: Comparison of phylogenetic trees.
Math Biosci 1981, 53:131147. Publisher Full Text

Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M, Perriere G: Databases of homologous gene families for comparative genomics.

Szőllösi GJ, Boussau B, Tannier E, Daubin V: Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations.
PNAS 2012, 109(43):1751317518. PubMed Abstract  Publisher Full Text  PubMed Central Full Text