Abstract
Background
Distancebased phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method.
Results
This paper continues the line of research which attempts to adjust to each given set of input sequences a distance function which maximizes the expected topological accuracy of the reconstructed tree. We focus here on the effect of systematic error caused by assuming an inadequate model, but consider also the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the JukesCantor distance function when applied to data generated according to Kimura’s twoparameter model with a transitiontransversion bias. We provide both a theoretical derivation for this case, and a detailed simulation study on quartet trees.
Conclusions
We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enables statistically inconsistent reconstruction methods to outperform consistent methods.
Keywords:
Phylogenetic reconstructions; Substitution models; Additive substitution rate functionsIntroduction
Phylogenetic reconstruction is the task of determining the topology of an evolutionary
tree underlying a given set of samples (species) using sequence data extracted from
them. This is typically done by assuming some simplified model for DNA sequence evolution,
in most cases modeling it as a homogeneous continuoustime Markov process
[13]. Distancebased reconstruction algorithms tackle this task by first computing a set
of
Substitution models used for phylogenetic reconstruction range from the simplest JukesCantor (JC) model [4], through slightly more complex and flexible models, such as Kimura’s twoparameter (K2P) model [5] and the HasegawaKishinoYano model (HKY) [6], to the General TimeReversible (GTR) model [7,8]. In previous works [9,10] we observed that substitution models which are not too restrictive or too general have many inherently different additive SR functions. We used this basic observation to demonstrate that it is possible to adjust for each given set of DNA sequences a “good” additive SR function, which leads to significantly increased phylogenetic reconstruction accuracy, compared to other additive SR functions. This exploits our ability to predict the stochastic noise associated with each SR function. When the SR function used for distance estimation is additive for the underlying substitution model, this stochastic noise is the only cause for inaccurate reconstruction. However, in the scenario, which is very common in practice, where the SR function in use is not additive for the model, an additional systematic bias is introduced in the distance estimates. This systematic bias in distance estimation results in a phylogenetic reconstruction method that might be statistically inconsistent in some cases. In this paper^{a}, we extend our previous line of research to this scenario, by removing the constraint of additivity. We do this by considering both the stochastic noise and systematic error.
Several previous studies have demonstrated the utility of phylogenetic reconstruction methods that are not generally statistically consistent. The maximum parsimony method has been long known to be inconsistent in some cases [11,12]. However, in other cases it was shown to be more likely to produce accurate reconstructions, compared with the maximum likelihood method [1315]. More recently, it has been demonstrated that reconstruction accuracy can be improved by deliberately assuming an oversimplified substitution model, when reconstructing a tree using maximum likelihood [16,17]. In the context of distancebased reconstruction, nonadditive distance measures have been shown in several cases to lead to improved accuracy when compared with additive measures [18,19]. Overall, these studies provide convincing evidence for the need to consider inconsistent phylogenetic reconstruction methods. However, none of them provide a rigorous framework for characterizing the cases in which inconsistent methods outperform consistent ones.
In this paper we develop a theoretical framework which provides a practical and systematic way to quantify the effect of distanceestimationbias on the accuracy of distancebased reconstruction. This framework is based on a novel method for measuring the deviation from additivity of SR functions. Coupled with the results in [9], this method enables evaluation of both the systematic bias and stochastic noise of SR functions. Such evaluation is important, because there is often a tradeoff between these two sources of error, stemming from the fact that simpler models with fewer parameters (such as JC) have smaller stochastic noise at the expense of greater estimation bias. Our framework allows us to consider this tradeoff when deciding which SR function to use for a given data set. This allows us to characterize a wide range of cases in which an SR function associated with an oversimplified evolutionary model results in increased reconstruction accuracy.
This finding falls in line with previous studies demonstrating the usefulness of phylogenetic reconstruction methods that are not generally consistent. Previous studies have attributed the increased accuracy of inconsistent methods mainly to the fact that these methods have a bias toward reconstructing certain topologies, leading to increased accuracy in cases where the phylogeny being reconstructed has the “favored topology”. We notice a similar behavior using our theoretical characterization of nonadditive SR functions. However, somewhat surprisingly, we find that nonadditive SR functions often have an advantage even when the phylogeny being reconstructed has an “unfavorable topology”. This is due to the reduced stochastic noise of the nonadditive SR function (compared with it additive alternatives), which compensates for its topological bias.
Our paper is organized as follows. Section “Background” outlines some of the required background and introduces several new concepts that are central in our analysis. Section “Deviation from additivity in homogeneous substitution models” provides the main analytic results in the paper, and introduces deviation from additivity as a measure of distance estimation bias. In that section we prove a general upper bound for this deviation and establish a connection with reconstruction accuracy. We then study deviation from additivity and stochastic error of the JC distance formula when applied to data generated under the K2P model. In Section “Performance of Non affineadditive SR functions in quartet resolution” we study the effect of deviation from additivity and stochastic error on the accuracy of quartet reconstruction. In the case of quartets we can draw a tight connection between the different sources of error in distance estimation and inaccuracy of reconstruction. We present a useful heuristic, based on the socalled Fisher criterion ( [20,21]), for comparing the expected accuracy of two SR functions in this context. In Section “Simulations on Hasegawa’s Tree” we extend our study to larger trees using experiments on simulated data based on the tree obtained by Hasegawa in [6]. Finally, In Section “Inferring trees from genomic sequences” we demonstrate our approach through a series of experiments reconstructing trees from bacterial gene sequences.
Background
In this section we provide a brief exposition of DNA substitution models and substitution rate functions used for distance estimation. We concentrate on details essential to this study and refer the reader to a previous paper [9] and standard textbooks [1,2] for a more complete survey.
Substitution Models
In this work, a DNA substitution model
A common assumption made on the substitution process is that it is homogeneous throughout time. This means that all rate matrices in the model are proportional
to each other. Such a substitution model is thus termed homogeneous, and it is defined by a unit rate matrixR as follows:
We use the Kimura’s twoparameter (K2P) model
[5] as a concrete example for demonstrating our approach. A rate matrix in this model
is defined by two rate parameters: α, which is the rate of transition type (ti) substitutions (
Each unit rate matrix of the K2P model defines a homogeneous submodel, which is identified
by its unique transitiontransversion (titv) ratio
Transition matrices in the K2P model have the same symmetric structure as the underlying rate matrices, with two distinct transition parameters: p_{α} – the probability of a transitiontype substitution; p_{β}– the probability of a transversiontype substitution. The transformations between (α,β,t ) and (p_{α},p_{β}) are given by the following equations:
Substitution rate functions
A substitution rate (SR) function for a model
Definition 2.1 (Additive SR function)
An SR function Δ is said to be additive for a substitution model
It is often explicitly required that an SR function be additive for the assumed model (see [9]). The evolutionary time, t, typically serves as the standard additive measure in most common substitution models. Throughout this study we follow the special case of K2P, focusing on the two SR functions defined below.
The first SR function, Δ_{K2P}, is the common SR function suggested for the K2P model in [5], and it is clearly additive, as it maps the transition probabilities onto evolutionary time t . The second SR function, Δ_{JC}, maps the transition probabilities onto evolutionary time only in the special case of the JC model where α =β . Under other homogeneous submodels of K2P, it is nonadditive. This nonadditivity is analyzed in details in section Deviation from additivity in homogeneous substitution models.
Additive metrics, Affineadditive mappings, and Nearadditivity
The core idea behind distancebased phylogenetic reconstruction is that a phylogenetic tree t can be accurately and efficiently reconstructed from pairwise distances which are additive with respect to t[23,24].
Definition 2.2 (Additive metric)
A metric d defined over the leafset L (t ) of a tree t is t additive (or additive w.r.t t ), if there exists a positive edgeweighting function
It is well known that additive SR functions imply additive metrics: if Δ is an additive SR function for a model
Definition 2.3 (Nearadditive mapping)
A dissimilarity mapping d on L (t ) is said to be nearadditive w.r.t. t iff there exists a t additive mapping ^{d ⋆} s.t.
where
For our results we will be using a generalization of this criterion, in which the mapping d^{⋆} can be any affineadditive mapping, defined below.
Definition 2.4 (Affineadditive mapping)
A dissimilarity mapping
As with additive metrics, affineadditive mapping are also associated with edge weights.
Let d be a t additive mapping corresponding to the edgeweighting function w (·). Then the edge weighting function
Local consistency
Atteson’s result plays a central role in arguing the statistical consistency of distancebased
phylogenetic reconstruction. Typically, this is done by assuming that the interleaf
distances are computed using an SR function Δ which is additive for the underlying substitution model
1. If Δ is additive for
2. As the length of the input sequences grows, the estimated transition matrices
3. When
4. The nearadditivity of the estimated dissimilarity map
This line of argument has been used in numerous works studying statistical consistency
of distancebased algorithms (e.g.,
[2527]), and in all these cases an additive SR function is assumed. Notice, however, that
this line of argument remains valid when
Definition 2.5 (Consistent SR function)
An SR function Δ of a substitution model
The main idea endorsed in this paper is that if an SR function only deviates slightly
from some SR function which is affineadditive for
Deviation from additivity in homogeneous substitution models
In order to assess whether a given SR function Δ is consistent w.r.t. a given model tree t, one has to find an affineadditive mapping d^{⋆} which minimizes the ratio
It can be shown that such Δ is affineadditive in the model if and only if Δ (t )=at + B for some
Definition 2.6 (Deviation from additivity)
Let
Lemma 2.7 below presents the basic relation between deviation from additivity and consistency. In Section Performance of Non affineadditive SR functions in quartet resolution we demonstrate the tightness of this relation.
Lemma 2.7
Let
Proof
We need to show that
For all i,j ∈L (t ), denote
□
An upper bound on the deviation of an SR function Δ from additivity in a given interval [t_{0},t_{1}] is implied from the error associated with its linear interpolation At + B within that interval (
Figure 1. Deviation from additivity and stochastic error. (a) Δ_{JC} is portrayed (green) in the homogeneous submodel of K2P with R =10 in the interval t ∈[0.8,2]. Its linear interpolation in that interval, Δ_{int}=At + B, is plotted in blue, and the maximum difference between the two functions is designated
by X. The deviation of Δ_{JC} from additivity in this setting is
Lemma 2.8
Let
Proof
Let us start by introducing a couple of auxiliary notations:
We are looking for
Since Δ_{int}(t )=At + B is a linear interpolation of Δ in t_{0}t_{1}, we have ψ (ABt_{0})=ψ (ABt_{1})=0. Let t_{min} be an arbitrary point in the interval t_{0}t_{1}s.t. ψ (ABt_{min})=ψ_{min}≤0 and let (t_{2}t_{3}) be the maximal open interval in t_{0}t_{1} containing t_{min} in which ψ (ABt )<0 (this interval can be empty if ψ_{min}=0). We define a similar interval (t_{4}t_{5}) in which ψ (ABt )>0 around some arbitrary t_{max}s.t. ψ (ABt_{max})=ψ_{max}. Note that the intervals (t_{2}t_{3}) and (t_{4}t_{5}) are disjoint, and that Δ_{int} is the linear interpolation of Δ in both these intervals (since ψ (ABt_{2})=ψ (ABt_{3})=ψ (ABt_{4})=ψ (ABt_{5})=0). Therefore, the bound on the error of polynomial interpolation (see, e.g., [28], p. 187) implies that
Combining these, we get
□
Note
In Appendix 3 we prove that if Δ does not intersect its linear interpolation Δ_{int}=At + B within the interval (t_{0},t_{1}), then the function At + B^{∗}mentioned in the proof above is, in fact, the affineadditive function which minimizes the deviation from additivity of Δ in [t_{0},t_{1}]. This means that, in such cases, the first inequality in (9) holds in equality. The last inequality in (9) also holds in equality in such cases, because we are guaranteed to have either [t_{2},t_{3}]=[t_{0},t_{1}] (when Δ is bounded from above by its linear interpolation) or [t_{4},t_{5}]=[t_{0},t_{1}] (when Δ is bounded from below by its linear interpolation). Thus, in such a case, the bound of Lemma 2.8 is reduced to the bound on interpolation error (middle inequality in (9)). Cases where Δ does not intersect its linear interpolation are frequent among many SR functions of interest, as this condition holds when Δ is either convex or concave.
Deviation of Δ_{JC}from Additivity in K2P
We now turn to study the deviation of Δ_{JC} from additivity in homogeneous submodels of K2P with titv ratio
Note that the homogeneous K2P submodel with
We get that for any given titv ratio
After determining a collection of trees for which a given nonaffineadditive SR function,
Δ, is consistent, one can compare the performance of Δ with additive alternatives. In our case, we compare Δ =Δ_{JC}, which is not affine additive when
By a similar application of the delta method to Δ_{JC}, we obtain:
where k is the sequence length and
Figure
1 provides an illustrative comparison of Δ_{JC}and Δ_{K2P} under the homogeneous submodel of K2P with titv ratio R =10, and within the interleaf time interval of [0.8,2]. Figure
1a shows the deviation of Δ_{JC}from additivity in that setting, using its linear interpolation Δ_{int}=At + B . Note that Lemma 2.8 and the subsequent note imply that
Performance of Non affineadditive SR functions in quartet resolution
The quartet tree is the smallest phylogenetic tree with nontrivial topology. Focusing
on quartets enables a close study of the effects of deviation from additivity and
stochastic noise on reconstruction accuracy. The topology of a quartet spanning four
taxa {1,2,3,4} can be represented by the split notation (ij kl ) (where {ijkl }={1,2,3,4}), indicating that the internal edge of the quartet separates ij from kl . All distance based quartet resolution algorithms essentially reduce to the fourpoint
method (FPM)
[26,30], which resolves this split using the six observed pairwise distances
For concreteness, we assume henceforth that the quartet split is (1234), meaning that the sum of the exact evolutionary times t_{12} + t_{34} is minimal. We start by analyzing the impact of the deviation from additivity of Δ_{JC} on the consistency of quartet resolutions. First, observe that any monotone distance function is consistent for quartets in which t_{12} and t_{34} are the smallest interleaf distances  as is the case with symmetric quartets, in which all external edges are of the same length. Therefore, we study two prototypes of asymmetric quartets. The length of the internal edge in both types is t_{i}, and each type has two long external edges of length t_{l}, and two short external edges of length t_{s}. In type A quartets (Figure 2a), the short edges are on one side of the split and the long edges are on the other side. In this case d_{12}and d_{34} re the smallest and largest interleaf distances (resp.). Hence, the concavity of Δ_{JC} increases the separation between the sum d_{12} + d_{34} and the other two competing sums, leading to an expected improvement in reconstruction accuracy. The other quartet configuration (type B; Figure 2b) has a short edge and a long edge on both sides of the split. In this case, the interval of interpolation is [d_{13},d_{24}], and the distance d_{12}=d_{34} is near the center of this interval. Thus the concavity of Δ_{JC} decreases the separation between the sums d_{13} + d_{24} and d_{12} + d_{34} by approximately twice the deviation from additivity of Δ_{JC} in that range.
Figure 2. Performance of the Four Point Method using Δ_{JC} on K2P quartets with titv ratio R = 2. The concave non affineadditive SR function Δ_{JC} is shown (dashed green line) in the interval [t_{0},t_{1}], where t_{0} and t_{1} are the smallest and largest of the six pairwise distances (resp.). The dashed blue line shows the linear interpolation Δ_{int}=At + B of Δ_{JC} in the interval [t_{0},t_{1}]. Horizontal dotted lines correspond to half of the two competing sums computed by FPM under the two SR functions (see legend). (a) In quartets of type A, t_{0}=t_{12} and t_{1}=t_{34}, and so Δ_{int}(1,2) + Δ_{int}(3,4)=Δ_{JC}(1,2) + Δ_{JC}(3,4). However, for i ∈{1,2} and j ∈{3,4}, Δ_{int}(i,j )<Δ_{JC}(i,j ). Therefore, the deviation from additivity of Δ_{JC} increases its FPM separation, denoted _{SEPJC}, compared to the FPM separation _{SEPint} of Δ_{int}. (b) In quartets of type B, t_{0}=t_{13} and t_{1}=t_{24}, and so Δ_{int}(1,3) + Δ_{int}(2,4)=Δ_{JC}(1,3) + Δ_{JC}(2,4). However, Δ_{int}(1,2)=Δ_{int}(3,4)<Δ_{JC}(1,2)=Δ_{JC}(3,4), and so Δ_{int}(1,2) + Δ_{int}(3,4)<Δ_{JC}(1,2) + Δ_{JC}(3,4). Therefore, the deviation from additivity of Δ_{JC} decreases its FPM separation, denoted _{SEPJC}, compared to the FPM separation _{SEPint} of Δ_{int}. Note that _{SEPint} remains invariant in both types of quartets under fixed t_{i} whereas _{SEPJC} changes, depending on the type of quartet and the t_{s}/t_{l }ratio.
When the deviation from additivity exceeds half the length of the internal edge, the sum d_{13} + d_{24} becomes the minimal sum, and Δ_{JC} becomes inconsistent. Note that this demonstrates the tightness of the condition stated in Lemma 2.7, and in this sense, type B quartets provide a worst case scenario for quartet resolution by a concave SR function^{e}.
Next we turn to compare the accuracy of Δ_{JC} with that of Δ_{K2P} when used to reconstruct its “worst case scenario” quartets of type B. Interestingly, Δ_{JC} ends up outperforming Δ_{K2P} on many of these quartets, due to its reduced stochastic noise (as predicted in our discussion revolving around Figure 1b). For example, consider a series of homogeneous K2P quartets of type B with titv ratio R =5, whose edge lengths were set as follows: t_{i}=0.2, t_{l}=1.0, and t_{s}∈[0.2,1.0]. We assessed reconstruction accuracy for both SR functions (Δ_{JC}and Δ_{K2P}) across this series of quartets, by generating 100,000 simulations of the substitution process using 1,000 bp long sequences for each quartet (Figure 3a). Despite its deviation from additivity, Δ_{JC} outperforms the additive SR function Δ_{K2P} on many of these quartets (as long as t_{l}/t_{s}<3.6) . Note that as t_{s} shrinks, the deviation of Δ_{JC} from additivity increases, since the interval [t_{0},t_{1}] expands. This experiment appears to indicate that the deviation of Δ_{JC} from additivity has to be quite large for Δ_{K2P } to outperform it.
Figure 3. Performance of Δ_{JC} and Δ_{K2P} on a series of quartets of type B. A series of homogeneous K2P quartets is considered (left illustration), with titv ratio of R =5, and edge lengths t_{i}=0.2, t_{l}=1, and t_{s}∈[0.2,1]. (a) Reconstruction accuracy using FPM and either Δ_{JC} (dashed green) or Δ_{K2P} (solid red) plotted against t_{l}/t_{s}. Accuracy ratio is estimated using 100,000 independent replicates for each value of t_{s} in the interval [0.2,1] (in steps of 0.01), with sequence length 1,000 bp. (b) Fisher’s Criterion (FC) for the sums corresponding to splits (1234) and (1324) under either Δ_{JC} (dashed green) or Δ_{K2P} (solid red) plotted against t_{l}/t_{s}.
Fisher’s criterion for separability
We now present a simple and general framework based on the socalled Fisher Criterion (FC) for predicting the relative accuracy of two SR functions in resolving quartets. FC measures the effective separation between normal random variables X ∼n (μ_{1}σ_{1}) and Y ∼n (μ_{2}σ_{2}) using the following measure^{f} ( [20,21]):
We use FC to measure the separability of the distance sum corresponding to the true split (which should be the minimal sum for consistent SR functions) from the two remaining sums. For the expectation μ of each sum we use the true distances as computed by the SR function on the actual model parameters. For the variance ^{σ 2}, we use the sum of the approximate variances of the two distances involved in the sum. We expect that an SR function which provides a larger separation of the smallest sum from the two other sums will imply a better reconstruction probability.
We note that FC is not an exact indicator of the separability in our case, because the necessary criteria for this are not satisfied in our model. Namely, the two distance sums are not normally distributed, and they are correlated through the substitution process along the external edges of the quartet. Nevertheless, as Figure 3b suggests, FC turns out to provide a quite reliable comparison of the expected performance of Δ_{JC} and Δ_{K2P} for the quartet series considered in the aforementioned experiment. Figure 3b exhibits for each quartet the FC of Δ_{JC} alongside that of Δ_{K2P}, both associated with the comparison of the true split (1234) and the “Δ_{JC} favored split” (1324). As shown, the trends observed in both FC plots closely resemble the trends observed in the reconstruction accuracy plot (Figure 3a), and the the equilibrium point of the FC values of Δ_{JC} and Δ_{K2P} is very close to the equilibrium point of the accuracy of reconstructions of these two functions (near t_{l}/t_{s}=3.6).
A useful feature of this framework is the natural way in which it teases apart the stochastic noise from the deviation from additivity. If we denote the numerator of FC by SEP (for “separation”) and its denominator by NOISE, then a comparison of FC estimates between two SR function Δ_{1},Δ_{2} can be represented as a ratio of ratios:
Figure 4 illustrates how a comparison between the expected performance of Δ_{JC} and that of Δ_{K2P} can be carried out by tracing the SEP and NOISE ratios along four series of homogeneous K2P quartet: the bottomleft plot corresponds to the quartet series considered in Figure 3; the plot above it corresponds to the same series with titv ratio R =2; the two plots on the right describe two quartet series in which the weight of the short edges is constant t_{s}=0.2, and the weight of the long edges ranges in [0.2,1]. These four series demonstrate several typical trends in the behavior of the SEP and NOISE ratios. First, we observe that the NOISE ratio decreases (favoring Δ_{JC}) as the diameter of the quartet (t_{24}) increases (it is almost constant in the two series on the left, and monotone decreasing in the series on the right). This is because the diameter provides the major contribution to the stochastic noise (for both Δ_{JC} and Δ_{K2P}), and as it increases, the ratio between the stochastic noise of Δ_{K2P} and Δ_{JC} increases as well. We also observe a natural decrease in the NOISE ratio with an increase in the titv ratio (the NOISE ratio for R =5 is consistently smaller than for R =2). Concerning the SEP ratio, we see it becomes smaller (favoring Δ_{K2P}) as the quartet becomes more unbalanced (the SEP ratio decreases along the X axis in each of the four plots). This is because the deviation of Δ_{JC} from additivity increases as the interleaf distance interval [t_{0},t_{1}]=[t_{13},t_{24}] expands. Deviation of Δ_{JC} from additivity also increases with the titv ratio, as the substitution model further departs from the assumptions of JC (the SEP ratio for R =5 is consistently smaller than for R =2).
Figure 4. SEP and NOISE ratios.SEP (Δ_{JC})/SEP (Δ_{K2P}) (dashed) and NOISE (Δ_{JC})/NOISE (Δ_{K2P}) (dotted) plotted against t_{l}/t_{s} for four series of homogeneous K2P quartets of type B. Top two series have titv ratio of R =2, and bottom two series have titv ratio of R =5. Left two series have external edge lengths t_{l}=1 and t_{s}∈[0.2,1], and right two series have external edge lengths t_{l}∈[0.2,1] and t_{s}=0.2. The length of the internal edge is constant t_{i}=0.2 in all four series.
The two series on the right side of Figure 4 demonstrate well the tradeoff between the effects of stochastic noise and deviation from additivity. In both series, the SEP and NOISE ratios decrease as the quartets become more unbalanced (due to the trends listed above). However, the rates of decrease of these two ratios are different due to the different titv ratios, and this determines the expected relative performance of the two SR functions across the series. When R = 2, the SEP ratio decreases at a slower rate than the NOISE ratio, and Δ_{JC} is expected to outperform Δ_{K2P} across the entire series. When R = 5, the SEP ratio decreases at a faster rate than the NOISE ratio, and when the quartets are sufficiently unbalanced (t_{l}/t_{s}>4) Δ_{K2P} is expected to outperform Δ_{JC}.
Simulations on Hasegawa’s Tree
In this section we describe experiments done on simulated data sets generated along the seventaxon tree assembled by Hasegawa, Kishino, and Yano in 1985 [1,6]. This tree, spanning seven eutherian mammals (Figure 5a), was reconstructed originally using mitochondrial DNA sequences. It has a caterpillar topology (meaning that every internal node is incident to an external edge), and it has long external edges and short internal edges, making it a suitable representative of small phylogenetic trees spanning moderately distant species. These features also make it particularly challenging for distancebased reconstruction.
Figure 5. Simulations on Hasegawa’s Tree. (a) Reconstruction accuracy of four different SR functions on different scaled versions of Hasegawa’s tree [6]. The tree with scaled edge weights is depicted (left) next to the graph (right) plotting reconstruction accuracy of four SR functions. Different scales of the tree are considered, indicated by the diameter of the tree (X axis). Reconstruction accuracy (Y axis) is measured for each scaled tree by the average normalized RF distance between the reconstructed tree and the true tree across 10,000 simulated data sets. Simulations were carried out assuming a ti=tv ratio of R = 2 and sequence length of 500 bp. (b) A similar plot is shown for a semisymmetric caterpillar tree.
In our study we used the tree structure and edge lengths to generate simulated data
sets. We considered the tree in various scales, by setting the tree diameter (largest
intertaxon path length) to values in the interval [0.1,2.0]. For each scale considered,
10,000 simulations were carried out, where in each simulation 500 bp sequences were
evolved along the tree according to a homogeneous K2P substitution model with titv
ratio of R =2. For each simulated data set, estimated values of the K2P statistics p_{α }and p_{β}, denoted by
We studied the reconstruction accuracy associated with four different SR functions:
Δ_{JC}, Δ_{K2P}, Δ_{tv}, and Δ_{R=2}. The first two are as described in Equations (5) and (4), respectively. The third
SR function, Δ_{tv}, considers only tvtype substitutions:
The performance of these four SR functions is traced across the different tree scales in Figure 5a. For each SR function Δ and scale s, we recorded the average normalized RF distance from the true tree to each of the 10,000 trees reconstructed using Δ . The RF distance was normalized by its maximum value which is twice the number of internal edges in the tree (in our case 2×4=8). As observed previously in [9], Δ_{K2P} performed well in shorter scales, and Δ_{tv} performed well in longer scales. However, both additive SR functions were significantly outperformed in nearly all cases by Δ_{JC}. Surprisingly, Δ_{JC} even slightly outperformed Δ_{R=2}. We speculate that this happened due to a bias similar to the one observed in type A quartets in Section Performance of Non affineadditive SR functions in quartet resolution, improving the performance of concave SR functions such as Δ_{JC} on certain K2Ptrees.
To test this hypothesis, we went through a similar experiment with a more symmetric seventaxon caterpillar tree, with internal edges of uniform length t_{int}, and external edges of uniform length t_{ext}=5t_{int }(Figure 5b). The symmetry of this tree was expected to reduce the effect of the reconstruction bias observed in Hasegawa’s tree, and indeed, Δ_{JC} performed much more poorly on this tree. Despite this fact, Δ_{JC} still outperformed Δ_{K2P} in all scales and Δ_{tv} in the smaller scales (s <1.1).
Inferring trees from genomic sequences
In this section we describe our study comparing various SR functions on genomic DNA sequences. Next to Δ_{JC} and Δ_{K2P} we also considered the well known LogDet SR function [36,37], denoted here as Δ_{LogDet}. Extending our study to this setting is challenging in two respects. First of all, unlike the simulated case, the true tree is not known with complete confidence, and accuracy of reconstruction can only be determined by using a wellaccepted reference tree that may contain some errors. Secondly, the true substitution model is also unknown and is likely to violate the assumptions of both JC and K2P models and even the relaxed assumptions of the general timereversible model (in which Δ_{LogDet} is additive). Hence, we have to assume in this case that Δ_{JC}, Δ_{K2P}, and Δ_{LogDet} are all non affineadditive, where Δ_{JC} and Δ_{K2P} are still likely to exhibit higher deviation from additivity than Δ_{LogDet}, since they make stronger assumptions on the substitution model.
The genomic data set
In building the genomic data set, we made use of a set of 31 clusters of orthologous groups (COGs) which was compiled by Ciccarelli et al. and used for inferring phylogenetic relationships amongst a large number of species in [38,39]. These 31 gene families were selected to capture the evolutionary history of the species containing them. This was done in [38] by making sure that the genes in these families have the following properties: (1) they are highly conserved across species, (2) they have a small number of paralogs, and (3) they are weakly affected by horizontal gene transfer. We scanned the NCBI genome database and found 199 bacterial genomes that contained all annotated COGs. For each of the 31 COGs, we extracted the appropriate protein sequence in each of the 199 bacterial species, choosing an arbitrary paralog in cases of multiple hits. We followed a procedure similar to the one described in [38,39] to obtain reliable multiplesequence alignments for each COG: we computed a 199way multiple alignment of the protein sequences of each COG using HMMalign [40] and then mapped each protein sequence back to its coding DNA sequence. The conserved parts of each of the 31 DNA alignments were extracted using GBLOCKS [41] to filter out alignment columns with 50% or more gap symbols. The alignments were manually scanned, and 36 species which contributed a large number of gaps to the alignments were removed from the subsequent analysis. The 31 different alignments were concatenated to form one long 163way multiple sequence DNA alignment.
For the reference tree we used the phylogenetic tree of microbial species provided by the ARBSILVA Living Tree Project [42]. This tree, spanning 8,029 species at the time of writing, is based on a widely accepted analysis of the small subunit (SSU) 16S RNA. A subtree spanning our 163 bacterial species was extracted from this tree and treated as the true phylogenetic tree in our analysis.
Reconstruction accuracy for tenspecies subsets
We used the base set of 163 species to generate 40,000 random 10species subalignments. The random selection process was guided to generate species subsets corresponding to a wide range of diameter scales (a blind random selection process is biased toward subsets with large diameters). For each of the 40,000 subsets, a 10way subalignment was extracted from the original 163way alignment, and in this alignment we extracted only columns corresponding to fourfold degenerate sites that do not have any gap symbol. This is done to make sure the sites used for distance estimation have undergone a substitution process that is as uniform as possible along the different lineages and across the different sites. Each subalignment was used to compute three distance matrices – one under Δ_{JC}, one under Δ_{K2P}, and one under Δ_{LogDet}. The latter was calculated by the version that is implemented in the PHYLIP package. The NJ algorithm was then applied to the three matrices and the resulting trees were compared to the true tree (as depicted by the appropriate LTP subtree) according to the RF distance.
As an additional comparison, we used a fourth reconstruction technique. This method (termed BIONJGTR) used the BIONJ reconstruction algorithm [43] on distances obtained under the general timereversible model with invariant sites and Gamma distribution of rates across variant sites (GTR+Γ +I) [8,44].
The PhyML package [45] was used to infer this tree for each of the 40,000 subsets. We selected the GTR+Γ +I model since it was found by the MEGA5 software [46] to provide the best fit to the sequence data. The 40,000 sampled instances were partitioned into eight bins according to the RF distance observed between the BIONJGTR tree and the true (LTP) tree, and average RF distances were recorded for each of the three SR functions in each bin. This allowed us to observe trends throughout these 40,000 samples (Figure 6). Of the 40,000 trees inferred under Δ_{JC}, 83.1% showed an equal or lower RF distance than those reconstructed by the BIONJGTR method. Moreover, Δ_{JC} outperformed Δ_{K2P} and Δ_{LogDet} on average in all partitions, and Δ_{LogDet} showed by far the worst performance with 48.7% of all reconstructed trees achieving higher RF distances to the reference tree than those inferred by BIONJGTR. As with our results on simulated data sets, we see that the SR functions with lower stochastic error but inferior model fit performed best. Unsurprisingly, the GTR+G+I model itself, which was predicted to have the best fit to the sequence data, was often outperformed by the simpler JC and K2P models. Note that the difference in performance between Δ_{JC} and the two other SR functions is greater for subsets that are more accurately reconstructed by the BIONJGTR approach (the lower bins). This appears to indicate that oversimplified distance methods are particularly beneficial when the sequence data conveys a stronger phylogenetic signal.
Figure 6. Evaluation against BIONJGTR tree. The 40,000 subsets of size 10 were partitioned according to the the RFdistance between the reference LTP tree and the tree reconstructed using BIONJGTR (X axis). The (left) Y axis describes the mean difference between the RFdistance associated with a tree reconstructed using a particular SR function (Δ_{K2P}, Δ_{JC}, or Δ_{LogDet}) and the RFdistance associated with the BIONJGTR tree. The bar plot in the background depicts the number of subsets in each bin.
Conclusions
In this paper we explored the basic properties of methods for estimating evolutionary
distances, and studied how these properties affect the accuracy of distancebased
phylogenetic reconstruction. We considered both the systematic bias and the stochastic
noise (variance) of the distance estimators, and examined the tradeoff between these
two factors. We focused on the common task of phylogenetic reconstruction under homogeneous
substitution models. Assuming homogeneous models simplifies the analytical framework,
since in such models each SR function is reduced to a univariate function of the evolutionary
time t . However, obtaining accurate estimates of t is still a hard task in this setting, since the unit rate matrix is unknown. An SR
function Δ is guaranteed to yield consistent reconstruction across all trees in a homogeneous model only if it is additive, meaning that it is a linear
function of t . When Δ is not additive, it introduces a systematic bias in distance estimates, which we
denoted here as deviation from additivity . Some SR functions are only additive in one homogeneous model, whereas others are
additive across a wider collection of homogeneous models. This less constrained additivity
is typically achieved at a price of increased estimation noise. We studied the tradeoff
between “deviation from additivity” and “estimation noise” via a case study where
the model tree is a homogeneous K2Ptree with an unknown titv ratio R . In this case, Kimura’s distance formula Δ_{K2P} is always additive, while the less noisy Jukes Cantor’s formula, Δ_{JC}, is additive only when
A study of this type requires a way to measure the deviation from additivity of a nonadditive SR function Δ in a given range of distances [t_{0},t_{1}]. To this end, we introduced the concept of affineadditive distance functions, and defined the deviation from additivity of Δ in [t_{0},t_{1}] as the distance of Δ from its closest affineadditive function in [t_{0},t_{1}]. We established a tight connection between this measure and statistical consistency of reconstruction (Lemma 2.7) and derived an upper bound for deviation from additivity in homogeneous models (Lemma 2.8). We applied these results in analyzing the deviation from additivity of Δ_{JC}, and its effect on the accuracy of reconstructing homogeneous K2Ptrees. We then showed, both analytically (in Section Deviation from additivity in homogeneous substitution models) and through experiments on simulated data sets (in Sections Performance of Non affineadditive SR functions in quartet resolution and Simulations on Hasegawa’s Tree), that, compared to Δ_{K2P}, it is often better to use the nonadditive but less noisy estimates of Δ_{JC}, even when R is quite high. Somewhat surprisingly, we found this to be the case even when the tree being reconstructed has an “unfavorable” topology. Our experiments on bacterial gene sequences (Section Inferring trees from genomic sequences) also indicate that the simple and less noisy SR functions perform better on average than ones that are expected to better fit the true substitution process.
The framework presented in this paper implies a practical way for selecting SR functions which are likely to increase the accuracy of distance estimation. The practicality of the method is drawn from the fact that the criteria by which we select an SR function depend only a relatively crude information about the tree being reconstructed. For instance, in the case of a homogeneous K2Ptree, one can easily obtain from the input sequences rough estimates of both the titv ratio R and the range of interleaf times [t_{0},t_{1}]. These estimates can then be used to compare the expected accuracies of Δ_{JC} and Δ_{K2P} on the given input, and determine which of them is more likely to yield an accurate phylogeny. For quartets, a tight comparison can be made using the FCbased approach suggested in Section Fisher’s Criterion for Separability, and for larger trees, a cruder comparison can be made using a plot like the one presented in Figure 1b. A promising avenue of further research is to extend the FCbased approach to allow tighter prediction of reconstruction accuracy of trees spanning more than four taxa.
Endnotes
^{a}This is a WABI 2011 special issue invited paper. Extended abstract of this paper appeared in [47]. ^{b}Typically, the unit rate matrix is assumed to be the one corresponding to one substitution per site. ^{c}Many common distancebased algorithms, such as the Neighbor Joining (NJ) algorithm [31,32], are known to be robust in this sense. ^{d}In a tree, edges which touch leaves are external, and all other edges are internal. ^{e}Types A and B quartets represent the Farris zone and Felsenstein zone, resp. (see, e.g., [1], Chapter 9). ^{f}We use here the square root of the criterion commonly used in the literature, because we prefer to think in terms of distances rather than squares of distances. This has no practical influence, since we use FC only for comparing between different choices, not for assessing the quality of a give choice.^{g} This ML estimate is obtained by a simple numerical method for maximizing the likelihood function (see, e.g., [1]).
Appendix
Tightness of Lemma 2.8
Let f (t ) be a (continuous) function on some interval [t_{0},t_{1}]. We prove below that if f does not intersect its linear interpolation At + B in that interval, then
Lemma 2.9
Let f (t ) be a monotone increasing function in the interval [t_{0},t_{1}] and let At + B be its linear interpolation in [t_{0},t_{1}]. If either f (t )≥At + B for all t ∈[t_{0},t_{1}] or f (t )≤At + B for all t ∈[t_{0},t_{1}], then for all a >0, we have
Proof
We prove the minimality of
For a >0, let B_{a}be the maximum value of B^{′}s.t.
Figure 7. Proof of Lemma 2.9. A function f(t) is portrayed (bold) with its linear interpolation At + B =At + B_{A}(green) in the interval [t_{0},t_{1}], s.t.f (t)≥At + B for allt ∈[t_{0},t_{1}]. Equation (18) is illustrated for a <A on the right, and equation (19) is illustrated for a >A on the left.
For a <A, we get the following equality (Figure 7; right):
Hence, since ψ (a,B_{a})≥ψ (a,B_{a},t ) for every t ∈[t_{0},t_{1}], and since a <A, we get
Similarly, if a >A, we get the following equality (Figure 7; left)
and a >A implies that
□
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors participated in discussing, formulating, and modulating the research. DD performed the simulations and experiments of Sections Simulations on Hasegawa’s Tree and Section Inferring trees from genomic sequences. IG and SM initiated and directed the research and drafted the manuscript. IY performed the analysis in Sections Deviation from additivity in homogeneous substitution models and Section Performance of Non affineadditive SR functions in quartet resolution and contributed to the ideas of the project. All authors contributed to the writing and editing of the manuscript, and all authors read and approved the final manuscript.
Acknowledgements
This research was supported by the Israel Science Foundation (ISF) grant No. 509/11. We also acknowledge the support for the publication fee by the Deutsche Forschungsgemeinschaft and the Open Access Publication Funds of Bielefeld University. The third author would also like to thanks the Max Planck Institute for Informatics for supporting a visit under which part of this research was carried out.
References

Felsenstein J: Inferring Phylogenies. Sunderland: MA Sinauer Associated Inc; 2004.

Semple C, Steel M: Phylogenetics. Oxford University Press; 2003.

Papoulis A, Pillali SU: Probability, Random Variables and Stochastic Processes. New York: McGraw Hill Higher Education; 2002.

Jukes T, Cantor C: Evolution of Protein Molecules. In Mammalian Protein Metab. Edited by Munro H. New York: Academic Press; 1969:21132.

Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences.
J Mol Evol 1980, 16(2):111120. PubMed Abstract  Publisher Full Text

Hasegawa M, Kishino H, Yano T: Dating of the humanape splitting by a molecular clock of mitochondrial DNA.
J Mol Evol 1985, 22(2):160174. PubMed Abstract  Publisher Full Text

Tavaré S: Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences.

Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates.
J Mol Evol 1984, 20:8693. PubMed Abstract  Publisher Full Text

Gronau I, Moran S, Yavneh I: Towards Optimal Distance Functions for Stochastic Substitution Models.
J Theor Biol 2009, 260(2):294307. PubMed Abstract  Publisher Full Text

Gronau I, Moran S, Yavneh I: Adaptive Distance Measures for Resolving K2P Quartets: Metric Separation versus Stochastic Noise.

Felsenstein J: Cases in which parsimony or compatability methods will be positively misleading.
Syst Zool 1978, 27:401410. Publisher Full Text

Cavender J: Taxonomy with confidence.
Math Biosci 1978, 40:271280. Publisher Full Text

Steel M, Penny D: Parsimony, likelihood, and the role of models in molecular phylogenetics.
Mol Biol Evol 2000, 17:839850. PubMed Abstract  Publisher Full Text

Sober E: A likelihood justification of parsimony.
Cladistics 1985, 1:209233. Publisher Full Text

Felstenstein J, Sober E: Parsimony and likelihood: an exchange.
Syst Zool 1986, 35:617626. Publisher Full Text

Yang Z: How often do wrong models produce better phylogenies?
Mol Biol Evol 1997, 14:105108. PubMed Abstract  Publisher Full Text

Bruno WJ, Halpern AL: Topological bias and inconsistency of maximum likelihood using wrong models. [http://wwwt10.lanl.gov/billb/BrunoHalpern99.pdf] webcite
Mol Biol Evol 1999, 16(4):564566. PubMed Abstract  Publisher Full Text

Zharkikh A: Estimation of evolutionary distances between nucleotide sequences.
J Mol Evol 1994, 39(3):315329. PubMed Abstract  Publisher Full Text

Gascuel O, Guindon S: Efficient Biased Estimation of Evolutionary Distances When Substitution Rates Vary Across Sites.
Mol Biol Evol 2002, 19(4):534543. PubMed Abstract  Publisher Full Text

Fisher R: The use of multiple measurements in taxonomic problems.

Duda R, Hart P: Pattern Classification and Scene Analysis. Hoboken: John Wiley and Sons; 1973.

Sumner J, FernandezSanchez J, Jarvis P: Lie Markov Models.
J Theor Biol 2012, 298:1631. PubMed Abstract  Publisher Full Text

Buneman P: The recovery of trees from measures of dissimilarity. In Mathematics in the Archeological and Historical Sciences. Edited by Hodson F, Kendall D, Tautu P. Edinburgh University Press; 1971:387395.

Sattath S, Tversky A: Additive similarity trees.
Psychometrica 1977, 42(3):319345. Publisher Full Text

Atteson K: The Performance of NeighborJoining Methods of Phylogenetic Reconstruction.
Algorithmica 1999, 25:251278. Publisher Full Text

Erdos P, Steel M, Szekely L, Warnow T: A few logs suffice to build (almost) all trees (I).
Random Struct Algorithms 1999, 14:153184. Publisher Full Text

Erdos P, Steel M, Szekely L, Warnow T: A few logs suffice to build (almost) all trees (II).
Theoret Comput Sci 1999, 221:77118. Publisher Full Text

Johnson L, Riess R: Numerical Analysis. Boston: Addison Wesley; 1977.

Zaretskii K: Constructing a tree on the basis of a set of distances between the hanging vertices.
Uspekhi Mat Nauk 1965, 20(6):9092.
[In Russian]

Saitou N, Nei M: The neighborjoining method: a new method for reconstructing phylogenetic trees.
Mol Biol Evol 1987, 4:406425. PubMed Abstract  Publisher Full Text

Studier J, Keppler K: A note on the neighborjoining algorithm of Saitou and Nei.
Mol Biol Evol 1988, 5(6):729731. PubMed Abstract  Publisher Full Text

Robinson F, Foulds R: Comparison of phylogenetic trees.
Math Biosci 1981, 53:131147. Publisher Full Text

Rambaut A, Grass NC: SeqGen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees.
Comput Appl Biosci 1997, 13(3):235238. PubMed Abstract

Felsenstein J: PHYLIP  Phylogeny Inference Package (Version 3.2).

Steel M: Recovering a tree from the leaf colourations it generates under a Markov model.
Appl Math Lett 1994, 7(2):1924. Publisher Full Text

Lockhart P, Steel M, Hendy M, Penny D: Recovering evolutionary trees under a more realistic model of sequence evolution.
Mol Biol Evol 1994, 11(4):605612. PubMed Abstract  Publisher Full Text

Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward Automatic Reconstruction of a Highly Resolved Tree of Life.
Science 2006, 311(5765):12831287. PubMed Abstract  Publisher Full Text

von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P: Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments.
Science 2007, 315(5815):11261130. PubMed Abstract  Publisher Full Text

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1999.

Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments.
Syst Bio 2007, 56:564577. Publisher Full Text

Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, RosselloMora R: Update of the AllSpecies Living Tree Project based on 16S and 23S rRNA sequence analyses.
Syst Appl Microbiol 2010, 33:291299. PubMed Abstract  Publisher Full Text

Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data.
Mol Biol Evol 1997, 14(7):685695. PubMed Abstract  Publisher Full Text

Rodriguez F, Oliver JL, Marin A, Medina JR: The general stochastic model of nucleotide substitution.
J Theor Biol 1990, 142:485501. PubMed Abstract  Publisher Full Text

Guindon S, Gascuel O: A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood.
Syst Biol 2003, 52:696704. PubMed Abstract  Publisher Full Text

Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S: MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods.
Mol Biol Evol 2011, 28:27312739. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Doerr D, Gronau I, Moran S, Yavneh I: Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions. In Algorithms in Bioinformatics, Volume 6833 of Lecture Notes in Computer Science. Edited by Przytycka T, Sagot MF. Berlin / Heidelberg: Springer; 2011:4960.