Abstract
Background
Supertree methods combine trees on subsets of the full taxon set together to produce a tree on the entire set of taxa. Of the many supertree methods, the most popular is MRP (Matrix Representation with Parsimony), a method that operates by first encoding the input set of source trees by a large matrix (the "MRP matrix") over {0,1, ?}, and then running maximum parsimony heuristics on the MRP matrix. Experimental studies evaluating MRP in comparison to other supertree methods have established that for large datasets, MRP generally produces trees of equal or greater accuracy than other methods, and can run on larger datasets. A recent development in supertree methods is SuperFine+MRP, a method that combines MRP with a divideandconquer approach, and produces more accurate trees in less time than MRP. In this paper we consider a new approach for supertree estimation, called MRL (Matrix Representation with Likelihood). MRL begins with the same MRP matrix, but then analyzes the MRP matrix using heuristics (such as RAxML) for 2state Maximum Likelihood.
Results
We compared MRP and SuperFine+MRP with MRL and SuperFine+MRL on simulated and biological datasets. We examined the MRP and MRL scores of each method on a wide range of datasets, as well as the resulting topological accuracy of the trees. Our experimental results show that MRL, coupled with a very good ML heuristic such as RAxML, produced more accurate trees than MRP, and MRL scores were more strongly correlated with topological accuracy than MRP scores.
Conclusions
SuperFine+MRP, when based upon a good MP heuristic, such as TNT, produces among the best scores for both MRP and MRL, and is generally faster and more topologically accurate than other supertree methods we tested.
Keywords:
MRP; MRL; supertrees; phylogeneticsBackground
Because estimation of large trees is computationally challenging [13] and topological error tends to increase with the number of taxa [47], supertree methods (which estimate trees on full sets of taxa from sets of smaller trees) may be key to accurate estimations of the Tree of Life. Many supertree methods have been proposed: see [8] for an overview of early methods, and also [917]. Some of these (e.g., the RobinsonFoulds supertree approach in [9]) operate only on rooted source trees, while others (e.g., the Maximum Likelihood Supertree Method in [15]) are only theoretical (i.e., have not yet been implemented). Of the various methods that are implemented, MRP (Matrix Representation with Parsimony) [18,19] is by far the most frequently used. Furthermore, studies have shown that of these methods, only MRP produces highly accurate supertrees on datasets of unrooted source trees with large numbers of taxa [17,20].
MRP operates in two steps. Given a set of source trees with set S of taxa, the first step produces a large matrix, called the MRP matrix, with one row for every taxon in S and one column for every edge of every tree in . For a given edge e in a given source tree t, the column in the MRP matrix has entries over {0,1, ?}, with 0 given for the taxa that are on one side of the edge e, 1 for the taxa on the other side, and ? for all the remaining taxa (i.e., the ones that do not appear in the tree t). This way of encoding each source tree is called the "BaumRagan" coding; however, when the source trees are rooted, other techniques (e.g., the Purvis coding) can be used (see a comparison between these coding methods in [20]). The second step then uses the maximum parsimony criterion to produce a tree on the MRP matrix. The MRP approach thus depends on whether substitutions from 0 to 1 have the same cost as substitutions from 1 to 0, and hence also depends on how the states 0 and 1 are assigned to the leaves in each source tree, for each given edge. In particular, the MRP matrix definition depends upon whether the trees in are rooted or unrooted. The simplest version of MRP (and the one we study in this paper) treats all input trees as unrooted and uses the standard maximum parsimony criterion in which all substitutions have equal cost (this is called "reversible Fitch parsimony"). For this very simple version of MRP (i.e., BaumRagan encoding, followed by reversible Fitch parsimony), the choice of state (i.e., 0 or 1) for each side of each edge has no impact on the output, and so can be made arbitrarily. Methods for MRP are based upon techniques for the NPhard maximum parsimony problem [21]. The most popular MRP heuristics therefore use good heuristics for maximum parsimony (MP), such as PAUP* [22] and TNT [23].
Recently, Swenson et al. [24] introduced a new supertree method, SuperFine+MRP, that operates in two steps: in the first step, an incompletely resolved tree called the "strict consensus merger" (SCM) tree is computed, and in the second step MRP heuristics are used to refine each high degree node (polytomy) in the SCM tree. Their study showed that SuperFine+MRP produced topologically more accurate trees than MRP (both methods based upon the same MP heuristics in PAUP* [22]) and also ran in much less time. However, in some cases (in particular on very large supertree datasets), the SCM tree contained very large polytomies, so that refining the large polytomies could require a substantial time effort, thus reducing the running time advantage of SuperFine+MRP over MRP. Speeding up the analysis through a parallelization of SuperFine's refinement step is also hampered by the fact that refinement of very large polytomies using MRP is the most computationally intensive part of the SuperFine analysis [25].
Our objective was therefore to find an alternative to MRP for the refinement step within SuperFine. In this paper we examine supertrees estimated by analyzing the MRP matrix using RAxML's [26] fast heuristics for maximum likelihood under the symmetric 2state model (so that the change from 0 to 1 is as likely as the change from 1 to 0) with CAT distribution of rates across sites, and we refer to this as the S2+CAT model. We call the optimization problem in this approach to supertree estimation "matrix representation with likelihood", or MRL. Thus, MRL is the counterpart to MRP, and uses S2+CAT maximum likelihood instead of maximum parsimony as a criterion for estimating a supertree from the MRP matrix.
We report on a simulation study we performed to compare MRP to MRL (using fast heuristics for both) as supertree methods, and also to refine the SCM tree computed by SuperFine. Our study shows that using RAxML for MRL produces topologically more accurate trees than the MRP heuristics (PAUP* and TNT [23]) we studied, and that MRL scores (under GTR+gamma, see discussion below) correlate very well with tree accuracy (and in general better than MRP scores). These results suggest that MRL may be a useful optimality criterion for supertree estimation. Second, we show that SuperFine can be used to obtain better scores for MRP and good scores for MRL, and faster than the standard heuristics for these problems.
Methods and Materials
Datasets
Simulated datasets
We used 500 and 1000taxon datasets used in previous studies [17,28]. These supertree datasets consist of profiles of source trees, with each source tree computed by running RAxML [29] on DNA sequence alignments produced in simulation. These simulated datasets have realistic patterns of missing data, reflecting both biological processes and taxon sampling strategies used by systematists in phylogenetic studies. Two types of source trees were generated on the model trees: "cladebased source trees (each tree being a dense sample within a specific clade of the model tree), and "scaffold" source trees (a random sampling of a proportion of the taxa throughout the model tree). The proportion of taxa from the model tree that is sampled in the scaffold tree is called the "scaffold density". Supertrees are generally more accurate when estimated from dense rather than sparse scaffold trees. These simulated datasets have scaffold trees with four densities, 20%, 50%, 75% and 100%. Each supertree dataset consists of a number of cladebased source trees and one scaffoldbased source tree, but the number of cladebased source trees depended upon the number of taxa (15 for the 500taxon datasets and 25 for the 1000taxon datasets). For each scaffold density, there are 30 replicates with 500 taxa and 10 replicates with 1000 taxa. However, for scaffolds with low densities, a few of the datasets did not have sufficient taxonomic overlap to perform an accurate supertree analysis and were excluded from the results. In total, 4 of the 40 1000taxon datasets (3 from the 20% scaffold density and 1 from the 50% scaffold density) and 6 of the 120 500taxon datasets (all from the 20% scaffold density) were excluded from analysis.
Biological datasets
We examined the performance of the supertree methods on six biological datasets shown in Table 1. The number of source trees and taxa in each dataset varied from a dataset with a small number of taxa but many source trees (115 taxa and 726 source trees) to a dataset with a large number of taxa but fewer source trees (2,228 taxa and 39 source trees). The details for the generation of the seabirds, placental mammals, marsupial, and THPL source trees can be found in the references mentioned in Table 1. The comprehensive papilionoid legumes (CPL) dataset was originally studied in [30] as a combined analyses of 39 markers and consisted of 2228 taxa and 33,168 sites in the alignment; for this dataset, we used source trees that were estimated by Swenson et al. [17].
Table 1. Statistics for biological datasets
Supertree methods
Since earlier studies [17,28] established that the simplest version of MRP (unrooted source trees, all substitutions have equal cost) outperformed other base supertree methods, we used MRP as a benchmark.
MRP
For the MP heuristic used within MRP analyses, we ran PAUP* using the parsimony ratchet implementation, identically as in [17,28]. We also ran TNT [23], using a combination of sectorial search, tree drifting, and fusing (i.e., techniques within TNT that are effective for largescale parsimony analysis). We refer to these two ways to run MRP as MRP(PAUP*) and MRP(TNT). At the end of each MRP analysis, we had a collection of equally good MRP solutions, from which we produced a greedy consensus tree (also known as an extended majority consensus). Scripts used to generate the PAUP* and TNT runs are available upon request.
MRL
For the ML heuristic used within the MRL analyses, we used RAxML [29]. RAxML is potentially the most accurate ML heuristic for large datasets, and when used with its BINCAT model it can work directly with the MRP matrices. We refer to this way of running MRL by MRL(RAxML). Note that any ML package that supports the symmetric 2state model can be used instead of RAxML for the MRL analysis.
SuperFine
We briefly describe SuperFine [24]. SuperFine uses a twostep technique, where the first step produces a typically unresolved tree called the "Strict Consensus Merger" (SCM) tree, and the second step then refines this tree. The SCM tree is obtained by merging two trees at a time until all the trees are combined. Each of these pairwise mergers contracts edges on which the two trees either disagree or which have "collisions". At the end of the process, the final tree contains all the taxa, but may be only partially resolved. The order in which the trees are merged can impact the accuracy (and resolution) of the final tree, as observed in [31]; therefore, we use the same rule for determining the ordering on pairwise mergers as used in [24], which considers the size of the overlap in taxon sets when picking the pair of trees to merge.
The second step of SuperFine resolves the SCM tree, one node at a time. The resolution of a single high degree node (described below) depends only on the partition of the taxa into subsets, as defined by the node, and the topologies of the source trees; therefore, these resolutions are independent of each other and so the order does not matter.
To resolve a single high degree node v (i.e., polytomy) in the SCM tree, SuperFine first labels the neighbors of v by 1...d, where the SCM tree has d subtrees off of v (i.e., d = deg(v)). Next every leaf in each of d subtrees is relabeled by the label (from 1...d) assigned to the root of its subtree. At this point, SuperFine creates a new set of source trees, by modifying each of the input source trees so that each contains at most d leaves, as follows: If x is an internal node in a source tree that is adjacent to two leaves, each of which has the same label l, then we remove its neighboring leaves and relabel x by l. In [24], it was proven that this modification produces source trees that have at most one leaf with each label. SuperFine then applies its base supertree method (such as MRP) to compute a supertree on the modified source trees, each of which has at most d leaves. Since d may be much smaller than the number of taxa, this supertree estimation can be very fast. Finally, the resultant supertree produced for this polytomy is used to define the refinement at that node. Note, therefore, that the refinement at that node depends only on the base supertree method and the modified source trees, and that the modified source trees are defined by the source trees and the partition of the taxa defined by the node. Because the source trees and the partition of the taxa defined by a node v does not depend upon whether other polytomies are processed before or after v, the order in which the polytomies are processed has no impact on the output.
We use the terminology "SuperFine+MRP" to refer to SuperFine used with MRP to perform the refinement step, but SuperFine can be used with other base supertree methods. Thus, we use the terminology "SuperFine+MRL" to refer to SuperFine used with MRL to perform the refinement step. We ran SuperFine+MRP based upon PAUP* (using the same parsimony ratchet implementation as MRP(PAUP*)) and TNT (using the same combination of sectorial search, tree drifting, and fusing as MRP(TNT)), and refer to these two different versions by SuperFine+MRP(PAUP*) and SuperFine+MRP(TNT). We note that this way of running SuperFine+MRP(PAUP*) is identical to that reported in [24]. We ran SuperFine+MRL based upon RAxML (using the same RAxML commands as MRL(RAxML)), and refer to this as SuperFine+MRL(RAxML).
The software used in this study is available in opensource form by request from the authors; the datasets are available online at http://www.cs.utexas.edu/users/phylo/datasets/supertrees.html webcite.
Measurements
We compared the trees produced by the supertree methods (MRP, MRL, SuperFine+MRP, and SuperFine+MRL) to the true supertree (known because the data are simulated). We report two error rates: the missing branch rate (i.e., the percent of the internal edges in the true supertree missing in the estimated supertree, and also known as the false negative (FN) rate) and the false positive rate (i.e., the percentage of the internal edges in the estimated supertree that do not appear in the true supertree). For those estimated supertrees that are fully resolved, these two error rates will be equal. However, the false positive error will always be at most that of the false negative error rate, since the true supertree is always binary.
We computed the MRP scores of the estimated supertrees (i.e., the MP scores of the trees with respect to the MRP matrix). We report ML scores under S2+Γ (the symmetric 2state model with gamma distribution of rates across sites) instead of under the S2+CAT model. This is motivated by the observation that RAxML's search under S2+Γ is computationally more intensive than its search under S2+CAT, and that both searches return trees of comparable topological accuracy. However, the preferred models for phylogeny estimation have used gammadistributed rates instead of CATdistributed rates. For these two reasons, we report ML scores under gammadistributed rates for trees found using the S2+CAT model, and we call these the "MRL scores". Finally, since MRP methods return a set of most parsimonious trees, we report the false positive rates, false negative rates, and MRL scores with respect to the greedy consensus of the set of trees. For the MRP scores, we report the best MP score found during the heuristic search.
For the biological datasets, since the true supertrees are not known, we computed the SumFN [17] distance to the source trees, which is the percent of the internal edges in the source trees missing in the estimated supertree. Note that SumFN is identical to SumRF (the sum of the RF distances) when the source trees and the estimated supertree are binary. We also computed the MRP and MRL scores of the estimated supertrees.
Finally, we also report average running times for each method on each model condition as well as on the biological datasets.
Correlation Analyses
For each simulated dataset, we examined how well each of the scores (MRP score, MRL score, and SumFN) correlate with missing branch rates. Since we only have six estimated supertrees per dataset, we generated more trees for each dataset to run our correlation analyses, using pedgecontractandrefine (pECR) [32] moves as follows. A pECR move operates by randomly contracting p edges in a tree and then randomly refining the resultant unresolved tree. For each dataset in each model condition, the six estimated supertrees were used to generate a set of pECR neighboring trees, with p drawn between 0 and 25% of the internal edges. This was repeated 100 times per supertree, resulting in a total of 600 trees. We then compute the MRL, MRP, SumFN, and missing branch rates for each of the 606 trees (600 pECR trees plus 6 supertrees). In other words, the results we report are for 114 different 500taxon supertree datasets (24 from the 20% scaffold density and 30 from the remaining three model conditions) and 36 different 1000taxon datasets (7 from the 20% scaffold density, 9 from the 50% scaffold density, and 10 from the remaining two model conditions). We compute the Spearman's rank correlation between the MRL, MRP, and SumFN scores and the missing branch rates. We then averaged the Spearman's rank correlation coefficient across replicates and report the average Spearman's rank correlation as a function of the model condition.
Results
Simulated Datasets
We begin our discussion of results by examining the results for the simulated datasets. We focus primarily on the average missing branch rates for each model condition (as defined by the scaffold density), but also note the false positive rate (which can differ from the missing branch rate when the supertrees are not fully resolved). In addition, we note the MRP and MRL scores; this allows us to consider whether using better heuristics (for the optimization problems) result in improved topological accuracy. Finally, we also examine running times.
Topological error rates
We begin with a discussion of the missing branch rate for the estimated supertrees (Figure 1 and Table 2 for 1000taxon datasets, and Table 3 for 500taxon datasets). Note that the supertree methods had very close error rates when the scaffold tree contains all the taxa (i.e., the scaffold density is 100%), that all methods improved in accuracy as the scaffold density increases, and that the difference in error rates between methods decreased with the increase of scaffold density (trends already observed for SuperFine+MRP and other supertree methods in [17,28]). Since biological supertree datasets often do not contain scaffold trees containing all the taxa (and frequently contain only sparsely sampled scaffold trees), we focus our attention on performance on the sparse scaffold trees, with 20% or 50% scaffold densities.
Figure 1. Average missing branch rates and running times for 1000taxon model conditions. The average missing branch rates and running times (in minutes) for the supertree methods for the 1000taxon model conditions as a function of scaffold density. The standard error is shown for the missing branch rates, and the standard deviation is shown for the running times. Averages are computed only on replicates where there is sufficient taxonomic overlap to perform an accurate supertree analysis. n = 10 for all scaffold densities except n = 7 for the 20% scaffold density, and n = 9 for 50% scaffold density.
Table 2. Missing branch rates for 1000taxon model conditions
Table 3. Missing branch rates for 500taxon model conditions
The next observation is that where there was any noticeable difference between supertree methods, MRP(TNT) and MRP(PAUP*) gave the highest missing branch rates, and that MRL(RAxML) and all versions of SuperFine gave the most accurate trees. Also, the difference between the error rates was greatest on the 20% scaffold density conditions, and decreased as the density of the scaffold tree increased, as expected.
An important observation is that although there were some statistically significant differences, the error rates of the various versions of SuperFine never differed by more than 1%. To some extent this is expected, since the majority of the SuperFine tree topology is produced by its first step, when it computes the SCM tree, and the second step (here performed using MRL or MRP) is limited to refining the SCM tree.
We now discuss the false positive rates; Table 4 gives these rates for 1000taxon datasets and Table 5 gives rates for 500taxon datasets. We note that the false positive and missing branch rates were nearly identical for the MRP and MRL trees, indicating that these trees were fully resolved. By contrast, the SuperFine trees were not always fully resolved, and hence had lower false positive rates than their missing branch rates.
Table 4. False positive rates for 1000taxon model conditions
Table 5. False positive rates for 500taxon model conditions
Finally, we discuss the SCM tree. We note that the missing branch rates (Tables 2 and 3) were quite high, and that the false positive rates (Tables 4 and 5) were quite low (although not equal to zero). The high false negative rate means that the SCM tree is not a good point estimate of the true tree, an observation also established in [24]. On the other hand, its low false positive rate means that most of its edges are likely to be true of the true tree, and makes it a good constraint tree (which is how it is used within SuperFine).
Which methods give good MRL and MRP scores?
With respect to MRL scores, not surprisingly, MRL(RAxML) gave the best MRL scores (Tables 6 and 7). The next best methods were SuperFine+MRP(TNT) and SuperFine+MRP(PAUP*). MRP(PAUP*) typically gave the least accurate MRL scores.
Table 6. MRL scores for 1000taxon model conditions
Table 7. MRL scores for 500taxon model conditions
With respect to MRP scores (Tables 8 and 9), we find that SuperFine+MRP(TNT) and SuperFine+MRP(PAUP*) had nearly identical performance and the best scores of all methods for scaffold densities of 20% and 50%. For denser scaffolds, MRP(TNT), SuperFine+MRP(TNT), and SuperFine+MRP(PAUP*) had the best scores. At 100% scaffold density, MRP(PAUP*) also produced the best scores. Thus, although MRP(PAUP*) and MRP(TNT) directly try to optimize MRP scores, SuperFine+MRP(PAUP*) and SuperFine+MRP(TNT) gave better MRP scores, and hence were more effective heuristics for MRP, especially for sparse scaffolds.
Table 8. MRP scores for 1000taxon model conditions
Table 9. MRP scores for 500taxon model conditions
These results together suggest that the best MRP heuristics are SuperFine+MRP(PAUP*) and SuperFine+MRP(TNT) (followed by MRP(TNT)), while the method that generally produces the best MRL scores is MRL(RAxML), followed closely by SuperFine+MRP(PAUP*) and SuperFine+MRP(TNT).
Running Time
In many ways, the different variants of SuperFine are extremely close, producing trees of almost identical topological accuracy (among the most accurate across all scaffold densities), and producing good heuristics for MRP and MRL. However, how do they perform in terms of running time? We focus here on the results for the 1000taxon datasets, shown in Figure 1 and Table 10 (see Table 11 for 500taxon datasets). All versions of the SuperFine methods were fast, finishing in all the simulated datasets in just a few minutes (on average, about eight (8) minutes on the 1000taxon datasets). MRP(TNT) was also fast (finishing in under 5 minutes on all these datasets), but the remaining methods were much slower: MRL(RAxML) often took more than 1.5 hours and MRP(PAUP*) took close to an hour on the 1000taxon datasets.
Table 10. Running times for 1000taxon model conditions
Table 11. Running times for 500taxon model conditions
The tradeoffs between the different methods can be seen more clearly on the sparse scaffold conditions. Figure 2 shows that MRL(RAxML) was slow but very accurate, MRP(TNT) was very fast but inaccurate, MRP(PAUP*) fell in between both of these methods, and all SuperFine methods were both accurate and fast.
Figure 2. Scatterplot of average missing branch rates versus running times for 1000taxon 50% scaffold density model conditions. The average missing branch rate versus running times (in minutes) for the supertree methods for 9 replicates of the 1000taxon 50% scaffold density model conditions.
Correlation of MRL, MRP, and SumFN with tree error
We also consider the question of how well the MRL, MRP, and SumFN scores correlate with tree error (as measured by FN rate). In other words, is it helpful to find a supertree that optimizes MRL, MRP, or SumFN?
Some trends are immediately obvious from the correlation analysis (Tables 12 and 13). First, all three scores were statistically correlated with tree error according to Spearman's rank correlation test and after Bonferroni correction for multiple hypothesis testing (pvalues not shown). Second, SumFN and MRP scores had roughly the same correlation coefficient across the scaffold densities. Focusing on low scaffold densities, MRL had a much larger correlation coefficient than MRP and SumFN. In general, MRL and tree error were more strongly correlated at all scaffold densities, except at 100% scaffold densities, where all scores correlated strongly with tree error. We also note that all pairwise correlations became stronger as the scaffold density increased.
Table 12. Correlation analyses for 1000taxon model conditions
Table 13. Correlation analyses for 500taxon model conditions
This preliminary study showing a relatively strong correlation between MRL scores and tree error suggests that methods for optimizing MRL scores may have some inherent value, especially for the low scaffold density conditions where the correlation between MRP scores and tree error is much lower.
Biological datasets
Because we do not have a reliable "true species tree" for the biological datasets, we compare estimated supertrees in terms of SumFN, MRP, and MRL scores. While we have shown the correlation of these scores to topological error for the simulated datasets, the extent of correlation for each empirical dataset is not known. Nevertheless, these scores, when considered together, enable a framework, albeit an imperfect one, for evaluating estimated supertrees. Table 1 shows the reference for each biological dataset and various empirical statistics (number of source trees, number of taxa, the scaffold density, degree of resolution for the SCM tree). We show SumFN scores in Table 14, MRP scores in Table 15, MRL scores in Table 16, and running time in Table 17.
Table 14. SumFN rates for the biological supertrees
Table 15. The MRP scores (MP scores with respect to the MRP matrix) for the biological supertrees
Table 16. The MRL scores (ML scores with respect to the MRP matrix, given as log likelihoods) for the biological supertrees
Table 17. Running times, in minutes, for the biological supertrees
Placental dataset
All methods resulted in identical SumFN scores on this dataset. Unsurprisingly, MRP and SuperFine+MRP methods resulted in the best MRP scores, and MRL(RAxML) and SuperFine+MRL(RAxML) in the best MRL scores. The fastest method was MRP(TNT) (less than a minute) followed closely by SuperFine+MRP(TNT) (less than two minutes), and only the methods that use RAxML took more than five minutes. Due to the large number of source trees, many of which had incompatible edges, the SCM tree was almost completely unresolved. For this dataset, therefore, the SuperFine trees were almost identical to trees obtained using their base supertree methods.
Seabirds dataset
SuperFine+MRP and MRP(TNT) had the best MRP scores and the best SumFN scores. MRL(RAxML) had the best MRL score, followed by SuperFine+MRP(TNT). All methods completed in under a minute for this dataset.
Marsupials dataset
On this dataset, all methods had close SumFN scores. MRP and SuperFine+MRP again had the best MRP scores, and MRL(RAxML) had the best MRL score followed by SuperFine+MRP(TNT). Both MRP(TNT) and SuperFine+MRP(TNT) completed in under a minute, while MRL(RAxML) (the slowest method) completed in under eight minutes.
THPL dataset
For this dataset, SuperFine methods and MRL(RAxML) resulted in much better SumFN scores compared to the other methods. SuperFine+MRP(TNT) resulted in the best MRP score, followed closely by MRL(RAxML). An interesting observation is that MRP methods had, by far, the worse MRP scores. MRL(RAxML) once again resulted in the best MRL score, followed by SuperFine+MRL(RAxML) and SuperFine+MRP(TNT). All SuperFine methods and MRP(TNT) completed in under a minute, but MRL(RAxML) and MRP(PAUP*) were much slower (25 and 32 minutes, respectively).
CPL dataset
All methods resulted in similar SumFN scores. The best MRP scores were obtained by SuperFine+MRP(TNT) and MRP(TNT). The best MRL score was obtained by MRL(RAxML), followed by SuperFine+MRL(RAxML). This dataset is the largest dataset we examined, with 2,228 taxa and 39 source trees, and the six supertree methods differed substantially in terms of their running times. MRL(RAxML) and MRP(PAUP*) were the slowest, finishing in 461 minutes (i.e., more than 7.5 hours) and 675 minutes (i.e., more than 11 hours), respectively. MRP(TNT) and SuperFine+MRL(RAxML) were the next slowest, finishing in 30 minutes and 29 minutes, respectively. By comparison, SuperFine+MRP(TNT) completed in less than 4 minutes. Thus, only SuperFine+MRP(TNT) was fast on this dataset.
Summary
Several key observations are noted. First, SuperFine+TNT gave the best results for SumFN, but other than on the seabirds and THPL datasets, all methods produced trees with similar SumFN scores. MRL(RAxML) typically resulted in the best MRL scores, and SuperFine+MRP(TNT) often produced trees with the second best MRL scores. SuperFine+MRP(TNT) also resulted in the best MRP scores for all datasets. SuperFine+MRP(TNT) was among the fastest methods, and on the largest dataset it was substantially faster than any other method. Thus, although we cannot evaluate the topological accuracy of any of these estimated supertrees, SuperFine+MRP(TNT) had a good overall performance for all criteria we evaluate (MRP, MRL, SumFN, and running time).
Discussion and Conclusion
Supertree estimation methods need to be both highly accurate and also reasonably fast, as otherwise they will not be useful in estimating large phylogenies. Our discussion thus addresses both running time and topological accuracy.
The results for the simulated datasets show clearly that all the methods produce trees with about the same accuracy on datasets with very dense scaffolds, but differ substantially in terms of accuracy on the datasets with sparser scaffolds. Since sparser scaffolds are common for biological supertree inputs, the differences in accuracy on sparse scaffolds is important.
In general we found that all the SuperFine variants we studied (whether using MRL or MRP to refine polytomies in the SCM tree) produced very accurate trees, and that differences between them were largely in terms of running time, or with respect to MRL, MRP, or SumFN score. With respect to running time, SuperFine+MRP(TNT) was the fastest of all the methods we studied, finishing in at most 4 minutes on all the datasets (including the largest one with 2228 taxa and 39 source trees). Furthermore, SuperFine+MRP(TNT) produced very good MRP and MRL scores, outperforming TNT and PAUP* with respect to MRP score optimization. On the biological datasets, we also observed similar results, including that SuperFine+MRP(TNT) generally produced very good SumFN scores. Thus, although SuperFine+MRP(TNT) was not designed to be a heuristic for any of these criteria, it has excellent performance across the board.
It is worth discussing in greater depth the results we showed for SumFN scores. Our study shows that neither MRP nor SumFN scores have the best correlation with tree error, except when the scaffold factor is very dense. This result suggests that optimizing MRP or SumFN may not be the best strategy (except with dense scaffolds), and that evaluating supertree methods with respect to SumFN may not be the best way of distinguishing methods (except for dense scaffold datasets, perhaps). These observations were made earlier in [24], but are worth repeating here, because of the increased interest in an approach to supertree estimation proposed by Steel and Rodrigo [15], called "maximum likelihood supertrees". This method is based upon an exponential error model, and can be based upon different ways of measuring distances between trees and weights on the input trees. However, in the simplest case, where the weights on trees are all the same and the distance between trees is the RF distance, finding the ML supertree is identical to optimizing SumRF (minimizing the total topological distance, using RobinsonFoulds scores, to the input trees), a criterion that is almost identical to SumFN. Indeed, when the input estimated trees are binary, these criteria are exactly the same. Since our simulation study estimated supertrees from binary source trees, our correlation analysis also shows that optimizing SumRF is not likely to be the best strategy, except for dense scaffold datasets, and thus suggests that the use of RF distance metric within the ML supertree approach proposed by Steel and Rodrigo may not be appropriate. We note here a potential shortcoming of the ML supertree approach in general: it seems likely that the probability of a particular estimated tree will not depend only on the topological distance it has to the true tree, but rather also on the parameters of the true tree (especially the branch lengths), since very short branches are more likely to fail to be recovered in a phylogenetic estimation than longer branches.
A fundamental observation in this study is that searching for supertrees that optimize the maximum likelihood score under the S2+CAT model improved tree accuracy, a trend that we found quite surprising. The MRP matrix is a collection of partial binary characters defined by the input source trees. When these trees are compatible, the MRP matrix will exhibit no homoplasy at all, a condition under which the MRP solution will yield the true tree. Therefore, when there is no homoplasy, the ML solution under a nocommonmechanism model [33] (in which every combination of edge and site has its own rate parameter) will also produce the true tree, since then ML and MP produce the same trees. However, standard ML models (including the model used in this study) assume i.i.d. rates across sites, which does not yield the same result. Thus, we do not have a theoretical explanation for why optimizing likelihood under S2+CAT should lead to good supertrees. All we can say is that the data suggest that there may be some value (even if only approximate, and perhaps only under some conditions, not yet understood) in using maximum likelihood under this model as an optimization criterion for estimating supertrees. Future work should investigate whether optimizing the MRL score continues to return good solutions when the source trees are estimated from sequences that evolve under more realistic conditions, including indels, heterotachy, and nonstationarity.
As has been noted in [34], supertree analyses are not always able to completely identify the true tree, because the conditions required for such identification include correct source trees and overlap properties that may not be true of any given set of source trees. However, alternatives  such as combined analyses, in which a phylogeny estimation method is applied to a concatenation of the gene sequence alignments  also have only limited guarantees. From a practical standpoint, the evidence suggests that while combined analyses can yield more accurate trees [20,24] than supertree methods, there are conditions in which combined analysis methods cannot be used (e.g., heterogeneous data, including morphology, gene orders, or different types of molecular data), or are simply too computationally intensive. In these cases, improved supertree methods can be important tools in the phylogenetics toolkit.
In summary, this study introduces a new set of supertree methods based upon combining the divideandconquer strategy within SuperFine with fast supertree methods. In particular, the combination of SuperFine with TNT is extremely fast and produces very accurate supertrees, even on the largest datasets we studied. Earlier work [24] showed that SuperFine (based upon MRP, and using PAUP*) came very close to the accuracy of combined analysis based upon maximum likelihood. Future work should investigate statistical approaches to supertree estimation (along the lines of maximum likelihood supertrees, but taking branch lengths or support into account). The combination of SuperFine with such statisticallybased supertree methods might close the gap between combined analysis and supertree methods.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
TW designed the study; NN and SM developed the software; NN performed the study; TW, NN, and SM analyzed the data and wrote the paper. All authors read and approved the final manuscript.
Acknowledgements
This research was supported by two grants from the US NSF (DEB 0733029 and DBI1062335), the John Simon Guggenheim Memorial Foundation Fellowship to TW, an NSERC fellowship to SN, and a David Bruton Jr. Centennial Professorship to TW. We also thank the anonymous reviewers for their helpful comments.
References

Foulds LR, Graham RL: The Steiner problem in phylogeny is NPcomplete.

Wang L, Jiang T: On the complexity of multiple sequence alignment.

Roch S: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard.

Moret B, Roshan U, Warnow T: Sequence length requirements for phylogenetic methods. In Proceedings of the 2002 International Workshop on Algorithms in Bioinformatics (WABI'02), Rome, Lecture Notes in Computer Science (LNCS #2452). Springer Verlag; 2002:343356.

Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T: Rapid and accurate largescale coestimation of sequence alignments and phylogenetic trees.

Liu K, Linder CR, Warnow T: Multiple sequence alignment: a major challenge to largescale phylogenetics. [http:/ / knol.google.com/ k/ kevinliu/ multiplesequencealignmentamajor / ectabesw3uba/ 9] webcite

Liu K, Warnow T, Holder M, Nelesen S, Yu J, Stamatakis A, Linder C: SATéII: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

BinindaEmonds O: Phylogenetic Supertrees combining information to reveal The Tree Of Life. Computational Biology, Kluwer Academic, Dordrecht, the Netherlands; 2004.

Bansal M, Burleigh J, Eulenstein O, FernándezBaca D: RobinsonFoulds Supertrees.

Baum B, Ragan MA: The MRP method. In Phylogenetic Supertrees combining information to reveal The Tree Of Life. Edited by BinindaEmonds ORP. Kluwer Academic, Dordrecht, the Netherlands; 2004:1734.

Burleigh J, Eulenstein O, FernándezBaca D, Sanderson M: MRF supertrees. In Phylogenetic Supertrees combining information to reveal The Tree Of Life. Edited by BinindaEmonds O. Kluwer Academic, Dordrecht, the Netherlands; 2004:6586.

Chen D, Eulenstein O, FernándezBaca D, Sanderson M: MinimumFlip Supertrees: complexity and algorithms.

Ranwez V, Berry V, Criscuolo A, Fabre P, Guillemot S, Scornavacca C, Douzery E: PhySIC: a veto supertree method with desirable properties.

Ranwez V, Criscuolo A, Douzery EJ: SuperTriplets: a tripletbased supertree approach to phylogenomics.

Swenson M, Suri R, Linder C, Warnow T: An experimental study of Quartets MaxCut and other supertree methods.
Alg Mol Bio 2011., 6(7)
Special issue for selected papers from WABI 2010

Baum B: Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees.

Ragan MA: Phylogenetic inference based on matrix representation of trees.

Kupczok A, Schmidt H, von Haeseler A: Accuracy of phylogeny reconstruction methods combining overlapping gene data sets.

Foulds LR, Graham RL: The Steiner problem in phylogeny is NPcomplete.

Swofford D: PAUP*: Phylogenetic analysis using parsimony (* and other methods) Ver 4. Sinauer Associates, Sunderland, Massachusetts; 2002.

Goloboff P, Farris J, Nixon K: TNT, a free program for phylogenetic analysis.

Swenson M, Suri R, Linder C, Warnow T: SuperFine: fast and accurate supertree estimation.
Syst Biol 2011, in press.
http://sysbio.oxfordjournals.org/content/early/2011/09/16/sysbio.syr092.abstract webcite

Neves D, Warnow T, Sobral J, Pingali K: Parallelizing SuperFine.

Stamatakis A: RAxMLNIHPC: Maximum likelihoodbased phylogenetic analyses with thousands of taxa and mixed models.

Goloboff PA, Pol D: On divideandconquer strategies for parsimony analysis of large data sets: RecIDCM3 versus TNT. [http://sysbio.oxfordjournals.org/content/56/3/485.abstract] webcite

Swenson M, Barbançon F, Linder C, Warnow T: A simulation study comparing supertree and combined analysis methods using SMIDGen.
Alg Mol Bio 2010., 5(8)
Special issue of selected papers from WABI 2009

Stamatakis A, Ludwig T, Meier H: RAxMLIII: a fast program for maximum likelihoodbased inference of large phylogenetic trees.

McMahon M, Sanderson M: Phylogenetic supermatrix analysis of GenBank sequences from 2228 Papilionoid legumes.

Swenson M: Phylogenetic Supertree Methods. PhD thesis. The University of Texas at Austin; 2008.

Ganapathy G, Ramachandran V, Warnow T: On contractandrefinetransformations between phylogenetic trees. In ACM/SIAM Symposium on Discrete Algorithms (SODA'04). SIAM Press; 2004:893902.

Tuffley C, Steel M: Links between maximum likelihood and maximum parsimony under a simple model of site substitution.

Sanderson MJ, McMahon MM, Steel MA: Terraces in phylogenetic tree space.

Beck RMD, BinindaEmonds ORP, Cardillo M, Liu FGR, Purvis A: A higherlevel MRP supertree of placental mammals.

Kennedy M, Page R: Seabird supertrees: combining partial estimates of procellariiform phylogeny.

Cardillo M, BinindaEmonds ORP, Boakes E, Purvis A: A specieslevel phylogenetic supertree of marsupials.

Wojciechowski M, Sanderson M, Steele K, Liston A: Molecular phylogeny of the "temperate herbaceous tribes" of papilionoid legumes: a supertree approach.