Open Access Research

Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions

Daniel Doerr1, Ilan Gronau2, Shlomo Moran3* and Irad Yavneh3

Author Affiliations

1 Center for Biotechnology, Bielefeld University, Bielefeld, Germany

2 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA

3 Computer Science Department, Technion - Israel Institute of Technology, Haifa, Israel

Algorithms for Molecular Biology 2012, 7:22  doi:10.1186/1748-7188-7-22

Published: 31 August 2012

Abstract

Background

Distance-based phylogenetic reconstruction methods use evolutionary distances between species in order to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method.

Results

This paper continues a line of research that seeks to fit, for each given set of input sequences, a distance function that maximizes the expected topological accuracy of the reconstructed tree. We focus here on the systematic error caused by assuming an inadequate model, but also consider the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error, based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura’s two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case and a detailed simulation study on quartet trees.
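As a rough illustration of the Jukes-Cantor case, the sketch below applies the standard textbook Jukes-Cantor and Kimura two-parameter distance formulas to the expected site-difference proportions of a Kimura two-parameter process. It is not the paper's formal definition of deviation from additivity: the function names, the `ratio` parameter (transition rate divided by the rate of each single transversion type) and the simple two-edge "gap" are assumptions made here for illustration only, and sampling noise is ignored by working with expected proportions.

```python
import math

def k2p_expected_proportions(d, ratio):
    """Expected proportions (P, Q) of transitions and transversions between
    two sequences separated by d expected substitutions per site under K2P.
    `ratio` is an illustrative parameter: alpha / beta, where beta is the
    rate of each of the two transversion types."""
    beta = d / (ratio + 2.0)   # total rate alpha + 2*beta equals d
    alpha = ratio * beta
    P = 0.25 + 0.25 * math.exp(-4 * beta) - 0.5 * math.exp(-2 * (alpha + beta))
    Q = 0.5 - 0.5 * math.exp(-4 * beta)
    return P, Q

def jc_distance(p):
    """Jukes-Cantor distance from the total proportion p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

def k2p_distance(P, Q):
    """Kimura two-parameter distance from the two proportions."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

def jc_additivity_gap(d, ratio):
    """JC distance across a path of two edges of length d each, minus the
    sum of the JC distances of the two edges (zero for an additive function)."""
    one_edge = jc_distance(sum(k2p_expected_proportions(d, ratio)))
    two_edges = jc_distance(sum(k2p_expected_proportions(2 * d, ratio)))
    return two_edges - 2 * one_edge

if __name__ == "__main__":
    for ratio in (1.0, 2.0, 10.0):
        print(ratio, jc_additivity_gap(0.3, ratio))
```

With `ratio = 1.0` the data follow the Jukes-Cantor model itself and the gap vanishes up to rounding. With a transition-transversion bias the Jukes-Cantor distance of the two-edge path falls short of the sum of its parts, while the Kimura two-parameter formula still recovers the true distance.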

Conclusions

We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enable statistically inconsistent reconstruction methods to outperform consistent methods.

Keywords:
Phylogenetic reconstructions; Substitution models; Additive substitution rate functions