Open Access Software article

Noisy: Identification of problematic columns in multiple sequence alignments

Andreas WM Dress1,2, Christoph Flamm3, Guido Fritzsch4,5, Stefan Grünewald1,2, Matthias Kruspe5, Sonja J Prohaska3,6,7 and Peter F Stadler8,5,9,3,6

1 Department of Combinatorics and Geometry (DCG), MPG/CAS Partner Institute for Computational Biology (PICB), Shanghai Institutes for Biological Sciences (SIBS), Shanghai, PR China

2 Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22-26, D-04103 Leipzig, Germany

3 Institut für Theoretische Chemie und Molekulare Strukturbiologie, Universität Wien, Währingerstraße 17, A-1090 Wien, Austria

4 Institute of Biology II: Zoologie, Molekulare Evolution und Systematik der Tiere, University of Leipzig, Talstrasse 33, D-04103 Leipzig, Germany

5 Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany

6 Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe NM 87501, USA

7 Biomedical Informatics, Arizona State University, PO-Box 878809, Tempe, AZ 85287, USA

8 Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany

9 RNomics Group, Fraunhofer Institut for Cell Therapy and Immunology (IZI), Perlickstraße 1, D-04103 Leipzig, Germany

Algorithms for Molecular Biology 2008, 3:7 doi:10.1186/1748-7188-3-7

Published: 24 June 2008

Abstract

Motivation

Sequence-based methods for phylogenetic reconstruction from (nucleic acid) sequence data are notoriously plagued by two effects: homoplasies and alignment errors. Large evolutionary distances imply a large number of homoplastic sites. As most protein-coding genes show dramatic variations in substitution rates that are correlated along the sequence, this often leads to a patchwork pattern of (i) phylogenetically informative and (ii) effectively randomized regions. Furthermore, alignment errors accumulate in highly variable regions, sometimes producing misleading signals in phylogenetic reconstruction.

Results

We present here a method that, based on assessing the distribution of character states along a cyclic ordering of the taxa, allows the identification of phylogenetically uninformative homoplastic sites in a multiple sequence alignment. Removal of these sites appears to improve the performance of phylogenetic reconstruction algorithms as measured by various indices of "tree quality". In particular, we obtain more stable trees due to the exclusion of phylogenetically incompatible sites that most likely represent strongly randomized characters.
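The idea of scoring a column by how its character states are distributed along a cyclic ordering of the taxa can be illustrated with a small sketch. This is not the noisy implementation. It assumes that a cyclic ordering is already given, that a column is scored by counting state changes between neighbouring taxa on the cycle, and that this count is compared with random shuffles of the same column. The function names, the shuffle count, and the scoring rule are all hypothetical choices made for illustration.

```python
import random


def cyclic_changes(column, order):
    """Count character-state changes between neighbours on a cyclic taxon ordering.

    column: sequence of character states, indexed by taxon
    order:  list of taxon indices giving the (assumed, precomputed) cyclic ordering
    """
    n = len(order)
    return sum(
        column[order[i]] != column[order[(i + 1) % n]]
        for i in range(n)
    )


def column_score(column, order, n_shuffles=200, seed=0):
    """Illustrative score: fraction of random shuffles of the column that need
    strictly more state changes along the cycle than the observed column.

    Values near 1 suggest the states cluster along the ordering; values near 0
    suggest the column is indistinguishable from a randomized one.
    """
    rng = random.Random(seed)
    observed = cyclic_changes(column, order)
    states = list(column)
    more = 0
    for _ in range(n_shuffles):
        rng.shuffle(states)
        if cyclic_changes(states, order) > observed:
            more += 1
    return more / n_shuffles


# Toy example with 8 taxa in the cyclic order 0, 1, ..., 7 (hypothetical data).
order = list(range(8))
clustered = "AAAACCCC"    # states form two contiguous blocks on the cycle
alternating = "ACACACAC"  # states change at every step on the cycle

print(cyclic_changes(clustered, order), column_score(clustered, order))
print(cyclic_changes(alternating, order), column_score(alternating, order))
```

Under these assumptions, a column whose score falls below a chosen cutoff would be flagged as a candidate for removal. The cutoff itself would have to be tuned and is not specified here.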

Software

The computer program noisy implements this approach. It can be employed to improve phylogenetic reconstruction, with a considerable success rate, whenever (1) the average bootstrap support obtained from the original alignment is low, and (2) the data set contains sufficiently many taxa (at least, say, 12 to 15). The software can be obtained under the GNU Public License from http://www.bioinf.uni-leipzig.de/Software/noisy/.
