Abstract
Recently one step mutation matrices were introduced to model the impact of substitutions on arbitrary branches of a phylogenetic tree on an alignment site. This concept works nicely for the fourstate nucleotide alphabet and provides an efficient procedure conjectured to compute the minimal number of substitutions needed to transform one alignment site into another. The present paper delivers a proof of the validity of this algorithm. Moreover, we provide several mathematical insights into the generalization of the OSM matrix to multistate alphabets. The construction of the OSM matrix is only possible if the matrices representing the substitution types acting on the character states and the identity matrix form a commutative group with respect to matrix multiplication. We illustrate this approach by looking at Abelian groups over twenty states and critically discuss their biological usefulness when investigating amino acids.
Keywords:
Maximum likelihood; Maximum parsimony; Substitution model; Tree reconstruction; Group theoryBackground
Alignments of homologous sequences provide fundamental materials to the reconstruction of phylogenetic trees and many other sequencebased analyses (see, e.g., [1,2]). Each alignment column (site) consists of character states that are assumed to have evolved from a common ancestral state by means of substitutions. Any combination of the character states in the aligned sequences at one alignment column represents a socalled character[3], which is sometimes also called site pattern[4]. Given a phylogenetic tree and an alignment that evolved along the tree, Klaere et al. [5] showed, for binary alphabets, how a character changes into another character if a substitution occurs on an arbitrary branch of the tree. The impact of such a substitution is summarized by the socalled One Step Mutation (OSM) matrix. The OSM matrix allows for analytical formulae to compute the posterior probability distribution of the number of substitutions on a given tree that give rise to a character [5].
Nguyen et al. [4] extended the concept of the OSM matrix to the fourstate nucleotide alphabet while developing a method, the Misfits algorithm, to evaluate the goodness of fit between models and data in phylogenetic inference. There, the OSM matrix is constructed based on the Kimura three parameter (K3ST) substitution model [6]. Nguyen et al. [4] illustrated how one can apply the Fitch algorithm [7] to compute the minimal number of substitutions required to change one character into another character under the OSM setting. In the present paper, we deliver a proof of the validity of this algorithm.
In addition, the OSM matrix can be constructed only if the matrices representing the substitutions, the socalled substitution matrices, and the identity matrix form a commutative or Abelian group (see, e.g., [8]) with respect to matrix multiplication [4]. The link between Abelian groups in phylogenetic models has been studied before, most notably by Hendy et al. [9]. Further, an extension of nucleotide substitution models with an underlying Abelian group to joint states at the leaves of a tree has also been studied by other authors. Bashford et al. [10] introduced an approach very similar to OSM to study the multitaxon tensor space. Bryant [11] also introduced a very similar framework to study the Hadamard transform of [12] in the light of multitaxon processes.
In this work, we first introduce standard phylogenetic notation. We then formalize the construction of the OSM matrix, and which part of its construction is used in the Misfits algorithm. We further present possible extensions of the OSM framework to arbitrary alphabets. We will show that the Misfits algorithm in fact computes the minimal number of substitutions needed to change one character into another character. Moreover, we discuss the extension of the algorithm to substitution models which do not have an underlying Abelian group. Finally, we discuss the Abelian groups available for amino acids.
Notation and problem recapitulation
Notation
Recall that a rooted binary phylogenetic Xtree is a tree
Furthermore, recall that a characterfis a function
An extension of fto
Construction of the OSM matrix
We now introduce the OSM framework in a stepwise fashion. The aim of the OSM approach
is to determine the effects a single mutation occurring on a rooted tree
The first task of this approach is to formalize the term mutation and its effects
on a single character state in
C1. For all
This guarantees that a mutation affects a character state in a unique fashion. It is wellknown that any bijective function on a finite discrete state set is a permutation (e.g., [13]). Thus, a mutation is a specific instance of a permutation applied to a character.
The next step is to select the set Σof admissible permutations acting on
C2. For every pair
C3. For all σ_{1},σ_{2} ∈ Σ also the product σ_{1} ∘ σ_{2} ∈ Σ. Mathematically speaking, Σis closed with respect to concatenation of its permutations.
C4. For all σ_{1},σ_{2}∈Σwe have
C5. There is an element σ_{0} ∈ Σ such that for all σ_{1} ∈ Σ we have
C6. For every σ_{1} ∈ Σ there exists a σ_{2} ∈ Σ such that σ_{1} ∘ σ_{2} = σ_{0}. Mathematically speaking, for every element of Σthere exists an inverse element. This guarantees that every permutation can be reversed within a single step.
C7. For all σ_{1},σ_{2},σ_{3} ∈ Σ we have σ_{1} ∘ (σ_{2} ∘ σ_{3}) = (σ_{1} ∘ σ_{2}) ∘ σ_{3} = σ_{1} ∘ σ_{2} ∘ σ_{3}, i.e. the associative law holds.
It should be noted that any set of permutations is associative, i.e. satifies C7.
Thus, for a set of permutations Σ to be Abelian with a regular action on
In the following, we consider the matrix representation of permutations. A permutation
matrix over
Example 1.In genetics, the most commonly used character state set is
The KleinFourgroup consists of the four Kronecker products of these two matrices, i.e.s_{0} = τ_{0} ⊗ τ_{0}, s_{1} = τ_{1} ⊗ τ_{0}, s_{2} = τ_{0} ⊗ τ_{1}, and s_{3} = τ_{1} ⊗ τ_{1}. The Kronecker products here yield 4×4 matrices, e.g.,
The setΣ_{K3ST}:={s_{0}s_{1}s_{2}s_{3}} coincides with the substitution matrices under the Kimura 3ST model[6]. In particular,s_{1}describes transitions within purines (AG) and pyrimidines (CT), s_{2}represents transversions within pairs (AC) and (GT), ands_{3}represents the remaining set of transversions within pairs (AT) and (CG).
The second Abelian group over four states, the cyclic group
Note that there are actually six different fourcycles for
The next step in constructing the OSM matrix is to construct a set
This can also be described by the Kronecker product, i.e. equally
This means that there are r^{n}different operators in Σ^{n}=Σ⊗⋯⊗Σ.
Remark 1.Therefore, for any pair of characters
Another noteworthy consequence of using the Kronecker product is that the elements
of Σ^{n} are permutations over
In the OSM framework we assume that the permutations acting on a character
If a permutation acts on an interior edge e, then it simultaneously acts on the states of all descendant taxa of e, i.e. all those taxa whose path to the root passes e. E.g., assume Taxa 1 and 2 form a cherry, i.e. their most recent common ancestor, 12, has no other descendants, and permutation σ_{i}∈Σ, i=1,…,r−1 is acting on the edge leading to this ancestor. Then, we get the permutation
This shows in particular that a Kronecker product of some permutations acting on each character state is equivalent to the matrix product of the permutations acting on the entire character. The right hand side equation shows that a single permutation on an internal edge has the same effect as simultaneously applying the same permutation on the pendant edges of all descendant taxa. In other words, if de(e) denotes the set of descendants of edge e, and σ_{i}∈Σ, then
Note that the set Σ^{X}of all permutations acting on the pendant edges is a generator of Σ^{n}, i.e. the closure of Σ^{X} contains all permutations in Σ^{n}. Since Σ^{n} contains a single permutation to transform character
where r is the number of states in Σ.
For every Xtree
Example 2.Regard the K3ST model from Example 1 and the rooted twotaxon tree depicted in Figure1a. With this
Each permutation which acts on the characters is thus a symmetric 16×16 permutation
matrix depicting a transition (s^{e,1}), transversion 1 (s^{e,2}), or transversion 2 (s^{e,3}) along edge
Figure 1. Construction of the OSM matrix. (a) A rooted tree with taxa 1 and 2. (b) A transition s_{1}on the left branch e_{1}(the red branch) changes a character into exactly one new character as depicted by
the red horizontal stripe cells of the permutation matrix
We are now in a position to recall the definition of the OSM matrix
The transformation problem
With the construction of
Algorithm 1
INPUT: rooted binary phylogenetic tree
STEP 1: Using Remark 1, find the substitution type σ_{i}which translates f_{j}into
STEP 2: Let c:=c_{1}…c_{1}be a constant character on X with
STEP 3: Calculate
OUTPUT:m.
We prove the correctness of our algorithm. In our framework, m corresponds to the minimum number of permutations σ_{1},…,σ_{m}∈Σ such that σ_{1}⊗⋯⊗σ_{m}(f)=f^{d}. In this form, m has multiple equivalent interpretations. It is the length of the shortest path between
f and f^{d} in the Cayley graph for
Example 3.Figure2demonstrates how Algorithm 4 works under the K3ST model, i.e. when the group isΣ = Σ_{K3ST} (Figure2a). Consider the rooted fivetaxon tree in Figure2band the character GTAGA at the leaves. Assume that the character GTAGA is to be converted into character ACCTC. By comparing the two characters positionwise, we need a substitutions_{1}on the external branch leading to taxon 1 to convert G into A at the first position. Similarly, we need a substitutions_{1}on the external branch leading to taxon 2, and a substitutions_{2}on every external branch leading to taxa 3, 4, and 5. Thus, the operations:=(s_{1}s_{1}s_{2}s_{2}s_{2}) transfers the character GTAGA into the character ACCTC. As the operation s also translates the constant character AAAAA into GGCCC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the tree to obtain the character GGCCC at the leaves. The Fitch algorithm[7]applied to the character GGCCC with the constraint that the character state at the root is A produces a unique most parsimonious solution of two substitutions as depicted by Figure2c.
Figure 2. Computing the minimal number of substitutions to translate a character into another one. (a) depicts the Kleinfour group Σ_{K3ST}, which consists of the identity s_{0}and the three substitution types s_{1},s_{2},s_{3}from the K3ST model. (b) In order to convert the character GTAGA into ACCTC under Σ_{K3ST}, we need to introduce the operation s:=(s_{1},s_{1},s_{2},s_{2},s_{2}). As the operation s also translates the constant character AAAAA to GGCCC, converting GTAGA into ACCTC is equivalent to evolving the character state A at the root along the tree to obtain the character GGCCC at the leaves. The Fitch algorithm applied to the latter produces a unique most parsimonious solution of two substitutions as depicted by (c).
Results
The impact of parsimony on the estimation of substitutions
In this section, we provide some mathematical insights into the role of maximum parsimony in the estimation of the number of substitutions needed to convert a character into another one as explained above. In particular, we deliver a proof for Algorithm 4.
Theorem 1. Let
Proof
Let f, f^{d}, X,
Considering the underlying tree
Now we show that it is equivalent to consider
Next, let
On the other hand, we can use the commutativity of the underlying Abelian group to derive
So altogether we have
and therefore
The minimum number of substitutions to change ffrom f^{d} on
Informally speaking, the idea is as follows: As there is exactly one path from the root ρ to any taxon x∈X, we wish to determine whether we can ‘pull up’ some of the operations along this path in order to affect more than one taxon and still give the same result. This idea has been described above (Equations (2) and (3)), and it coincides precisely with the idea of the parsimony principle.
However, in order to avoid confusion regarding the operation σ as a character on which to apply parsimony, Algorithm 4 instead acts on the constant
character. Clearly, in order to evolve the constant character c:=c_{1}⋯c_{1} on a tree with root state c_{1}, the corresponding operation would be
By the definition of maximum parsimony, when applied to h on tree
The impact of different groups
For any alphabet
Example 4.Recall the starting point of Example 3, i.e. regard the fivetaxon tree T from Figure3b, and the characters f = GTAGA and f^{d} = ACCTC. Now, instead of usingΣ_{K3ST}we use the permutations from the cyclic group
Figure 3. Converting one character into another character using the cyclic group. (a) depicts the cyclic group Σ_{c}, which consists of the identity
Note that variation of the minimum number of substitutions needed to translate a character into another one between different groups is not surprising: As different substitution types are needed to translate one pattern into the other one, depending solely on the underlying group, one group might need the same substitution type for some neighboring branches in the tree and another group different ones. Informally speaking, this would imply that in the first case, the substitution could be “pulled up” by the Fitch algorithm to happen on an ancestral branch, whereas in the second case this would not be possible.
The link between substitution models and permutation matrices
In Examples 1 and 2 we have shown that the K3ST substitution model can be included
into our framework. The connection between the KleinFourgroup and the K3ST model
(as well as the one between the
Most substitution models assume the independence of the different branches of a tree
to compute the joint probability of the characters in
Theorem 2 (Birkhoff’s theorem, e.g.,[20], Theorem 8.7.1).A matrix M is doubly stochastic, i.e., each column and each row of M sum to 1, if and only if for some N < ∞ there are permutation matricesσ_{1}, …, σ_{N}and positive scalarsα_{1}, …, α_{N} ∈ [0,1] such thatα_{1} + ⋯ + α_{n} = 1 and M = α_{1}σ_{1} + ⋯ + α_{N}σ_{N}.
Therefore, the weighted sum of the permutation matrices in
Another, even more useful consequence of Birkhoff’s theorem is the fact that it tells us which substitution models are suited for the OSM approach. If the transition matrix associated with the substitution model is doubly stochastic, then we find a set of permutations which give rise to the model.
Let us see how this influences the symmetric form of the general time reversible model (sGTR) with uniform stationary distribution. It has the transition probability matrix
Assigning permutation matrices to the respective parameters yields the set Σ_{sGTR} with elements s_{0} (identity) and
The weighted sum of the nonidentity elements yields
which is equal to P_{sGTR}if a + b + c + d + e + f=1. Thus, the set Σ_{sGTR}is to sGTR what Σ_{K3ST} is to K3ST. However, Σ_{sGTR} does not satisfy condition C5, because s_{a},⋯, s_{f} are not fixed point free. This can be seen as the main diagonal of s_{a},⋯, s_{f}does not only contain zeros. It is also not commutative (condition C4) as e.g. s_{a}·s_{c}≠s_{c}·s_{a}. And it is not closed under matrix multiplication (condition C3), which means that a concatenation of permutations in Σ_{sGTR} might lead to a new permutation not in Σ_{sGTR}, e.g., s_{a}·s_{f}∉Σ_{sGTR}. Other complex models like TamuraNei [21] do not even permit the decomposition of its transition matrix into the convex sum of permutation matrices. All of this shows why the overall applicability of complex models to the OSM approach is rather limited.
There are other approaches to describe phylogenetic models based on the group structure of their substitution matrices. In particular, Sumner et al. [22] use Lie algebra to construct OSM type matrices for the general Markov model, and discuss shortcomings of the group structure for the general GTR model [23].
Application to other biologically interesting sets
As stated above, OSMtype models require an underlying Abelian group. Thus, the OSM setting is applicable not only to binary data or fourstate (DNA or RNA) data, but also to alphabets of 16 (doublets), 64 (codons), and 20 characters (amino acids) respectively. We compare such extensions to existing biologically motivated binning approaches and discuss their relevance.
As we have shown in the previous sections, the symmetric form of the KleinFourGroup
There are four Abelian groups for twentystate alphabets, namely
Figure 4 shows a heatmap type visualization of an OSMtype matrix on a singleleaf tree where
the coloring of the cells corresponds to the weights given to the 20 permutations
in the respective groups. We see that the coloring pattern nicely reflects the four
cosets of the subgroup
Figure 4. Matrices illustrate the four Abelian groups for a twentystate alphabet. (a) the
Binnings are also done for amino acids, using either biochemical properties or evolutionary divergence. An example of a biochemical binning is the hydrophobic index, where the 20 amino acids are binned into four groups, very hydrophobic, hydrophobic, neutral, and hydrophilic. Unfortunately, this binning does not correspond to any of the proposed Abelian groups. Moreover, it is difficult to derive transitions between these groups just from the biochemical properties.
Transition matrices for evolutionary models for amino acid substitutions are usually generated by counting mutation types in the alignments (see, e.g., [25] for an overview). From these, optimal groupings can be obtained using clustering approaches [26]. The existence of estimates for the transition probability between all amino acids provides the possibility to get further information about betweengroup operations. These groupings could be forced to fit Abelian groups. However, as indicated in [26] a grouping into four groups of five amino acids each is rarely optimal.
Conclusions
In this paper, we provide the necessary mathematical background for the OSM setting which was introduced and used previously [4,19], but had not been analyzed mathematically for more than two character states. Moreover, the present paper also delivers new insight concerning the requirements for the OSM model to work: In fact, we were able to show that mathematically, it is sufficient to have an underlying Abelian group – which shows a generalization of the OSM concept that was believed to be impossible previously [4]. Therefore, we show that OSM is applicable to any number of states.
However, note that the original intuition of the authors in [4] was biologically motivated: The authors supposed that the group not only has to be Abelian, but also symmetric in the sense that each operation can be undone by being applied a second time. Thinking about DNA, for instance, this works: For example, the transition from A to G can be reverted by another substitution of the same type, namely a transition from G to A. This symmetry condition is fulfilled by the KleinFourgroup, but not by the cyclic group on four states.
While the OSM approach can be extended to any number of states, its biological relevance
becomes somewhat obscure when there is no corresponding group which is a power of
Competing interests
The authors declare no competing interests.
Authors’ contributions
All authors contributed equally. All authors read and approved the final manuscript.
Acknowledgements
SK thanks Marston Conder for fruitful discussions on the group theoretical background and Jessica Leigh for enlightening discussions on biochemical and evolutionary binnings of amino acids. This work is financially supported by the Wiener Wissenschafts, Forschungs and Technologiefonds (WWTF). AvH also acknowledges the funding from the DFG Deep Metazoan Phylogeny project, SPP (HA1628/9) and the support from the Austrian GENAU project Bioinformatics Integration Network III.
References

Durbin R, Eddy SR, Krogh A, Mitchison G: Biological sequence analysis  Probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.

Mount DW: Bioinformatics: Sequence and Genome Analysis. New York: Cold Spring Harbor; 2004.

Semple C, Steel M: Phylogenetics. New York: Oxford University Press; 2003.

Nguyen MAT, Klaere S, von Haeseler A: MISFITS: evaluating the goodness of fit between a phylogenetic model and an alignment.
Mol Biol Evol 2011, 28:143152. PubMed Abstract  Publisher Full Text

Klaere S, Gesell T, von Haeseler A: The impact of single substitutions on multiple sequence alignments.
Philos T R Soc B 2008, 363:40414047. Publisher Full Text

Kimura M: Estimation of Evolutionary Distances between Homologous Nucleotide Sequences.
P Natl Acad Sci USA 1981, 78:454458. Publisher Full Text

Fitch WM: Toward defining the course of evolution: Minimum change for a specific tree topology.
Syst Zool 1971, 20:406416. Publisher Full Text

Humphreys JF: A course in group theory. New York: Oxford University Press; 1996.

Hendy M, Penny D, Steel M: A discrete Fourier analysis for evolutionary trees.
P Natl Acad Sci USA 1994, 91:33393343. Publisher Full Text

Bashford JD, Jarvis PD, Sumner JG, Steel MA: U(1)×U(1)×U(1) symmetry of the Kimura 3ST model and phylogenetic branching processes.

Bryant D: Hadamard Phylogenetic Methods and the ntaxon process.
Bull Math Biol 2009, 71(2):339351. PubMed Abstract  Publisher Full Text

Hendy MD, Penny D: A framework for the quantitative study of evolutionary trees.
Syst Zool 1989, 38(4):297309. Publisher Full Text

MacLane S, Birkhoff G: Algebra. Chelsea: American Mathematical Society; 1999.

Bryant D: Extending Tree Models to Split Networks. In Algebraic Statistics for Computational Biology. Edited by Pachter L, Sturmfels B. Cambridge: Cambridge University Press; 2005.

Kimura M: A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences.
J Mol Evol 1980, 16:111120. PubMed Abstract  Publisher Full Text

Horn RA, Johnson CR: Topics in matrix analysis. New York: Oxford University Press; 1991.

Steeb WH, Hardy Y: Matrix Calculus and Kronecker Product: A Practical Approach to Linear and Multilinear Algebra. Singapore: World Scientific Publishing; 2011.

Cayley A: Desiderata and Suggestions: No. 2. The Theory of Groups: Graphical Representation.
Am J Math 1878, 1(2):174176. Publisher Full Text

Nguyen MAT, Gesell T, von Haeseler A: ImOSM: Intermittent Evolution and Robustness of Phylogenetic Methods.
Mol Biol Evol 2012, 29(2):663673. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Horn RA, Johnson CR: Matrix analysis. Cambridge: Cambridge University Press; 1990.

Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees.
Mol Biol Evol 1993, 10(3):512526. PubMed Abstract  Publisher Full Text

Sumner JG, Holland BR, Jarvis PD: The Algebra of the General Markov Model on Phylogenetic Trees and Networks.
Bull Math Biol 2012, 74(4):858880. PubMed Abstract  Publisher Full Text

Sumner JG, Jarvis PD, FernandezSanchez J, Kaine B, Woodhams M, Holland BR: Is the general timereversible model bad for molecular phylogenetics?
Syst Biol 2012, 61(6):10691074. PubMed Abstract  Publisher Full Text

Keilen T: Endliche Gruppen. Eine Einführung mit dem Ziel der Klassifikation von Gruppen kleiner Ordnung.
2000.
[http://www.mathematik.unikl.de/wwwagag/download/scripts/Endliche.Gruppen.pdf webcite]

Kosiol C, Goldman N: Different Versions of the Dayhoff Rate Matrix.
Mol Biol Evol 2005, 22(2):193199. PubMed Abstract  Publisher Full Text

Susko E, Roger AJ: On Reduced Amino Acid Alphabets for Phylogenetic Inference.
Mol Biol Evol 2007, 24(9):21392150. PubMed Abstract  Publisher Full Text