Abstract
Background
Unigenic evolution is a largescale mutagenesis experiment used to identify residues that are potentially important for protein function. Both currentlyused methods for the analysis of unigenic evolution data analyze 'windows' of contiguous sites, a strategy that increases statistical power but incorrectly assumes that functionallycritical sites are contiguous. In addition, both methods require the questionable assumption of asymptoticallylarge sample size due to the presumption of approximate normality.
Results
We develop a novel approach, termed the Evidence of Selection (EoS), removing the assumption that functionally important sites are adjacent in sequence and and explicitly modelling the effects of limited samplesize. Precise statistical derivations show that the EoS score can be easily interpreted as an expected logoddsratio between two competing hypotheses, namely, the hypothetical presence or absence of functional selection for a given site. Using the EoS score, we then develop selection criteria by which functionallyimportant yet nonadjacent sites can be identified. An approximate power analysis is also developed to estimate the reliability of inference given the data. We validate and demonstrate the the practical utility of our method by analysis of the homing endonuclease IBmol, comparing our predictions with the results of existing methods.
Conclusions
Our method is able to assess both the evidence of selection at individual amino acid sites and estimate the reliability of those inferences. Experimental validation with IBmol proves its utility to identify functionallyimportant residues of poorly characterized proteins, demonstrating increased sensitivity over previous methods without loss of specificity. With the ability to guide the selection of precise experimental mutagenesis conditions, our method helps make unigenic analysis a more broadly applicable technique with which to probe protein function.
Availability
Software to compute, plot, and summarize EoS data is available as an opensource package called 'unigenic' for the 'R' programming language at http://www.fernandes.org/txp/article/13/ananalyticalframeworkforunigenicevolution webcite.
Background
One of the principal reasons for studying molecular evolution is that the function of a novel protein can be deduced, in part, by comparing it with a similar previouslycharacterized protein. But what recourse do we have if the novel protein does not exhibit significant sequence similarity to other proteins? More problematically, what if it is similar only to proteins of unknown function? In practice, even when the novel protein shares regions of extensive similarity to proteins of known function, it may be difficult to elucidate the importance of individual sites in the novel protein.
Unigenic Evolution
One innovative experimental approach that can help identify specific domains or residues required for function is unigenic evolution, first described and developed by Deminoff et al. [1]. Unigenic evolution can be applied to any protein where the loss of function can be used as a selectable phenotype [15].
The procedure consists of random mutagenesis and amplification of a single wildtype sequence via mutagenic polymerase chain reaction (PCR) with subsequent cloning and functional selection [6]. Functional clones are isolated and characterized by DNA sequencing. In contrast to traditional structurebased mutagenesis screening, unigenic evolution experiments produce an unbiased estimate of functionallyimportant residues regardless of putative structural role or conservation.
Deminoff's Analysis
The selection process ensures that, in functional clones, amino acids essential for function will be conserved relative to nonessential sites. However, differential mutation sensitivity can be caused by more than structural or functional constraints. Mutation rates of residues may differ due to differential transition/transversion rates, codon usage, and genetic code degeneracy. To correct for these confounding factors, Deminoff et al. developed a statistical analysis that compared the expected versus the observed mutation frequency for each codon, where the expected frequencies were derived from a population of clones that had not been subject to selection.
Deminoff et al. clearly demonstrated the importance of accounting for nonuniform transition versus transversion probabilities when computing expected mutational frequencies. To increase the inferential power of their analyses, they also developed a 'slidingwindow' χ^{2}analysis, binning together a 'window' of adjacent codons, assuming that residues critical for protein function would be nearby in primary structure. By comparing the probabilities of silent versus missense mutation in these windows, regions of either restrained or excessive mutability were identified as hypo or hypermutable, respectively.
Behrsin's Analysis
The subsequent analysis of Behrsin et al. [7] advanced the statistical framework of Deminoff et al. by improving three key features. These features were (a) the fixed window size of the χ^{2}analysis, and (b) the effect of samplesize on the codon mutation probability, and (c) accounting for multiple nucleotide mutations per codon. First, window size for the χ^{2}analysis was addressed by using windows of different sizes and comparing estimated falsediscovery rates. The 'best' window was selected via tradeoff between the estimated sensitivity and specificity for classifying hypo or hypermutable residues. Second, nucleotide substitution frequencies were computed using the continuity correction of Yates [8] resulting in more consistent codon mutation frequencies. Third, codon mutation frequencies were computed analytically from nucleotide substitution frequencies without the assumption that only one substitution per codon was likely.
Further Improvements
The statistical framework of Deminoff et al. and the modifications suggested by Behrsin et al. allow for the reliable identification of hypomutable regions via unigenic evolution. Nonetheless, these stateoftheart analyses suffer from some deficiencies, from a statistical perspective, that could result in either erroneous or misleading conclusions. The goal of this work is to develop a statistically rigorous method for the analysis of unigenic evolution data, improving upon existing techniques by
1. relaxing the assumption that sample sizes are large enough such that asymptotic normality necessarily applies,
2. relaxing the assumption that selectionsensitive regions of a protein are contiguous,
3. clarifying the relationship between Fisherstyle pvalues and NeymanPearson TypeI and TypeII error probabilities with regard to testing hypotheses of functional selection,
4. relaxing the the assumption that the PCR amplification protocol does not meaningfully affect mutation probabilities, and
5. addressing the ability to of unigenic evolution to detect hypermutability.
We expand upon each of these points, in turn, below.
First, both Deminoff et al. and Behrsin et al. equate observed event relative counts with the respective event probabilities. This equivalence is effectively true when either sample sizes are asymptotically large or probabilities are nonextreme (not too close to either zero or one). However, experimentallyfeasible sample sizes are typically limited to the order of 50100 replicates (clones) and even the most mutagenic of PCR conditions result in low probabilities (≈ 0.001 to 0.01) of point mutation. Therefore it is unlikely that observed counts have a simple relationship with the event frequency, even accounting for the continuity correction of Yates [8]. The difficulty of estimating probability parameters from eventcounts when the likelihood of the event is very small is a wellknown problem from the inference of binomial and multinomial frequency parameters [9]. The most obvious consequence of assuming "counts ≈ probabilities" under these constraints is that the normal approximation, on which the χ^{2 }statistic is critically dependent, may be invalid enough to yield misleading results. At the very least, the sampling variance of the χ^{2 }statistic itself is necessarily quite large. The anticipated parameter ranges above, for example, yield a coefficient of variation for χ^{2 }to be on the order of 100300%. An additional problem with equating counts and probabilities is that, in doing so, the analysis of Behrsin et al. implicitly conditions on the total number of mutations as given. We anticipate that the actual number of mutations would be roughly Poisson distributed, implying that the variance of the mutation count will be on the order of the expected count itself, further degrading the validity (or at least power) of the χ^{2 }statistic to correctly determine the effect of selection.
Second, the assumption that selectionsensitive regions of a protein are contiguous is incorrect because proteins are threedimensional amino acid chains where secondary, tertiary, and quaternary folds bring distantinsequence residues to threedimensional proximity. Perhaps the most widely known example of functionallyimportant yet noncontiguous sites is the catalytic triad of residues in serine proteases such as trypsin [10,11]. Trypsin proteins have three absolutely required residues that form a chargerelay system needed for activity; with respect to the human sequence these are H57, D102, and S195. These three residues are widely separated in sequence, but are adjacent in the 3 D fold of the protein. Furthermore, S195 is surrounded by a conserved set of residues, but H57 and D102 are not. The residues surrounding the H58 and D102 equivalents in other organisms are very different and are drawn from all classes of amino acids. Thus using a 'windowing' procedure to identify hypomutable regions would fail in the H57 and D102 instances since many different amino acids are tolerated adjacent to an absolutely conserved position in a protein family. If selectionsensitive residues cannot be presumed contiguous, it is unclear how sites can be partitioned into 'selected' and 'nonselected' groups while correcting for implicit and combinatoriallyincreasing number of multiple comparisons.
Third, the use of a Fisherstyle hypothesis test to compute a pvalue directly is not equivalent to the estimation of the NeymanPearson TypeI and TypeII error probabilities α and β, respectively [12]. Although often confused in the literature, p and α are not interchangeable. Specifically, Fisher's pvalue expresses the probability of the observed and more extreme data given the null hypothesis, and can be considered a random variable whose distribution is uniform on (0, 1) under that null. In contrast, the NeymanPearson α and β values directly compare the probability of the observed data given either the null or alternate hypotheses as correct. The value of α must be fixed before the observations are made and is subject to the minimization of β. Confusion between α and p results in the systematic exaggeration of evidence against the null hypothesis [1317].
Fourth, current unigenic evolution analyses treat the mutagenic PCR protocol as a 'black box' process that takes a wildtype sequence as input and produces a pool of mutagenized clones as output. However, the physical process of mutagenic PCR via nonproofreading Taq polymerase imposes significant constraints on the amplification process in turn constrains properties of the final clone population. We show herein that the ultimate probabilities of the different classes of codon mutation are complicated, nonlinear functions of the Taq misincorporation probabilities. Given the complicated relationship between Taq misincorporation frequencies and codon mutation frequencies, it is important to determine if the PCR protocol meaningfully constrains observed codon mutation frequencies.
Last, the claim that unigenic evolution can detect hypermutability follows the fact that critical values of the χ^{2 }statistic can be observed due to either too few or too many observed mutation events. However, mutagenic PCR of a nucleotide sequence is an anisotropic 'random drift' through sequencespace. Selection can only act on the drifting sequence by 'slowing' its progression along trajectories that realize lessfunctional mutants, since these lessfunctional sequences are preferentially discarded. Neither mutagenic PCR nor functional selection are capable of 'accelerating' the sequence drift, thus implying that a 'large' number of observed mutations at a given site, while improbable, are not unexpected. We therefore claim that unigenic evolution is fundamentally incapable of detecting hypermutability, positing that unexpectedlylarge sitemutation counts stem from either ordinary sampling variance, or nonmodelled systematic, procedural, or other experimental errors.
The statistical framework described herein provides estimates of both the evidence of selection, and the statistical power available to detect that selection, independently for each codon site. It provides explicit comparisons with internal positive and negative controls, thus reducing the impact of systematic or experimental errors. It identifies individual sites rather than broad regions for followup analysis, and can guide wildtype sequence optimization with regard to unigenic mutability. With its emphasis on analytical rigour and its availability as an easytouse software addin package, this work helps to make the analysis of saturatingmutagenesis experiments both statistically sound and broadly accessible.
Results
To implement the aforementioned improvements we have developed a new method for analyzing data from unigenic evolution experiments. The analytical framework we have developed is uniquely powerful because it has no assumption of normality or other asymptotic approximations; has no requirement that critical residues be contiguous; provides detailed estimates of TypeI and TypeII errors; explicitly models the relationship between polymerase misincorporation probabilities and PCR mutation probabilities; and easily adapts to different codon compositions, nucleotide biases, and genetic codes. Our method, including subroutines for data visualization and summarization, was implemented as a package called unigenic as part of the R statistical software system [18] and is available under an opensource license.
Overall Procedure
Like previous methods, our procedure begins by estimating the 'background' nucleotide mutation frequencies given by the mutagenic PCR on a control population of clones that are not subject to functional selection. This population, termed the 'unselected' population, serves as a control group that describes expected mutation frequencies under the null hypotheses of 'no functional selection'. The nucleotide mutation frequencies of the unselected clones are used to compute synonymous and nonsynonymous mutation probabilities for the codons of the wildtype sequence under the null hypothesis. Finally, the observed number of synonymous and nonsynonymous mutations occurring at each codon of the functionallyselected clones are tested to see whether they are more concordant with the null hypothesis or a generalized alternative.
The overall procedure of using nucleotide mutation frequencies to estimate codon mutation frequencies was first suggested by Deminoff et al. who argued that adequate statistical power could not be realized if only codontriplet mutations were analyzed. During the development of our method, we tested the hypothesis that synonymous and nonsynonymous mutation counts alone have sufficient power to resolve functional versus nonfunctional proteins. Our test used a sitebysite test of multinomial homogeneity as described by Wolpert [19]. A test of multinomial homogeneity in our context is a test that, given the count of synonymous and nonsynonymous mutations at a particular site in both the selected and unselected populations, asks if the the counts are compatible with the hypothesis that the mutation frequencies are equal. This test is unique in that it does not require inferring or comparing frequencies directly. Instead, all possible frequencies that are compatible with the data are considered simultaneously. We found, for the experimental system described below, that codon mutation frequencies alone are insufficient to discriminate between populations. Results of the homogeneity tests are shown in Additional File 1. This lack of ability to discriminate populations underlies the necessity of formulating a null hypothesis via nucleotide mutation frequencies, and confirms the supposition of Deminoff et al..
Additional file 1. Homogeneity Tests are Insufficient to Detect Selection. The necessity of computing the codon mutation frequencies M via nucleotide frequencies P is shown by the lack of statistical power for determining selection purely by codonbycodon comparison of unselected and selected clones. (A) Using the test for such multinomial homogeneity as given by Wolpert [19], the posterior log_{2}oddsratio between hypotheses, ≈ 0.4, implies that they are virtually indistinguishable. (B) The estimated power of such analysis has a posterior log_{2}oddsratio of ≈ 0.05 thereby showing the unsuitability of tests for functional selection that rely only on codonbased mutation counts. Of particular significance is that the the M1 startcodon is not discerned in either selected or unselected population, even though it is absolutely required for protein function in the selected clones and absolutely conserved due to the cloning technique in the unselected population. The complete absence of power at M1 and other sites shows the unsuitability of codon homogeneity to serve as evidence of selection. Note that the additive property of log_{2}oddsratios implies that combining counts for identical codon classes increases the log_{2}oddsratio only linearly, thereby implying that reasonable power cannot be achieved by codonclass analysis either, for the given sample size.
Format: PDF Size: 83KB Download file
This file can be viewed with: Adobe Acrobat Reader
Experimental System
Complete details of the experimental system and data analyzed in conjunction with the development of our method are presented in Kleinstiver et al. [20]. Briey, our experimental system analyzed the 266 amino acid GIYYIG homing endonuclease IBmol[2123], a sitespecific DNA endonuclease consisting of an Nterminal catalytic domain connected by a linker region to a Cterminal DNAbinding domain. The Nterminal domain of ≈ 90 amino acids contains the classdefining GIYYIG motif that is highly conserved in almost all the members of this endonuclease family. IBmol binds to a ≈ 38 bp recognition sequence (the homing site) and introduces a doublestranded DNA break leaving a singlestranded 2nucleotide 3'overhang. We took advantage of the sitespecific endonuclease activity of IBmol in the genetic selection to isolate functional variants after 30 cycles of mutagenic PCR. The genetic selection utilizes a chloramphenicol resistance plasmid (pExp) to express wildtype IBmol or mutant variants, and a second ampicillin resistant compatible plasmid (pTox) that contains the IBmol homing site. pTox also encodes a DNA gyrase toxin under the control of an inducible arabinose promoter. Cells that contain both plasmids only survive plating under selective conditions if IBmol is functional and can cleave pTox. Cleavage of pTox by IBmol generates a linear plasmid that is rapidly degraded, thus removing the gyrase toxin and promoting cell survival. If IBmol is nonfunctional and pTox is not linearized, cells will not survive due to the activity of the gyrase toxin. Under nonselective conditions, all cells will survive regardless of whether IBmol is functional because there is no requirement to cleave the toxic plasmid. Using this genetic selection, we introduced random pointmutations in the IBmol coding region by mutagenic PCR, and isolated and sequenced 87 functional 'selected' clones after plating on selective media. We also isolated and sequenced 87 clones isolated on nonselective media resulting in a pool of 'unselected' clones in order to establish baseline mutagenesis frequencies.
PCR Misincorporation Frequencies
The first step in estimating the nucleotide misincorporation frequencies requires tabulation of observed nucleotide mutations in the unselected population into the 4 × 4 matrix C, where c_{ij }represents the number of observed misincorporations from nucleotide j in the wildtype sequence to clonetype nucleotide i. Experimental nucleotide mutation counts for our IBmol system are presented in Table 1. Note that under even highlymutagenic conditions, C is strongly diagonallydominant, indicating that even under highlymutagenic conditions misincorporations are still relatively rare. The mutation frequency matrix P can be computed from C by normalizing each column of C to sum to one. Doing so, in effect, equates misincorporation counts with relative frequencies. This simple normalization suffers from two drawbacks, however. First, the equivalence of mutation frequencies with relative counts is only true when the product (Σ_{k }c_{kj})p_{ij }is sufficiently large for all i and j, due to the same reasoning which underlies the approximation of the binomial with a normal distribution [24]. Second, this normalization results in only a pointestimate of P and provides no information about the accuracy of the estimate. Given the low frequency of many mutation events, such as the single observed {C ← G} event per 11 223 total {any ← G} events as shown in Table 1, it is doubtful that the condition of 'sufficiently large' holds.
Table 1. Sample Misincorporation Counts Under H_{0}
To remedy these difficulties, the columns of P were presumed to be multinomial probabilities of a 'blackbox' mutagenic process. Given an 'input' wildtype nucleotide j, column j of P gives the multinomial probabilities of the resultant 'output' clonal nucleotide. Under the hypothesis that the output nucleotide depends only on the input nucleotide, each of the four columns of P describe four, independent multinomial distributions.
Estimating multinomial parameters from observed event counts is a wellstudied subject. When any (Σ_{k }c_{kj})p_{ij }is small, as is typical in unigenic evolution data, techniques for multinomial estimation have been thoroughly investigated under various Bayesian frameworks. Following numerous recommendations in the literature [25], each column of P is assumed to be Dirichletdistributed such that our null hypothesis asserts that
where α is a vector of hyperparameters with each component set to 1/2. The justification and derivation of (1) are detailed under the 'Methods' subsection 'Multinomial Estimation'. Equation (1) defines P as a standard linear Markov operator. Let the fourdimensional vector x_{j }denote the frequencies of the wildtype nucleotides A, C, G, or T at site j. In general the wildtype sequence will not display polymorphism, implying that x_{j }is equal to one of [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], or [0, 0, 0, 1]. The action of P on x_{j }, given by standard matrixvector multiplication, results in y_{j }= P_{xj }representing the frequency of nucleotides expected in site j within the unselectedclone population.
Modelling the Polymerase
The main difficulty with accepting the hypothesis that the columns of P describe independent multinomial processes is that this hypothesis is not concordant with inspection of observed data. Deminoff et al. specifically noted correlated differences in mutation events that were attributed to differences between transition and transversion processes. Behrsin et al. observed ostensibly the same phenomenon, noting that complementary misincorporation events, such as {C ← A} and {G ← T}, always had similar counts [[7], Table 1]. Again, this similarity was attributed to differences between the mutagenic mechanisms leading to transition and transversion.
Our observed counts, shown in Table 1, showed a similar pattern. However the reduction to frequencies, as shown in Table
2, suggested the hypothesis that
Table 2. Estimates for the Expected Nucleotide Mutation Frequency.
Errorprone nucleotide incorporation by Taq polymerase can be modelled if we let τ_{ij }denote the relative frequency that the polymerase incorporates nucleotide i against templatenucleotide j. Collecting these misincorporation frequencies into matrix T, we compute the PCR mutation frequencies P as a function of the polymerase misincorporation frequencies T resulting in the secondary null hypothesis
We denote (2) as a 'secondary' null hypothesis since T is estimated using only the unselected clone counts C, thereby still representing the hypothesis of 'no functional selection'. A detailed
description of the model underlying
Codon Mutation Frequencies
Unigenic evolution presumes that selection operates on protein function and not during transcription or translation. Differences in protein function are caused by nonsynonymous amino acid substitution. Therefore the frequencies of nonsynonymous mutation need to be computed from the given frequencies of independent nucleotide mutation. Although Behrsin et al. provide a number of different sample formulas for deriving codon mutation frequencies given nucleotide frequencies, we describe a generalized method of deriving these frequencies for two reasons. First, our framework immediately accommodates different genetic codes, such as mitochondrial or chloroplast. Second, we require ready generalization, beyond the two classes of synonymous or nonsynonymous mutation currently used, for future work.
A codon comprises three contiguous nucleotides within a given reading frame. Since P is assumed to act independently on nucleotides, the mutagenic PCR process for codons is concisely represented by
where M is a 64 × 64 linear Markov operator that operates on the space of codon frequencies and '⊗' denotes the standard Kronecker matrixproduct. An explicit depiction of the Kronecker product and how it relates nucleotide mutation to amino acid mutation is shown in Additional File 2. As with nucleotides, given wildtype codon frequencies w_{j }at codonsite j, the quantity z_{j }= Mw_{j }represents the frequency of sitej clone codons after mutagenic PCR under the null hypothesis.
Additional file 2. The Kronecker Product, Illustrated. An explicit representation of the Kronecker product P ⊗ P. Since mutations in nucleotide sites are assumed independent, the frequency that nucleotide j is mutated to i is p_{ij }. For a second nucleotide, again the frequency that nucleotide l is mutated to k is p_{kl}. Therefore, the joint frequency that both mutations occur is p_{ij}p_{kl}. A third Kroneckermultiplication would result in the 64 × 64 matrix M = P ⊗ P ⊗ P. Being given a third mutation of frequency of p_{mn }yields a final codon mutation frequency of p_{ij}p_{kl}p_{mn}.
Format: PDF Size: 694KB Download file
This file can be viewed with: Adobe Acrobat Reader
The columns of M describe the probabilities that a codon subject to mutagenic PCR will remain identical, mutate synonymously, or mutate nonsynonymously. For example, consider column AGA of M. The standard genetic code translates AGA to arginine, as do the five additional codons CGT, AGG, CGA, CGG, and CGC. Therefore, given AGA as the wildtype codon, M describes the probability of either no mutation (identity) or synonymous mutation as
and the probability of nonsynonymous mutation as p_{ns }= 1  p_{sn}. Such matrix partitioning is simple to code in languages supporting namedindex arrayslicing, such as R [18]. It is also simple to adapt the required bookkeeping to any desired genetic code. For computational efficiency and ease of notation, we denote p_{sn,j }and p_{ns,j }to be the probabilities of synonymous and nonsynonymous mutation at codon j, using the subscripts to clearly differentiate them from the entries of nucleotide mutation matrix P, above.
Unit of Analysis
Rather than using only the two classes of synonymous and nonsynonymous mutation, it is theoretically possible to compare the observed counts for all 20 amino acids at each codon site. Such a comparison reduces M to a 20 × 64 matrix mapping codons to amino acids. Preliminary analyses of the ≈ 100 clones sequenced in each of our selected and unselected populations, however, showed insufficient power to make meaningful inferences at the amino acid level.
Amino acids or synonymous/nonsynonymous mutation are not the only possible units of analysis, however. For example, the assumption that selection operates at the protein level implies that the functional assay is independent of transcription or translation efficiency. Future work could easily test this hypothesis by reducing M to the three classes of 'identical', 'synonymous but not identical', and 'nonsynonymous'. Possibly, even different classes of codons could be defined. The main tradeoff with using more classes for analysis is the requirement for larger sample sizes. However, the basic hypothesistesting framework described below could be used with only trivial modification.
Alternate Hypothesis
The analysis of a unigenic evolution experiment compares observed sitespecific mutation counts between two populations. The first population, the control group, is not subject to selection; these are the 'unselected' clones and they provide the values of p_{sn,j }and p_{ns,j }under the null hypothesis.
The second population is subject to functional selection and results in a set of 'selected' clones. The frequency of mutation at each codon of the selected population provides an alternate hypothesis that can be compared with expectations under the null.
For codonsite j we denote
Likelihood Model
A standard multinomial likelihood model is used to describe the probability of mutation given the total number of clones sequenced and the presumed frequency of mutation. For each codonsite j, this gives
and
with codon mutations presumed to be mutually independent. The conditional hypothesis
H dictates the origin of the probability parameters
Therefore under H_{A }the distribution of parameterpair
as discussed previously. Again, α is a vector of hyperparameters with each component set to 1/2 and the justification and derivation of (7) are detailed under the 'Methods' subsection 'Multinomial Estimation'.
Note that since parameters inferred under H_{A }encompass arbitrary types of selection, they necessarily overlap with parameters under
the null. However, under H_{A }parameters
Lastly, we test both selected and unselected populations for consistency between hypotheses since doing so treats the unselected population as a negative control. This helps identify possible systematic or experimental errors during cloning or sequencing, thereby reducing false signals of selection.
Evidence of Selection
Given prior probabilities Pr(H_{A}) and Pr(H_{0}) that either H_{A }or H_{0 }is true for site j, Bayes' Theorem in conjunction with standard likelihood ratios can be used to compute
the relative probabilities of the hypotheses at codon j, given the data and parameter values. UsuallyPr(H_{A}) and Pr(H_{0}) are set equal to each other in the absence of other information. However, parameter values are not known precisely but have posterior distributions. Taking logarithms of (8) and integrating over these posteriors results in
where
giving
the expected logoddsratio of H_{A }versus H_{0 }given the number of observed mutations for codon j in clonepopulation k. We call
The EoS score is the primary criterion used by Kleinstiver et al. [20] to identify functionallyrelevant residues in IBmol via unigenic evolution. The EoS scores for the first 88 sites of IBmol are shown in Figure 1.
Figure 1. The Evidence of Selection for IBmol. Posterior log_{2}oddsratio EoS score
Although (9) is derived using similar principles to Kullback's and Leibler's information
divergence [26], a similarity exploited below, it is not obvious why integrating the logarithm of
(8) over the posterior parameter estimates is a statistically valid procedure. Briey,
this integration is justified because logoddsratios are isomorphic to standard Euclidean
vector spaces [2729] and can be 'added' together in a meaningful manner that is consistent with the fundamental
laws of probability. For example, consider two different pointvalues for
as expected. Such 'addition' holds over very general conditions, and underlies not
only the validity of (9) for computing
The interpretation of logoddsratios in the literature is traditionally taken as
the 'strength of evidence' favoring one hypothesis over another [31,32]. However, since H_{A }describes a general case that embeds H_{0 }as an alternative, this traditional interpretation is inappropriate for
•
•
•
The interpretation of
A concrete example of interpreting
Additional file 3. Details of the GIYYIGDomain. Numerical details of the GIYYIG motif greyhighlighted in Figure 1. 'EoS' refers to the
Format: PDF Size: 31KB Download file
This file can be viewed with: Adobe Acrobat Reader
This example highlights how the lack of evidence of selection does not imply a lack of functional importance. Lack of evidence is precisely that: there is not enough data to classify a given site as either 'important' or 'unimportant'. Often, as can be seen with the GIYYIG motif, many codons are intrinsically resistant to nonsynonymous changes under mutagenic PCR. The glycine residues, for example, can be seen to have had five mutations at site 4 in the selected population but less than one of them is expected to be nonsynonymous. Much of the reason that selection is detectable at the tyrosine residues is the large number of expected nonsynonymous mutations, a number principally dependent on the tyrosine codon's nucleotide composition. The number of nonsynonymous mutations expected for different clone population sizes is shown in Additional File 4 and is described more fully later.
Additional file 4. The Expected Number of Nonsynonymous Misincorporations Percentiles for the expected number of nonsynonymous mutations under the null hypothesis of 'no selection' for different clone population sample size, given misincorporation frequencies estimated by the unselected population counts shown in Table 1. Of particular importance is the wide range of 'Pr(NS)', the estimated probability of nonsynonymous mutation. This probability ranges from 0.0056 to 0.0633 per codon, an 11.2fold difference. 'Q02', 'Q50', and 'Q98' represent the 2%, 50%, and 98% binomial percentiles, respectively, indicating that the observed number of nonsynonymous mutations under H_{0 }is 96% likely to be within the indicated range. Codons resistant to nonsynonymous mutation, such as alanine and glycine, show obvious nonnormality for even between 200500 sequenced clones.
Format: PDF Size: 58KB Download file
This file can be viewed with: Adobe Acrobat Reader
We note an advantage of our method over previous work is shown by examining
Power and Reliability
The similarity of (9) to a KullbackLeibler divergence can be exploited to estimate the statistical power for inferring selection at a given site. The integrand of (9), for given parameter values and where Pr(H_{A}) = Pr(H_{0}), is the logratio of multinomial likelihoods. Taking the expectation of this logratio over the space of all possible data given H_{A }yields a pointestimate of the KullbackLeibler divergence
where
which can itself be integrated over the multinomial parameter posteriors, as before,
to yield the overall expected divergence
Another way of interpreting the persite values of
Figure 2. The Reliability of Inference for IBmoI. Estimates of statistical power (reliability of inference) for IBmol, for both selected and unselected clone populations. 'Nonrandom' associations are presumed due to the alternate hypothesis while 'random' associations are presumed due to the null. Power is computed as the expected log_{2}oddsratio for either truepositive versus falsenegative (mauve) or truenegative versus falsepositive (teal). (A) For the selected sequences, the truepositive ratio is strongly correlated with the posterior log_{2}oddsratio shown in Figure 1. (B) Approximate inferential power of sites in the unselected population, each of which is expected a priori to be near zero. Note that both M1 and E83, identified as putative outliers in Figure 1, are shown to clearly (and unexpectedly) violate the null hypothesis. Such violations hallmark systematic errors or experimental artifacts.
Effect of Sample Size
To help elucidate the practical effects of sample size on EoS values with respect to both the unselected and selected clone population, subsampled clone populations were analyzed with results displayed in Additional File 5. In brief, even as few as 10 unselected clones (yielding 100300 nucleotide misincorporation counts) were capable of giving reasonable estimates of parameter matrix T and sensitivity. Reasonable specificity however, as judged by the ability to correctly detect selection of the methionine start signal, was not achieved with fewer than all 87 of the unselected clones. With respect to the misincorporation frequencies estimated from the counts in Table 1, percentiles of the likely number of nonsynonymous mutations observed for given clone population sizes under the null hypothesis of 'no selection' are shown in Additional File 4. This table shows considerable nonnormality that is particularly pronounced for mutationresistant codons, highlighting the requirement (and opportunity) to 'tune' the effective selection pressure on individual residues by adjusting codon composition. Again, since normality is a requirement for the validity of χ^{2}based statistics, the nonnormality displayed by many residues even under very large sample sizes (> 500 clones) calls the validity of such analyses into question.
Additional file 5. The Effect of Sample Size for IBmol. The effect of differing selected and unselected clone population sample sizes on the power of inference. Subsamples of 5, 10, 20, 40, and 87 (all) clone populations were analyzed as per Figure 1 and shown using identical axis scales, with the 8787 plot therefore identical to Figure 1. All populations are subset inclusive, meaning that the 10sample subset contained all sequences of the 5sample subset, and so on. Approximate nucleotide misincorporation frequencies can be estimated by dividing the counts shown in Table 1 as appropriate. We note that even using only 5/87 unselected clones to estimate parameter matrix T resulted in qualitatively similar EoS values (red) for all 87clone selected populations. Unselected clones were critical, however, in estimating falsepositive (blue) rates, with all 87 unselected clones being required to detect the methionine startsignal.
Format: PDF Size: 621KB Download file
This file can be viewed with: Adobe Acrobat Reader
Global Insights from Local
The derivation of
Selecting Selected Sites
One of the major shortcomings of previous work was the difficulty of discerning which groups of sites were under functional selection given statistical procedures that were designed under the assumption of siteindependence. Given n codons with 'sufficientlyhigh' EoS score, there are n! different ways to partition those n codons into functionallyimportant and functionallyunimportant categories and thereby estimate the falsediscovery rate. This huge number of partitions implies that traditional techniques for multiplecomparison correction reduce statistical power to impractically low levels.
The 'windowing' analyses of Deminoff et al. and Behrsin et al. were used to constrain the number of required multipletest corrections to a reasonable level.
The principal benefit of using
can be interpreted as the log of relative probability that all sites in J were observed due to the action of selection as hypothesized by H_{A}. Unlike traditional multipletest corrections such as Bonferroni's,
The benefit of using
Note that
We further note that neither
Protein Mutation Count
For the experimental conditions used herein, nonsynonymous mutation probabilities varied between ≈ 010%, depending on the given codon. These mutation probabilities in turn determine κ, the total number of nonsynonymous mutations expected in the overall protein. The overall mutation count is an important experimental diagnostic since too few mutations lead to inefficient mutagenesis and high sequencing costs, while too many mutations result in nearlycertain functional knockout. The simplest method of computing the distribution of κ uses standard Monte Carlo techniques to perform in silico mutagenesis of the wildtype sequence given nucleotide mutation parameters under H_{0}.
The distributions of total protein mutations per clone, both expected and observed for the unselected clones under H_{0 }and observed for the selected clones under H_{A}, are shown in Figure 3. Under H_{0 }> 96.7% of clones are expected to have between 416 nonsynonymous mutations, inclusive, within the 266 amino acid sites of IBmol. Expected and observed distributions under H_{0 }were very similar, especially considering the small samplesize of 87 clones that was available to estimate the distribution. The selected population, under H_{A}, displayed a marked decrease in the number of mutations, with > 95% of clones having seven or fewer mutations.
Figure 3. The Expected Mutation Count Distribution for IBmol. The expected distribution of total nonsynonymous mutations per clone (κ) for IBmol under the null hypothesis of 'no selection', the observed mutation count frequencies for the unselected clones, and the observed mutation count frequencies for the selected clones under the alternate hypothesis. Over 96.7% of clones are expected to have between 416 nonsynonymous mutations, inclusive, within the 266 amino acid sites of IBmol. Expected and observed distributions under H_{0 }were very similar, considering the sample size of 87 clones. The selected population, under H_{A}, displayed a marked decrease in the number of mutations, with > 95% of clones having seven or fewer mutations.
Note that like previous analyses, both our null and alternate hypotheses assume siteindependence among mutations. Secondsite suppression or other modes of siteinteraction are not taken into account. Under rare conditions for IBmol, it is possible that up to 20 mutations could be expected for this 266site protein, making it likely that interaction effects are nonnegligible factors of selection.
Comparison with Previous Work
A comparison of our EoS values with the χ^{2}based statistics of previous work is shown in Figure 4. In a broad sense, nonbinned sitespecific χ^{2 }statistics with one degree of freedom and EoS values appear to be highly predictive
of one another. However, their interpretation and use are very different, especially
with respect to sensitivityvs.specificity and multiplecomparison correction. For example, the single degree of
freedom used in Figure 4 dictates a wide sampling variance for the
Figure 4. A Comparison of EoS values with Prior Work. Comparison of the EoS value with the
Nonetheless we note that even using the highlysensitive
Another advantage of EoS values over χ^{2}based statistics is illustrated by the nontrivial comparison of the p < 0.05 and 20:1 oddsratio critical values shown in Figure 4. Although detailed interpretations and the differences thereof have already been
discussed, in the particular case of the IBmol data shown in Figure 4 it is important to realize that Bonferroni correction of the
From a theoretical viewpoint, another advantage of the EoS value is that it deals with a very direct statistical question: how likely are the observed data given the model? Unrealistic EoS values are necessary and sufficient to diagnose unrealistic assumptions in the mathematical description of unigenic evolution. In contrast, summary statistics such as χ^{2 }values similarly condition on model accuracy, but necessarily further condition on the statistic being a good indicator of the phenomenon under investigation  in this case, selection. Thus unrealistic values of χ^{2 }can be similarly be attributed to model misspecification, but could also be due to the inadequacy of the chosen test statistic.
Additional Results
Although the main result of this work is the EoS score
Polymerase Versus PCR Modelling
A surprising result was that
This finding is surprising because modelling the effect of the polymerase over multiple
PCR cycles appears to be required in order to produce mutation frequencies, such as
those in Table 2, where complementarybase frequencies are always almost equal. The similarity of
respective
A consequence of this finding is that it implies that the polymerase may not be the major mechanism responsible for nucleotide mutation. For example, cytosine deamination via thermal decomposition during PCR cycling [3335] provides a credible, alternative mutation mechanism. In this case the nonproofreading property of Taq would be a more important factor than its misincorporation characteristics.
Optimizing the experimental conditions under which mutagenesis occurs can have important consequences in the efficiency and expected outcome of unigenic evolution. For example, note the ≈ 100fold difference between {C ← G} and {G ← A} mutation frequencies shown in Table 2. Small changes in mutation frequencies due to different mutagenic protocols could effect large changes in the distribution of mutant codons, suggesting that future work should investigate mechanisms to elucidate the factors affecting the precise characteristics of mutagenic PCR.
Stop Codon Assumption
One of the more significant assumptions of this work is the presumption that the effect of stop codons can be ignored. Unlike other mutations, the appearance of a premature stop codon affects every subsequent codon, negates our assumption of siteindependence, and likely results in a complete loss of function. For the system examined herein and in Kleinstiver et al. [20], we found that the frequency of stop codon production was sufficiently small as to not significantly affect results or interpretation. Given this putatively small effect, a more exact treatment of stop codons and their functional effects are left for future work since a correct and rigorous treatment would likely add considerable algebraic and computational complexity.
Conclusions
From an experimental point of view, the evidence of selection at a given site represents only part of the required information; in general, the reliability of that evidence must also be assessed. Quantifying reliability is important since small sample sizes, mutation events that are too rare to be reliably estimated, and the effects of multiple comparisons can complicate the interpretation of unigenic evolution experiments. Our method, which computes both evidence and reliability, represents a significant advance over previous work since it simultaneously assesses both the evidence of selection and the reliability of inference.
Experimental validation of our methods was provided via analysis of the poorlycharacterized homing endonuclease IBmol, where previouslydescribed methods from the literature were unable to elucidate functionallycritical residues. With the ability to guide the selection of precise experimental mutagenesis conditions, our method makes unigenic analysis a more broadly applicable technique with which to probe protein function.
Methods
Herein we provide technical and implementation details for our analytical framework, the most important of which are (a) the estimation of multinomial frequencies from counts, (b) our model of mutagenic PCR via the action of an errorprone polymerase, and (c) selecting the prior and sampling the posterior of the polymerase misincorporation frequencies.
Multinomial Estimation
The estimation of multinomial frequencies from counts is one of the oldest subjects in statistics. When asymptoticallymany observations are available, both Bayesian and frequentist methods infer nearly identical parameter values, where frequencies are simply proportional to counts. However, when observations are rare, prior beliefs will always, necessarily significantly affect inferred results. These effects, thoroughly described by Jaynes [36], can be understood though a simple example. Suppose that 10 000 clones were sequenced of which 5000 were found to have nonsynonymous mutations at a given site. Using the normal approximation to the binomial distribution both the mean frequency of nonsynonymous mutation and its standard error are easily computed with high accuracy. However, if only one nonsynonymous mutation had been observed, the actual frequency of mutation is not clear since mutation frequencies of 0.5, 1, or 2 mutations per 10000 clones, a range of 400%, are all realistic and compatible with the given data. Prior belief that the mutation rate should be 'somewhat high' will favor the higher rate, surprise at seeing any mutation would imply the lower rate is more believable.
The consensus in the statistical community is now that there is no 'best' notion of 'prior ignorance' that can be universally considered correct [37,38]. Instead, research has focused on developing methods with precise and wellcharacterized assumptions in order to minimize, in some sense, the influence of prior assumptions on the inference. These wellcharactered assumptions are called 'objective' or 'reference' priors and are the type of prior we choose as a basis for inferring both multinomial nucleotide mutation frequencies and nonsynonymous codon mutation frequencies.
If p represents a set of multinomial frequencies and n a set of observed counts, then Bayes' Theorem tells us that
where Pr(np) is the likelihood of the counts given parameters and Pr(p) is our assumed prior distribution of the parameters before any data is observed. Detailed informationtheoretic studies by Berger and Bernardo [25] found that p ~ Dirichlet(α) with all components of vector α set to 1/2 was a prior that formally minimized the inuence of the prior on the posterior Pr(pn). This specific prior was found to be invariant to reparameterization and is identical to the one derived by Jeffreys [39].
From an experimental viewpoint, invariance to reparameterization is an critical requirement for inferring frequencies from counts since the property implies that the same inference would be made if, for example, relative mutation rates had been estimated rather than frequencies. Any other choice of prior would yield different posterior values of p even when given identical data.
For the multinomial distribution with the objective reference prior above, the posterior has the simple form
Again, only α = 1/2 formally maximizes the information 'extracted' from the counts n. The expected value of these posterior relative frequencies is
for each frequencycomponent i, where ψ denotes the digamma function. Equation (13) is termed the 'natural' parameter mean for p, and when n_{i }is sufficiently large it is approximately equal to n_{i}/Σ_{i }n_{i}. This similarity is evident when comparing the 'Relative Count' and 'Natural Parameter' estimates for nucleotide mutation frequencies in Table 2 where all four multinomial parameter set estimates agree to within 1%. However, for codon mutation counts on the order of 02 observations per 100 clones, differences between estimates can be considerable.
Modelling PCR via Polymerase
Modelling the full mutagenic PCR process requires formally describing both the process of nucleotide misincorporation via polymerase and the action of multiple cycles of denaturation, synthesis, and reassociation that are the basis of PCR. Our model uses a Bayesian framework that explicitly accounts for the rarity of nonsynonymous mutations to infer parameter values of this model.
Polymerase Misincorporation
Mutagenic PCR is composed of of multiple cycles of lowfidelity Taqbased amplification. The model we adopt assumes that under mutagenic conditions [6]Taq polymerase
• induces errors only by nucleotide misincorporation,
• has negligible slippage, stutter, or other errors due to repeats,
• and has siteindependent misincorporation probabilities.
The assumption that slippage and stutter are negligible is substantiated by previous studies of Taq errors [6,34] and visual inspection of our data. The assumption that nucleotide misincorporation events are independent is somewhat stronger since it may be argued that spatial distortions induced by templateadduct mispairing affect subsequent DNA synthesis. However, inspection shows misincorporation events to be sufficiently rare that this effect, if it exists, is of sufficiently small magnitude as to be negligible.
A single nucleotide is represented as the fourdimensional probability columnvector p that comprises the probability of that nucleotide being one of the four nucleotides A, C, G, or T. The action of Taq polymerase on template p is modelled as a linear Markov operator T that pairs adduct q = Tp against the template during synthesis. Polymerase operator T has the explicit representation
where τ_{ij }denotes the probability that adductnucleotide i is basepaired to template nucleotide j.
With this notation, the columns of T sum to one. Further, since misincorporation events are rare, the counterdiagonal components τ_{AT}, τ_{CG}, τ_{GC}, and τ_{TA} are assumed to be ≈ 1, while all other parameters are ≪ 1. We emphasize that a full specification of T requires sixteen parameters and four constraints, yielding twelve independent degrees of freedom. These constraints imply that both matrix T and an an implicit twelveparameter model describing the relative Taq misincorporation frequencies are equivalent.
The Mutagenic PCR Process
A single DNA fragment to be amplified consists of two basepaired strands, the sense strand encoding the protein of interest and its complement the nonsense strand. Let fourvectors s and n represent the nucleotide pair at a single site. Four of the sixteen combinations of s and n represent proper WatsonCrick base pairs while the remaining twelve represent (presumably rare) mispairings. An individual cycle of PCR amplification can then be modelled as the process shown in Figure 5. Although initially basepaired, the two strands are separated and become independent when the DNA is denatured. Taq polymerase proceeds to add stepwise adducts onto each strand in a templatedependent manner. The attachment of a subsequent adduct is independent of the adduct at the previous position. In this way, a newly polymerized nonsense strand is built up using the sense strand as a template. Any mistakes during polymerization result in nucleotide mismatches between the strands which are not repaired because Taq polymerase does not contain a proofreading activity nor are other DNA repair enzymes present in the in vitro reaction. The individual nucleotide sitepairs can thus be considered independently of all other sitepairs, even though they are physically contiguous with other sitepairs on the oligonucleotide.
Figure 5. A Model of Mutagenic PCR. A model of a single cycle of mutagenic PCR. Each nucleotide of both sense and nonsense strands are treated as probability fourvectors. The 'state' of a nucleotide is the relative frequency we expect to observe it as either A, C, G, or T. The initial wildtype sequence is presumed to be welldefined. An example PCR cycle begins with A and T on the sense and nonsense strand, respectively, that are subsequently separated via denaturation. Errorprone polymerization of new nonsense and sense strands by Taq polymerase have, for this example, probabilities of 0.3 and 0.1 of nucleotide misincorporation. Random reassociation at the end of the cycle effectively averages the frequency of mutation at each site and implies that nucleotide frequencies are statistically independent despite being physically contiguous on the same strand. After the final PCR cycle completes a random sample of the DNA strands are selected for cloning, a fraction of which are subjected to selection based on the nucleotide sequence of the sense strand alone.
Mathematically, this single PCR cycle can be described by an operator Φ that acts on the sensenonsense pair (s, n) such that
where T denotes the Taq polymerase operator (14). Starting from the wildtype sensenonsense pair (s_{0, }n_{0}) the probabilistic basepair mixture after k rounds of mutagenic PCR can be computed via the iterative formula
The resultant sensestrand probability vector s_{k }is therefore a highlynonlinear function of the Taq error probabilities. For illustrative purposes we show the elegant Pascaltrianglelike hierarchy resulting from explicit representations of {2^{k}s_{k}} for k = 0 ... 7 as follows:
For practical computation, however, equation (16) should be used to avoid excessive underflow and truncation errors.
Given a polymerase matrix T, the overall multicycle mutagenic PCR process can be described by computing s_{k }under four different conditions, namely the condition that s_{0 }was precisely one of A, C, G, or T. Let
where, again, each column sums to one. Given a wildtype nucleotide sensestrand statevector w_{s }and Taq error probabilities T, operator P transforms w_{s }into m_{s }= Pw_{s}, where m_{s }is the probability statevector for a mutant sensestrand after k cycles of mutagenic PCR.
Under the null hypothesis of 'no selection', the likelihood of misincorporation counts C given mutation probabilities P(T) is given by the product of the four multinomial distributions
where
and each frequency p_{ij }is complicated, nonlinear function of the misincorporation frequencies T. Since the entries of T, not P, are the fundamental parameters of likelihood (18), the standard Dirichlet prior described above is inappropriate. Instead, a corresponding noninformative 'objective' prior is derived for it, below.
A conceptual flowchart of how parameters T are chosen with respect to counts C is shown in Figure 6. The figure describes in essence how samples of the posterior Pr(TC) are realized. It is worth emphasizing that the sixteen parameters of T possess only twelve degrees of freedom, as previously discussed and shown explicitly in Figure 6, because statistical algorithms must be carefully designed to be correct with regard to such constraints.
Figure 6. Inferring Polymerase Errors from Misincorporation Counts. The conceptual relationship between mutagenic PCR, the unselected clone population, and characteristics of the Taq polymerase. The topleft depicts a wildtype sequence that, when subject to mutagenic PCR, results in a clonal population that has not been selected for functional integrity. The number of misincorporations from a wildtype nucleotide to a clonetype nucleotide is summarized as 'Observed PCR Misincorporation Counts' matrix C. The topright depicts how relative Taq misincorporation rates are implicitly assumed from the 'Expected Polymerase Transition Matrix' T that describes how Taq incorporates nucleotides onto the nascent strand. These misincorporation probabilities are converted to 'Expected PCR Misincorporation Frequencies' P via nonlinear 'PCR Operator' Φ, described in the text, that models the multicycle PCR process. A Bayesian procedure is used to determine the compatibility of the given parameters with the observed data, as described under 'Methods'.
Polymerase Priors and Posteriors
Choosing a prior distribution for likelihood (18) is not trivial, especially since there is no universal notion of "complete prior ignorance" [9,37,40]. Choice of prior is therefore governed by specific criteria assumed by the investigator to be important. One nearly universallyaccepted criterion is that of reparameterizationinvariance, a criterion requiring that the inference should not depend on the units of either the parameters or observations. The importance of such invariance has been detailed by Jeffreys [39], Wallace and Freeman [41], Jermyn [42], and many others. In our context, such invariance ensures that parameterizing the likelihood by, for example, either "the expected number of observed misincorporations per PCR cycle" or its reciprocal "the expected number of PCR cycles before a misincorporation is observed" yield equivalent inferences. Since there is no meaningful physical difference between these two parameterizations it is essential that each yield equivalent results.
Under even highlymutagenic conditions, the error probabilities given by the columns of T are known a priori to be very close to the extremes of either zero or one. Taq polymerase is one of the beststudied polymerases in molecular biology and its error rate is known to be heavily influenced by the precise experimental conditions under which it is used [6,43]. Therefore, since the precise relative scales of the different types of polymerase errors are not known, we derived a prior for T that is 'objective' in the sense of Berger et al. [38]. An 'objective' prior is one that formally minimizes, in an informationtheoretic sense, the influence of the prior on the posterior and is constructed to always be invariant to reparameterization.
An Objective Polymerase Prior
Fortunately, there is a common case where selection of the prior is almost universally accepted, and that is when the model parameters form an Abelian Lie group [9], or in other words, when the parameter space is a commutative group on a differentiable manifold. To see that the 16 parameters of T conform to this special structure, consider the 16 dimensional parameter columnvector
where 'vec' is the standard matrixvectorization operator. Each columnconstraint
of T, namely that it sum to one, is equivalent to removing a particular onedimensional
subspace from span
Removing the specific groups of interest results in the quotientspace
where
The quotientspace (20) which fully describes the 16parameter T is therefore isomorphic to the 12dimensional additive Lie group of ℝ^{12 }which in itself is a Hilbert space [27].
Further justification for assuming that ℝ^{12 }is the 'correct' structure with which to understand T comes from realizing that although the columns of T are parameters in our likelihood function, they are also discrete and finite probability densities in their own right. Each column of T represents the relative frequency of concrete events, namely Taq misincorporations. When normalized, they represent multinomial probabilities; when nonnormalized they represent relative Poisson frequencies.
Our model is only interested in relative relative rates as given by the multinomial model, and it is wellknown that the natural parameter space of the multinomial distribution, as a member of the exponential family, is the logarithm of the multinomial parameters.
Given the special Lie group structure of T, the appropriate prior is generally considered to be that of Jeffreys [39]. The Jeffreys prior J is defined as
where F(T) denotes the Fisher Information Matrix (FIM) of the given likelihood model, in this case given by (18). However, it is not entirely obvious which likelihood model should be utilized. The likelihoods of codon mutation, as given by (5) and (6), condition on the the codonmutation probabilities. However, our nullhypothesis is computed at the nucleotide level because the nullhypothesis states that the selectedclone mutation probabilities are consistent with the unselectedclone mutation probabilities.
Incidentally, we note that although it is widely reported in the literature that Jeffreys' prior fails for 'simple' distributions such as the univariate normal with unknown mean and variance. In fact, Jeffreys' prior results in compatible with frequentist methods if a parameterization with Abelian Lie group structure is enforced.
The parameters of T are therefore estimated by the 4 × 4 matrix of observed nucleotide pointmutations n_{ij }. These counts enumerate the number of times a wildtype nucleotide of type j was observed to be of type i in a sequenced clone. Since mutations are relatively rare, the diagonal n_{ii }elements are expected to be several orders of magnitude greater than the offdiagonals. The likelihood function (18) can be more compactly written as
which, again, is the logsum of four independent multinomial processes. The parameters p_{ij }are the probabilities that wildtype nucleotide j will be mutated to i after k cycles of mutagenic PCR and implicit conditioning that the wildtype sequence is given to be j. For (22) the corresponding 16 × 16 matrix F is defined entrywise as
with the expectation being taken over all possible observations.
Explicit computation of the FIM is straightforward and can be accomplished by noting that
which implies that
Taking the expectation of (25) as per (23) requires only evaluating
where c_{++ }is the total number of observed wildtypetomutantclone nucleotide pairs and p_{j }is the probability of that the wildtype nucleotide is of type j. Estimation of p_{j }is equivalent to determining the ratio of (G+C)/(A+T) content for the wildtype sequence.
The nature of the genetic code, be it standard or otherwise, dictates it be extremely unusual for an organism to code for a protein using a nucleotide sequence with (G+C)content approaching either 0% or 100%. The fraction (G+C)/(A+T) should then be very well approximated by the ratio of the (G+C) to (A+T) counts. Such approximation allows us to assume that the expected (G+C)content is equal to the observed (G+C)content, with the result that (26) can be simplified to
where c_{+j }is the rowvector of columntotals of c_{ij }and can be taken as a given value. Note that in the pedagogical or extremely rare case that (27) is not accurate, the (G+C)content can always be estimated via standard Bayesian methods at the expense of the simplified computation that we utilize. It is also worth noting that mutagenic PCR never incorporates enough nucleotide changes to appreciably change (G+C)content.
The final expression for each entry of the Fisher Information Matrix [F(T)]_{kl,mn }is therefore proportional to the negative of
where each parameter p_{ij }is a function of T, and Jeffreys' prior easily computed via the 12dimensional pseudodeterminant. All that remains is to compute the first and secondorder derivatives of p_{ij }with respect to the 16 entries of T.
These derivatives could be computed analytically. However, since typical experiments use on the order of k ≈ 30 PCR cycles, analytic derivatives of P with respect to T result in unwieldily and numericallyunstable expressions. Instead, secondorder differentiation arithmetic [44,45] is used via operator overloading in Fortran to compute all required derivatives of p_{ij }with respect to T. Briey, differentiation arithmetic computes a function and its gradient and Hessian simultaneously by utilizing the algebra of differential operators. The simple structure of all relevant equations make them particularly amenable toward straightforward implementation.
Sampling from the Posterior
Computing F(T) has the secondary benefit of simplifying the procedure of realizing samples from
the posterior distribution of T as well. This simplification arises from two sources. First, under Jeffreys' prior
the maximum a posteriori (MAP) parameter estimate of T can be used as central pointestimate of T that is invariant to reparameterization [41,42]. The MAP pointestimate is similar, conceptually, to the pointestimate given by
maximumlikelihood methods. Second, given the central MAP estimate of
where
Contrasting Methods
The classical molecular evolution literature suggests two contrasting approaches to the analytical model and method we have described which deserve particular recognition. The first contrast is between models of sequence evolution and the second is between established statistical methods.
Codon Evolution Models
Codonspecific models for classical molecular evolution have been previously described by Goldman et al. [46] and Mayrose et al. [47]. These models seek to describe the combined effects of codon mutation and selection through a continuoustime Markov process by prescribing a constrained form of the infinitesimal Markov generator Q for the hypothesized 64by64 codon substitution matrix M. The principal constraints used enforce the ideas that (a) only singlenucleotide changes may have nonzero rates, and (b) the nonzero substitution rates have a biologicallyrelevant parameterization. In contrast, unigenic evolution proceeds through multiple rounds of mutation from a single ancestor before concluding with a single selective sweep. It therefore describes a process fundamentally different from the combined mutationplusselection process approximated by classical molecular evolution models. Toward this end, we describe unigenic evolution by a discretetime Markov transition model M. Note that since classical molecular evolution must account for hypothetical unobserved states throughout continuous time, it is necessarily parameterized by the instantaneous Markov generator Q. An immediate consequence of this differing parameterization is that two nucleotide misincorporations per codon per "instantaneous mutation step" necessarily have zero probability under the classical molecular evolution model but nonnegligible probability under the unigenic evolution model.
Furthermore, even approximate comparisons between the two models are difficult to describe because the combined mutationplusselection model of classical molecular evolution is necessarily constrained to satisfy the detailbalance relationship Q_{ij}p_{j }= Q_{ji}p_{i}, where p denotes the unique stationary vector of Q. In contrast, the stepwisemutation mechanism of Taq misincorporation that drives unigenic evolution implies that detail balance is not required nor even desirable.
Thus we conclude that the models and processes describing classical molecular evolution and unigenic evolution are sufficiently different to preclude straightforward comparison.
Maximum Likelihood Methods
Both likelihood methods [48] and Bayesian methods have a rich history of use in classical molecular evolution, and the general differences between these approaches have been discussed extensively in both the evolutionary and statistical literature [9]. To a first approximation, likelihood methods are well known to be equivalent to their Bayesian counterparts under the conditions of (a) an asymptotically uniform prior and (b) asymptotically large sample sizes. Under these conditions, the task of parameter estimation is essentially equivalent to the task of hypothesis testing, and different statistical frameworks yield essentially identical inferences.
In unigenic evolution experiments we know a priori that nucleotide misincorporation probabilities are very small. Since our hypothesis tests rely on indirectly estimating the probability of very rare events, it is therefore difficult to justify the assumption of "asymptotically large sample size" required by likelihood techniques. Furthermore, it is well known for these types of multinomialinference problems that uniform priors are generally inappropriate [25,38]. Thus the two key requirements of maximum likelihood theory are not met by our model of unigenic evolution, motivating our decision to use Bayesian methods.
Furthermore, our methods have been developed specifically to test hypotheses about sitespecific selection and not estimate the strength of selection. Unlike likelihood methods, for Bayesian methods the tasks of "model selection" and "parameter estimation" are often not equivalent and can give seeminglyinconsistent inferences without careful analysis (see Kass et al. [31,49] for details). We believe that the most direct comparison with previous work can be done in the context of NeymanPearson testing [12]. For traditional NeymanPearson testing, a critical value of observed nonsynonymous substitutions would be computed for each site based on estimated Taq misincorporation frequencies. If fewer nonsynonymous substitutions than that critical value are observed, that site is classified as being under selection. In some sense this procedure "doubledips" the data; on one hand using observations to infer misincorporation frequencies, and on the other using observations to actually classify the site.
In contrast, our KullbackLeiblerbased approach uses the data only to test hypotheses at each site, integrating over all possible hypothetical data sets that could have been observed given plausible misincorporation rates. More specific and extensive comparisons between the KullbackLeibler approach and the NeymanPearson approach have been extensively studied in the statistical literature, although they are not especially well known [[26], especially pp. 45]. More important than their differences, however, are their similarities. Both compute truepositive/falsenegative and truenegative/falsepositive classification ratios for the null and alternate hypotheses; they simply use different methodological approaches to do so.
Software Availability
Software and sample input and output are available from the authors and are also online as Additional Files 6 and 7.
Additional file 6. Software Package. Source code for an RProject software package that we call 'unigenic'. The code has been tested on Mac OS 10.5 and recent versions of Linuxbased operating systems, and requires that R ≥ 2.9.1 and a modern Fortran95 compiler be available. For help installing R packages, see http://cran.rproject.org/doc/manuals/Radmin.html#Installingpackages webcite.
Format: GZ Size: 26KB Download file
Additional file 7. Sample Input and Output. Sample input, output, and driver files for the given software package.
Format: ZIP Size: 732KB Download file
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
AF conceived of, designed and tested the statistical methods for this study, and drafted the manuscript. DE designed and BK carried out the molecular studies in the laboratory. DE, LW, and GG participated in the design and validation of the study, its subsequent iterative refinement, and editing the manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work was supported by the Natural Sciences and Engineering Research Council of Canada, the Government of Ontario, and the University of Western Ontario. Data analysis was performed using software from the Rproject [18].
References

Deminoff SJ, Tornow J, Santangelo GM: Unigenic evolution: a novel genetic method localizes a putative leucine zipper that mediates dimerization of the Saccharomyces cerevisiae regulator Gcr1p.
Genetics 1995, 141(4):12631274. PubMed Abstract  PubMed Central Full Text

Friedman KL, Cech TR: Essential functions of aminoterminal domains in the yeast telomerase catalytic subunit revealed by selection for viable mutants.
Genes and Development 1999, 13(21):28632874. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

San Filippo J, Lambowitz AM: Characterization of the Cterminal DNAbinding/DNA endonuclease region of a group II intronencoded protein.
Journal of Molecular Biology 2002, 324(5):933951. PubMed Abstract  Publisher Full Text

Zeng X, Zhang D, Dorsey M, Ma J: Hypomutable regions of yeast TFIIB in a unigenic evolution test represent structural domains.
Gene 2003, 309:4956. PubMed Abstract  Publisher Full Text

Behrsin CD, Bailey ML, Bateman KS, Hamilton KS, Wahl LM, Brandl CJ, Shilton BH, Litchfield DW: Functionally important residues in the peptidylprolyl isomerase Pin1 revealed by unigenic evolution.
Journal of Molecular Biology 2007, 365(4):11431162. PubMed Abstract  Publisher Full Text

Cadwell RC, Joyce GF: Mutagenic PCR.
Genome Research (PCR Methods and Applications) 1994, 3(6):S136S140. Publisher Full Text

Behrsin CD, Brandl CJ, Litchfield DW, Shilton BH, Wahl LM: Development of an unbiased statistical method for the analysis of unigenic evolution.
BMC Bioinformatics 2006, 7:150. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Yates F: Contingency Tables Involving Small Numbers and the χ^{2 }Test. [http://www.jstor.org/stable/2983604] webcite
Supplement to the Journal of the Royal Statistical Society 1934, 1(2):217235. Publisher Full Text

Robert CP: The Bayesian choice: from decisiontheoretic foundations to computational implementation. 2nd edition. Springer texts in statistics, New York: Springer; 2001.

Rawlings ND, Barrett AJ: Families of serine peptidases.
Methods in Enzymology 1994, 244:1961. PubMed Abstract  Publisher Full Text

Polgár L: The catalytic triad of serine peptidases.
Cellular and Molecular Life Sciences 2005, 62(1920):21612172. PubMed Abstract  Publisher Full Text

Neyman J, Pearson E: On the Problem of the Most Efficient Tests of Statistical Hypotheses.
Philosophical Transactions of the Royal Society of London, Series A 1933, 231:289337. Publisher Full Text

Hubbard R, Bayarri MJ: Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing.
The American Statistician 2003, 57(3):171178. Publisher Full Text

Berger J: Could Fisher, Jeffreys and Neyman Have Agreed on Testing? [http://www.jstor.org/stable/3182859] webcite
Statistical Science 2003, 18:112. Publisher Full Text

Hubbard R, Bayarri MJ: P Values are not Error Probabilities. [http://www.uv.es/sestio/TechRep/tr1403.pdf] webcite
Tech Rep TR1403 University of Valencia, Department of Statistics and Operations Research; 2003.

Christensen R: Testing Fisher, Neyman, Pearson, and Bayes.
The American Statistician 2005, 59(2):121126. Publisher Full Text

Hubbard R, Armstrong JS: Why We Don't Really Know What Statistical Significance Means: Implications for Educators.
Journal of Marketing Education 2006, 28:114120. Publisher Full Text

R Development Core Team: [http://www.Rproject.org] webcite
R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2009.

Wolpert DH: Determining Whether Two Data Sets are from the Same Distribution.
In Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods Edited by Hanson KM, Silver RN. 1995.

Kleinstiver BP, Fernandes AD, Gloor GB, Edgell DR: A unified genetic, computational and experimental framework identifies functionally relevant residues of the homing endonuclease IBmol.
Nucleic Acids Res 2010, 38(7):241127. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Edgell DR, Shub DA: Related homing endonucleases IBmoI and ITevI use different strategies to cleave homologous recognition sites.
Proceedings of the National Academy of Sciences USA 2001, 98(14):7898903. Publisher Full Text

Edgell DR, Stanger MJ, Belfort M: Importance of a single base pair for discrimination between introncontaining and intronless alleles by endonuclease IBmoI.
Current Biology 2003, 13(11):9738. PubMed Abstract  Publisher Full Text

Edgell DR, Stanger MJ, Belfort M: Coincidence of cleavage sites of intron endonuclease ITevI and critical sequences of the host thymidylate synthase gene.
Journal of Molecular Biology 2004, 343(5):123141. PubMed Abstract  Publisher Full Text

Aitchison J, Shen SM: LogisticNormal Distributions: Some Properties and Uses. [http://www.jstor.org/stable/2335470] webcite
Biometrika 1980, 67(2):261272. Publisher Full Text

Berger JO, Bernardo JM: Ordered Group Reference Priors with Application to the Multinomial Problem.
Biometrika 1992, 79:2537. Publisher Full Text

Egozcue JJ, PawlowskyGlahn V, MateuFigueras G, BarcelóVidal C: Isometric Logratio Transformations for Compositional Data Analysis.
Mathematical Geology 2003, 35(3):279300. Publisher Full Text

Egozcue JJ, PawlowskyGlahn V: Groups of parts and their balances in compositional data analysis.
Mathematical Geology 2005, 37(7):795828. Publisher Full Text

Egozcue JJ, DíazBarrero JL, PawlowskyGlahn V: Hilbert space of probability density functions based on Aitchison geometry.
Acta Mathematica Sinica 2006, 22(4):11751182. Publisher Full Text

Kullback S, Leibler RA: On information and sufficiency.
Annals of Mathematical Statistics 1951, 22:7986. Publisher Full Text

Kass R, Raftery A: Bayes Factors.
Journal of the American Statistical Association 1995, 90(430):773795. Publisher Full Text

Jeffreys H: Theory of probability. 3rd edition. The International series of monographs on physics, Oxford: Clarendon Press; 1961.

Frederico LA, Kunkel TA, Shaw BR: Cytosine deamination in mismatched base pairs.
Biochemistry 1993, 32(26):65236530. PubMed Abstract  Publisher Full Text

Lindahl T: The Croonian Lecture, 1996: Endogenous Damage to DNA.
Philosophical transactions of the Royal Society of London, Series B 1996, 351(1347):15291538. Publisher Full Text

Hofreiter M, Jaenicke V, Serre D, Haeseler Av A, Pääbo S: DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA.
Nucleic Acids Research 2001, 29(23):47934799. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Jaynes ET, Bretthorst GL: [http://www.loc.gov/catdir/samples/cam033/2002071486.html] webcite
Probability Theory: the Logic of Science. Cambridge, UK: Cambridge University Press; 2003.

Bernardo JM, Rueda R: Bayesian hypothesis testing: a reference approach.
International Statistical Review 2002, 70(3):351372. Publisher Full Text

Berger JO, Bernardo JM, Sun D: The Formal Definition of Reference Priors.
Annals of Statistics 2009, 37(2):905938. Publisher Full Text

Jeffreys H: An Invariant Form for the Prior Probability in Estimation Problems.
Proceedings of the Royal Society of London, Series A 1946, 186:453461. Publisher Full Text

Bernardo JM: Reference Posterior Distributions for Bayesian Inference.
Proceedings of the Royal Society of London, Series B 1974, 41:113147.

Wallace CS, Freeman PR: Estimation and Inference by Compact Coding. [http://www.jstor.org/stable/2985992] webcite
Journal of the Royal Statistical Society, Series B 1987, 49(3):240265.

Jermyn IH: Invariant Bayesian estimation on manifolds.
Annals of Statistics 2005, 33:583605. Publisher Full Text

Spee JH, de Vos WM, Kuipers OP: Efficient random mutagenesis method with adjustable mutation frequency by use of PCR and dITP. [http://nar.oxfordjournals.org] webcite
Nucleic Acids Research 1993, 21(3):777778. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Rall LB: Automatic differentiation: techniques and applications. Volume 120. Berlin: SpringerVerlag; 1981.

Griewank A: [http://www.loc.gov/catdir/enhancements/fy0621/99052587d.html] webcite
Evaluating derivatives: principles and techniques of algorithmic differentiation. Philadelphia: Society for Industrial and Applied Mathematics; 2000., 19

Goldman N, Yang Z: A codonbased model of nucleotide substitution for proteincoding DNA sequences.
Molecular Biology and Evolution 1994, 11(5):72536. PubMed Abstract  Publisher Full Text

Mayrose I, DoronFaigenboim A, Bacharach E, Pupko T: Towards realistic codon models: among site variability and dependency of synonymous and nonsynonymous rates.
Bioinformatics 2007, 23(13):i319i327. PubMed Abstract  Publisher Full Text

Goldman N: Statistical tests of models of DNA substitution. [http://dx.doi.org/10.1007/BF00166252] webcite
Journal of Molecular Evolution 1993, 36:182198.
[10.1007/BF00166252]
PubMed Abstract  Publisher Full Text 
Kass RE, Greenhouse JB: Comments on "Investigating therapies of potentially great benefit: ECMO".
Statistical Science 1989, 4:310317. Publisher Full Text