Estimating the evidence of selection and the reliability of inference in unigenic evolution
* Corresponding author: Andrew D Fernandes andrew@fernandes.org
1 Department of Biochemistry, The University of Western Ontario, N6A 5C1 Canada
2 Department of Applied Mathematics, The University of Western Ontario, N6A 5B7 Canada
Algorithms for Molecular Biology 2010, 5:35 doi:10.1186/1748-7188-5-35
Published: 8 November 2010
Additional files
Additional file 1:
Homogeneity Tests are Insufficient to Detect Selection. The necessity of computing the codon mutation frequencies M via the nucleotide frequencies P is shown by the lack of statistical power for determining selection purely by codon-by-codon comparison of unselected and selected clones. (A) Using the test for such multinomial homogeneity given by Wolpert [19], the posterior log2-odds-ratio between hypotheses, ≈ -0.4, implies that they are virtually indistinguishable. (B) The estimated power of such an analysis has a posterior log2-odds-ratio of ≈ 0.05, showing the unsuitability of tests for functional selection that rely only on codon-based mutation counts. Of particular significance is that the M1 start codon is not discerned in either the selected or the unselected population, even though it is absolutely required for protein function in the selected clones and absolutely conserved, due to the cloning technique, in the unselected population. The complete absence of power at M1 and other sites shows that codon homogeneity is unsuitable as evidence of selection. Note that because log2-odds-ratios are additive, combining counts for identical codon classes increases the log2-odds-ratio only linearly, so reasonable power cannot be achieved by codon-class analysis either, for the given sample size.
Format: PDF Size: 83KB
Additional file 2:
The Kronecker Product, Illustrated. An explicit representation of the Kronecker product P ⊗ P. Mutations at different nucleotide sites are assumed independent. The frequency with which nucleotide j is mutated to i is p_ij; for a second nucleotide, the frequency with which nucleotide l is mutated to k is p_kl. Therefore, the joint frequency with which both mutations occur is the product p_ij p_kl. A third Kronecker multiplication yields the 64 × 64 matrix M = P ⊗ P ⊗ P: given a third mutation of frequency p_mn, the final codon mutation frequency is p_ij p_kl p_mn.
Format: PDF Size: 694KB
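The construction described for Additional file 2 can be sketched in a few lines of Python. This is an illustration only, not the code of the 'unigenic' package: the matrix values (0.97 on the diagonal, 0.01 elsewhere), the nucleotide ordering "ACGT", and the helper names `kron`, `codon_index`, and `codon_mutation_frequency` are all assumptions made for the example. Following the text, entry p_ij is read as the frequency with which nucleotide j is mutated to i, so each column sums to one.

```python
NUCLEOTIDES = "ACGT"  # assumed ordering; the paper's ordering may differ


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def codon_index(codon):
    """Row/column index of a codon in the 64 x 64 matrix M."""
    i, k, m = (NUCLEOTIDES.index(n) for n in codon)
    return 16 * i + 4 * k + m


# Hypothetical nucleotide matrix P: P[i][j] is the frequency with which
# nucleotide j is observed as i. These values are illustrative, not estimates.
P = [[0.97 if i == j else 0.01 for j in range(4)] for i in range(4)]

# M = P (x) P (x) P, the 64 x 64 codon mutation matrix.
M = kron(kron(P, P), P)


def codon_mutation_frequency(target, source):
    """Frequency with which codon `source` is observed as codon `target`."""
    return M[codon_index(target)][codon_index(source)]


if __name__ == "__main__":
    # ATG -> ATA changes only the third position: p_AA * p_TT * p_AG.
    print(codon_mutation_frequency("ATA", "ATG"))
```

Because each entry of M is a product of three entries of P, only the 4 × 4 matrix needs to be estimated from data; the 64 × 64 codon matrix follows from the independence assumption.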
Additional file 3:
Details of the GIY-YIG Domain. Numerical details of the GIY-YIG motif grey-highlighted in Figure 1. 'EoS' (evidence of selection) refers to the log-odds ratio, 'Total' is the total (synonymous plus nonsynonymous) number of observed codon mutations, 'NSO' is the observed number of nonsynonymous mutations, and 'NSE' is the expected number of nonsynonymous mutations. All 'expected' values are conditioned on the null hypothesis of 'no selection'. Additional expected nonsynonymous counts for different codons are shown in Additional file 4.
Format: PDF Size: 31KB
Additional file 4:
The Expected Number of Nonsynonymous Misincorporations. Percentiles for the expected number of nonsynonymous mutations under the null hypothesis of 'no selection' for different clone population sample sizes, given misincorporation frequencies estimated from the unselected population counts shown in Table 1. Of particular importance is the wide range of 'Pr(NS)', the estimated probability of nonsynonymous mutation. This probability ranges from 0.0056 to 0.0633 per codon, an approximately 11-fold difference. 'Q02', 'Q50', and 'Q98' represent the 2%, 50%, and 98% binomial percentiles, respectively, indicating that the observed number of nonsynonymous mutations under H0 is 96% likely to be within the indicated range. Codons resistant to nonsynonymous mutation, such as alanine and glycine, show obvious non-normality even with 200-500 sequenced clones.
Format: PDF Size: 58KB
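The percentiles described for Additional file 4 can be approximated with a short binomial calculation. The sketch below is an assumption-laden illustration rather than the authors' procedure: it treats each clone as an independent Bernoulli trial with probability Pr(NS) at a given codon, and it uses the "smallest k whose cumulative probability reaches q" convention for percentiles, which may differ from the convention used in the file. The function names are hypothetical.

```python
from math import comb


def binomial_quantile(n, p, q):
    """Smallest k with Pr(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cumulative >= q:
            return k
    return n  # guards against floating-point shortfall in the final sum


def nonsynonymous_percentiles(n_clones, pr_ns):
    """(Q02, Q50, Q98) for the nonsynonymous count at one codon under H0."""
    return tuple(binomial_quantile(n_clones, pr_ns, q) for q in (0.02, 0.50, 0.98))


if __name__ == "__main__":
    # The two extreme Pr(NS) values quoted in the text, at two sample sizes.
    for pr_ns in (0.0056, 0.0633):
        for n_clones in (200, 500):
            print(pr_ns, n_clones, nonsynonymous_percentiles(n_clones, pr_ns))
```

For small Pr(NS) the expected count stays close to zero even with hundreds of clones, which is why a normal approximation is poor for mutation-resistant codons.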
Additional file 5:
The Effect of Sample Size for I-BmoI. The effect of differing selected and unselected clone population sample sizes on the power of inference. Subsamples of 5, 10, 20, 40, and 87 (all) clones were analyzed as per Figure 1 and are shown using identical axis scales; the 87-87 plot is therefore identical to Figure 1. All populations are subset-inclusive, meaning that the 10-sample subset contains all sequences of the 5-sample subset, and so on. Approximate nucleotide misincorporation frequencies can be estimated by dividing the counts shown in Table 1 as appropriate. We note that even using only 5 of the 87 unselected clones to estimate parameter matrix T resulted in qualitatively similar EoS values (red) for all 87-clone selected populations. Unselected clones were critical, however, in estimating false-positive (blue) rates, with all 87 unselected clones being required to detect the methionine start signal.
Format: PDF Size: 621KB
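One way to build subset-inclusive subsamples like those described for Additional file 5 is to fix a single ordering of the clones and take prefixes of it. The text does not say how the subsets were chosen, so the random shuffle, the seed, and the clone labels below are assumptions made purely for illustration.

```python
import random


def nested_subsets(clones, sizes, seed=0):
    """Return {size: subset} where each smaller subset is contained in every larger one."""
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    order = list(clones)
    rng.shuffle(order)  # one shared ordering guarantees inclusion
    return {size: order[:size] for size in sorted(sizes)}


if __name__ == "__main__":
    # Hypothetical clone labels; the real data set has 87 clones per population.
    clones = [f"clone_{i:02d}" for i in range(87)]
    subsets = nested_subsets(clones, [5, 10, 20, 40, 87])
    for size, subset in subsets.items():
        print(size, subset[:3])
```

Taking prefixes of one ordering means that any difference between, say, the 10-clone and 20-clone analyses comes only from the added clones, not from a fresh random draw.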
Additional file 6:
Software Package. Source code for an R-Project software package that we call 'unigenic'. The code has been tested on Mac OS 10.5 and recent versions of Linux-based operating systems, and requires that R ≥ 2.9.1 and a modern Fortran 95 compiler be available. For help installing R packages, see http://cran.r-project.org/doc/manuals/R-admin.html#Installing-packages.
Format: GZ Size: 26KB
Additional file 7:
Sample Input and Output. Sample input, output, and driver files for the given software package.
Format: ZIP Size: 732KB
