Algorithms for Molecular Biology - Latest Articles
http://www.almob.org
The latest research articles published by Algorithms for Molecular Biology (feed updated 25 March 2015)

Sorting signed permutations by short operations

Background:
During evolution, global mutations may alter the order and the orientation of the genes in a genome. Such mutations are referred to as rearrangement events, or simply operations. In unichromosomal genomes, the most common operations are reversals, which are responsible for reversing the order and orientation of a sequence of genes, and transpositions, which are responsible for switching the location of two contiguous portions of a genome. The problem of computing the minimum sequence of operations that transforms one genome into another – which is equivalent to the problem of sorting a permutation into the identity permutation – is a well-studied problem that finds application in comparative genomics. There are a number of works concerning this problem in the literature, but they generally do not take into account the length of the operations (i.e. the number of genes affected by the operations). Since it has been observed that short operations are prevalent in the evolution of some species, algorithms that efficiently solve this problem in the special case of short operations are of interest.
Results:
In this paper, we investigate the problem of sorting a signed permutation by short operations. More precisely, we study four flavors of this problem: (i) the problem of sorting a signed permutation by reversals of length at most 2; (ii) the problem of sorting a signed permutation by reversals of length at most 3; (iii) the problem of sorting a signed permutation by reversals and transpositions of length at most 2; and (iv) the problem of sorting a signed permutation by reversals and transpositions of length at most 3. We present polynomial-time solutions for problems (i) and (iii), a 5-approximation for problem (ii), and a 3-approximation for problem (iv). Moreover, we show that the expected approximation ratio of the 5-approximation algorithm is not greater than 3 for random signed permutations with more than 12 elements. Finally, we present experimental results that show that the approximation ratios of the approximation algorithms cannot be smaller than 3. In particular, this means that the approximation ratio of the 3-approximation algorithm is tight.
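To make the operations concrete, here is a minimal Python sketch (not the paper's algorithms) of a bounded-length signed reversal, together with a brute-force breadth-first search for the minimum number of short reversals on tiny permutations:

```python
from collections import deque

def apply_reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip the signs
    of its elements: a signed reversal of length j - i + 1."""
    seg = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + seg + perm[j + 1:]

def min_short_reversals(perm, max_len=2):
    """Minimum number of reversals of length <= max_len sorting
    `perm` into the identity +1..+n, found by brute-force BFS.
    Exponential: for illustration on tiny permutations only."""
    n = len(perm)
    target = tuple(range(1, n + 1))
    start = tuple(perm)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        cur, dist = queue.popleft()
        if cur == target:
            return dist
        for i in range(n):
            for j in range(i, min(i + max_len, n)):
                nxt = tuple(apply_reversal(list(cur), i, j))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None
```

A length-1 reversal only flips a sign, and a length-2 reversal swaps two adjacent elements while flipping both signs, so every signed permutation is sortable by these operations; the BFS serves only to illustrate the problem on small inputs.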
Gustavo Galvão, Orlando Lee, Zanoni Dias
Algorithms for Molecular Biology (ISSN 1748-7188) 2015, 10:12 (25 March 2015)

A virtual pebble game to ensemble average graph rigidity

Background:
The body-bar Pebble Game (PG) algorithm is commonly used to calculate network rigidity properties in proteins and polymeric materials. To account for fluctuating interactions such as hydrogen bonds, an ensemble of constraint topologies is sampled, and average network properties are obtained by averaging PG characterizations. At a simpler level of sophistication, Maxwell constraint counting (MCC) provides a rigorous lower bound for the number of internal degrees of freedom (DOF) within a body-bar network, and it is commonly employed to test whether a molecular structure is globally under-constrained or over-constrained. MCC is a mean field approximation (MFA) that ignores spatial fluctuations of distance constraints by replacing the actual molecular structure with an effective medium in which distance constraints are distributed globally with perfectly uniform density.
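As a hedged illustration of the MCC mean-field count described above (the standard bookkeeping of 6 DOF per rigid body minus 6 global rigid-body motions minus one DOF per bar; the VPG itself is considerably more involved), plus the VPG-style replacement of fluctuating bars by their occupation probabilities:

```python
def maxwell_dof(n_bodies, n_bars):
    """Maxwell constraint counting (MCC) lower bound on the internal
    degrees of freedom of a body-bar network: each rigid body carries
    6 DOF, 6 are removed for global rigid-body motion, and each bar
    removes at most one more.  The true internal DOF count (e.g. from
    the Pebble Game) can only be larger or equal."""
    return max(0, 6 * n_bodies - 6 - n_bars)

def expected_bars(bar_probs):
    """In the VPG spirit, fluctuating constraints are replaced by the
    probabilities that they are present; the expected bar count is
    then simply the sum of those probabilities."""
    return sum(bar_probs)
```

For example, two bodies joined by 6 independent bars are mutually rigid (0 internal DOF), while replacing two of five bars by 50%-occupied hydrogen bonds raises the mean-field DOF estimate accordingly.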
Results:
The Virtual Pebble Game (VPG) algorithm is an MFA that retains spatial inhomogeneity in the density of constraints on all length scales. Network fluctuations due to distance constraints that may be present or absent based on binary random dynamic variables are suppressed by replacing all possible constraint topology realizations with the probabilities that distance constraints are present. The VPG algorithm is isomorphic to the PG algorithm: the integers that count “pebbles” placed on vertices or edges in the PG map to real numbers representing the probability of finding a pebble. In the VPG, edges are assigned pebble capacities, and pebble movements become a continuous flow of probability within the network. Comparisons between the VPG and average PG results over a test set of proteins and disordered lattices demonstrate that the VPG estimates the ensemble-averaged PG results well.
Conclusions:
The VPG runs about 20% faster than a single PG run, and it provides a pragmatic alternative to averaging PG rigidity characteristics over an ensemble of constraint topologies. The utility of the VPG falls between the most accurate but slowest method, ensemble averaging over hundreds to thousands of independent PG runs, and the fastest but least accurate method, MCC.
http://www.almob.org/content/10/1/11
Luis González, Hui Wang, Dennis Livesay, Donald Jacobs
Algorithms for Molecular Biology 2015, 10:11 (18 March 2015)
doi:10.1186/s13015-015-0039-3

A simple data-adaptive probabilistic variant calling model

Background:
Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments.
Results:
We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates and are used to estimate empirical log-likelihoods. The likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this score we determine a decision threshold to separate potentially variant sites from the noisy background.
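A minimal sketch of the score-and-threshold idea; the function names and feature log-likelihoods below are illustrative assumptions (the paper estimates its log-likelihoods empirically from low-mismatch sites):

```python
def variant_score(feature_loglik_var, feature_loglik_ref):
    """Combine independent per-feature log-likelihoods into a single
    log-likelihood-ratio score: positive values favour a variant site.
    Illustrative only; the actual likelihoods are data-adaptive."""
    return sum(feature_loglik_var) - sum(feature_loglik_ref)

def call_sites(scores, threshold):
    """Separate potentially variant sites from the noisy background
    by thresholding the score; returns the indices of called sites."""
    return [i for i, s in enumerate(scores) if s > threshold]
```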
Conclusions:
In simulations we show that our simple model is competitive, in terms of sensitivity and specificity, with frequently used and much more complex SNV calling algorithms. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences between the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
http://www.almob.org/content/10/1/10
Steve Hoffmann, Peter Stadler, Korbinian Strimmer
Algorithms for Molecular Biology 2015, 10:10 (4 March 2015)
doi:10.1186/s13015-015-0037-5

Protein docking with predicted constraints

This paper presents a constraint-based method for improving protein docking results. Efficient constraint propagation cuts over 95% of the search time for finding the configurations with the largest contact surface, provided a contact is specified between two amino acid residues. This makes it possible to scan a large number of potentially correct constraints, lowering the requirements for useful contact predictions. While other approaches are very dependent on accurate contact predictions, ours requires only that at least one correct contact be retained in a set of, for example, one hundred constraints to test. It is this feature that makes it feasible to use readily available sequence data to predict specific potential contacts. Although such prediction is too inaccurate for most purposes, we demonstrate with a Naïve Bayes Classifier that it is accurate enough, when combined with our constrained docking algorithm, to more than double the average number of acceptable models retained during the crucial filtering stage of protein docking. All software developed in this work is freely available as part of the Open Chemera Library.
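The retention criterion ("at least one predicted contact realised") can be sketched as follows; the coordinate representation, 8 Å cutoff, and function names are assumptions for illustration, not the Open Chemera API:

```python
import math

def contact_satisfied(model_a, model_b, pair, cutoff=8.0):
    """True if the predicted contact `pair` = (residue index in A,
    residue index in B) is realised in a docked model, i.e. the two
    residue centres lie within `cutoff` angstroms of each other."""
    return math.dist(model_a[pair[0]], model_b[pair[1]]) <= cutoff

def keep_model(model_a, model_b, predicted_contacts, cutoff=8.0):
    """Retain a docking model if at least one predicted contact is
    realised, mirroring the idea that only one correct contact in
    the candidate constraint set needs to survive."""
    return any(contact_satisfied(model_a, model_b, p, cutoff)
               for p in predicted_contacts)
```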
http://www.almob.org/content/10/1/9
Ludwig Krippahl, Pedro Barahona
Algorithms for Molecular Biology 2015, 10:9 (20 February 2015)
doi:10.1186/s13015-015-0036-6

MoDock: A multi-objective strategy improves the accuracy for molecular docking

Background:
Molecular docking is the most widely used method of structure-based virtual screening in practice. However, the limited accuracy of scoring functions is considered the biggest barrier hindering the improvement of molecular docking methods.
Results:
A new multi-objective strategy for molecular docking, named MoDock, is presented to further improve docking accuracy with available scoring functions. Instead of a simple combination of multiple objectives with fixed weight factors, an aggregate function is adopted to approximate the real solution of the original multi-objective, multi-constraint problem; this simultaneously smooths the energy surface of the combined scoring functions. The method of centers and a genetic algorithm are then used to find the optimal solution. Tests of MoDock against the GOLD test data set reveal that the multi-objective strategy improves docking accuracy over the individual scoring functions. Moreover, a 70% rate of good docking solutions with RMSD below 1.0 Å outperforms 6 other commonly used docking programs, even with a flexible-receptor docking program included.
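MoDock's own aggregate function is defined in the paper; purely as an illustration of smooth aggregation of several objectives (as opposed to a fixed-weight sum), here is a Kreisselmeier-Steinhauser (log-sum-exp) aggregate:

```python
import math

def ks_aggregate(scores, rho=10.0):
    """Kreisselmeier-Steinhauser aggregate of several objective
    values: a smooth, differentiable upper envelope of the scores
    that approaches max(scores) as rho grows.  Shown only to
    illustrate the idea of an aggregate function; it is not the
    aggregate used by MoDock."""
    m = max(scores)  # shift for numerical stability
    return m + math.log(sum(math.exp(rho * (s - m)) for s in scores)) / rho
```

A smooth envelope like this avoids the kinks a hard max would introduce into the combined energy surface, which is the kind of smoothing the abstract alludes to.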
Conclusions:
The results show that MoDock is an effective strategy for overcoming the deviations introduced by single scoring functions, and that it improves the predictive power of molecular docking.
http://www.almob.org/content/10/1/8
Junfeng Gu, Xu Yang, Ling Kang, Jinying Wu, Xicheng Wang
Algorithms for Molecular Biology 2015, 10:8 (18 February 2015)
doi:10.1186/s13015-015-0034-8

Algorithmic approaches to protein-protein interaction site prediction

Interaction sites on protein surfaces mediate virtually all biological activities, and their identification holds promise for disease treatment and drug design. Novel algorithmic approaches for the prediction of these sites have been produced at a rapid rate, and the field has seen significant advancement over the past decade. However, the most current methods have not yet been reviewed in a systematic and comprehensive fashion. Herein, we describe the intricacies of the biological theory, datasets, and features required for modern protein-protein interaction site (PPIS) prediction, and present an integrative analysis of the state-of-the-art algorithms and their performance. First, the major sources of data used by predictors are reviewed, including training sets, evaluation sets, and methods for their procurement. Then, the features employed and their importance in the biological characterization of PPISs are explored. This is followed by a discussion of the methodologies adopted in contemporary prediction programs, as well as their relative performance on the datasets most recently used for evaluation. In addition, the potential utility that PPIS identification holds for rational drug design, hotspot prediction, and computational molecular docking is described. Finally, an analysis of the most promising areas for future development of the field is presented.
http://www.almob.org/content/10/1/7
Tristan Aumentado-Armstrong, Bogdan Istrate, Robert Murgita
Algorithms for Molecular Biology 2015, 10:7 (15 February 2015)
doi:10.1186/s13015-015-0033-9

An integrative approach for a network based meta-analysis of viral RNAi screens

Background:
Big data is becoming ubiquitous in biology, and poses significant challenges in data analysis and interpretation. RNAi screening has become a workhorse of functional genomics, and has been applied, for example, to identify host factors involved in infection for a panel of different viruses. However, the analysis of data resulting from such screens is difficult, with often low overlap between hit lists, even when comparing screens targeting the same virus. This makes it a major challenge to select interesting candidates for further detailed, mechanistic experimental characterization.
Results:
To address this problem we propose an integrative bioinformatics pipeline that allows for a network based meta-analysis of viral high-throughput RNAi screens. Initially, we collate a human protein interaction network from various public repositories, which is then subjected to unsupervised clustering to determine functional modules. Modules that are significantly enriched with host dependency factors (HDFs) and/or host restriction factors (HRFs) are then filtered based on network topology and semantic similarity measures. Modules passing all these criteria are finally interpreted for their biological significance using enrichment analysis, and interesting candidate genes can be selected from the modules.
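The module-enrichment step can be illustrated with a one-sided hypergeometric test, a standard choice for testing over-representation of hit genes in a module; this sketch names no part of the actual pipeline and uses only the standard library:

```python
from math import comb

def hypergeom_enrichment_p(module_size, hits_in_module,
                           network_size, hits_in_network):
    """One-sided hypergeometric p-value for over-representation of
    hit genes (e.g. host dependency factors) in a network module:
    P(X >= hits_in_module) when drawing `module_size` genes from a
    network of `network_size` genes containing `hits_in_network`
    hits.  Small p-values flag significantly enriched modules."""
    total = comb(network_size, module_size)
    upper = min(module_size, hits_in_network)
    p = 0.0
    for k in range(hits_in_module, upper + 1):
        p += (comb(hits_in_network, k)
              * comb(network_size - hits_in_network,
                     module_size - k)) / total
    return p
```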
Conclusions:
We apply our approach to seven screens targeting three different viruses, and compare results with other published meta-analyses of viral RNAi screens. We recover key hit genes, and identify additional candidates from the screens. While we demonstrate the application of the approach using viral RNAi data, the method is generally applicable to identify underlying mechanisms from hit lists derived from high-throughput experimental data, and to select a small number of most promising genes for further mechanistic studies.
http://www.almob.org/content/10/1/6
Sandeep Amberkar, Lars Kaderali
Algorithms for Molecular Biology 2015, 10:6 (13 February 2015)
doi:10.1186/s13015-015-0035-7

Estimating evolutionary distances between genomic sequences from spaced-word matches

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
http://www.almob.org/content/10/1/5
Burkhard Morgenstern, Bingyao Zhu, Sebastian Horwege, Chris Leimeister
Algorithms for Molecular Biology 2015, 10:5 (11 February 2015)
doi:10.1186/s13015-015-0032-x

Clustering of reads with alignment-free measures and quality values

Background:
The data volume generated by Next-Generation Sequencing (NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads into a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mer counts, have been used to cluster reads. Quality scores produced by NGS platforms are fundamental for various analyses of NGS data, such as read mapping and error detection. Moreover, future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%).
Results:
In this scenario it will be fundamental to exploit quality value information within the alignment-free framework. To the best of our knowledge this is the first study that incorporates quality value information and k-mer counts, in the context of alignment-free measures, for the comparison of read data. Based on these principles, in this paper we present a family of alignment-free measures called D^q-type. A set of experiments on simulated and real read data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. Results on de novo assembly and metagenomic read classification also show that the introduction of quality values improves over standard alignment-free measures. These statistics are implemented in a software tool called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).
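One plausible way to fold quality values into k-mer statistics, shown as an illustrative sketch (the D^q-type measures themselves are defined in the paper): weight each k-mer occurrence by the probability that all of its bases were called correctly, derived from Phred scores.

```python
from collections import Counter

def weighted_kmer_counts(read, quals, k):
    """k-mer counts in which each occurrence is weighted by the
    probability that all k bases were called correctly, using the
    Phred relation P(correct) = 1 - 10**(-Q/10).  An illustrative
    quality-aware count, not the paper's D^q-type definition."""
    counts = Counter()
    for i in range(len(read) - k + 1):
        w = 1.0
        for q in quals[i:i + k]:
            w *= 1.0 - 10.0 ** (-q / 10.0)
        counts[read[i:i + k]] += w
    return counts
```

Low-quality (high-error-probability) k-mers then contribute less to any downstream alignment-free comparison, which is the intuition behind quality-aware measures.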
http://www.almob.org/content/10/1/4
Matteo Comin, Andrea Leoni, Michele Schimd
Algorithms for Molecular Biology 2015, 10:4 (28 January 2015)
doi:10.1186/s13015-014-0029-x

EUCALYPT: efficient tree reconciliation enumerator

Background:
Phylogenetic tree reconciliation is the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites. It consists of a mapping between the parasite tree and the host tree using event-based maximum parsimony. Given a cost model for the events, however, many optimal reconciliations are possible. Any further biological interpretation must therefore take this into account, making the capacity to enumerate all optimal solutions a crucial point. Only two algorithms currently attempt such an enumeration: one does not produce all possible solutions, while the other does not handle all cost vectors. The objective of this paper is two-fold: first, to fill this gap, and second, to test whether the number of solutions generally observed can be an issue in terms of interpretation.
Results:
We present a polynomial-delay algorithm for enumerating all optimal reconciliations. We show that in general many solutions exist. We give an example where, for two pairs of host-parasite trees each having fewer than 41 leaves, the number of solutions is 5120, even when only time-feasible ones are kept. To facilitate their interpretation, the solutions are also classified in terms of how many of each event they contain. The number of different classes of solutions may thus be notably smaller than the number of solutions, yet it may remain high, in particular in cases where losses have cost 0. In fact, depending on the cost vector, both the number of solutions and the number of classes may increase considerably. To further deal with this problem, we introduce and analyse a restricted version in which host switches are allowed only between species that are within some fixed distance along the host tree. This restriction allows us to reduce the number of time-feasible solutions while preserving the same optimal cost, as well as to find time-feasible solutions with a cost close to the optimum in cases where no time-feasible solution is found.
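The classification of solutions by event content can be sketched as follows; the event names and the list-of-events representation are illustrative assumptions, not EUCALYPT's data structures:

```python
from collections import Counter

EVENTS = ('cospeciation', 'duplication', 'host-switch', 'loss')

def event_class(reconciliation):
    """Map a reconciliation, represented here as a list of event
    labels, to its class: the vector of event counts.  Many optimal
    reconciliations can share one class, which is why the number of
    classes is often far smaller than the number of solutions."""
    c = Counter(reconciliation)
    return tuple(c[e] for e in EVENTS)

def classify(reconciliations):
    """Group solutions by their event-count class."""
    groups = {}
    for r in reconciliations:
        groups.setdefault(event_class(r), []).append(r)
    return groups
```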
Conclusions:
We present Eucalypt, a polynomial-delay algorithm for enumerating all optimal reconciliations which is freely available at http://eucalypt.gforge.inria.fr/.
http://www.almob.org/content/10/1/3
Beatrice Donati, Christian Baudet, Blerina Sinaimeri, Pierluigi Crescenzi, Marie-France Sagot
Algorithms for Molecular Biology 2015, 10:3 (23 January 2015)
doi:10.1186/s13015-014-0031-3