Algorithms for Molecular Biology - Latest Articles
http://www.almob.org
The latest research articles published by Algorithms for Molecular Biology
2015-01-23T12:00:00Z
EUCALYPT: Efficient tree reconciliation enumerator
Background:
Phylogenetic tree reconciliation is the approach of choice for investigating the coevolution of sets of organisms such as hosts and parasites. It consists of a mapping between the parasite tree and the host tree using event-based maximum parsimony. Given a cost model for the events, however, many optimal reconciliations are possible. Any further biological interpretation of them must therefore take this into account, making the capacity to enumerate all optimal solutions a crucial point. Only two algorithms currently exist that attempt such enumeration; in one case not all possible solutions are produced, while in the other not all cost vectors are currently handled. The objective of this paper is two-fold. The first is to fill this gap, and the second is to test whether the number of solutions generally observed can be an issue in terms of interpretation.
Results:
We present a polynomial-delay algorithm for enumerating all optimal reconciliations. We show that in general many solutions exist. We give an example where, for two pairs of host-parasite trees each having fewer than 41 leaves, the number of solutions is 5120, even when only time-feasible ones are kept. To facilitate their interpretation, those solutions are also classified in terms of how many of each event they contain. The number of different classes of solutions may thus be notably smaller than the number of solutions, yet it may remain high, in particular for the cases where losses have cost 0. In fact, depending on the cost vector, both the number of solutions and the number of classes may increase considerably. To further deal with this problem, we introduce and analyse a restricted version where host switches are allowed to happen only between species that are within some fixed distance along the host tree. This restriction allows us to reduce the number of time-feasible solutions while preserving the same optimal cost, as well as to find time-feasible solutions with a cost close to the optimal in the cases where no time-feasible solution is found.
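The classification of solutions by event counts can be sketched as follows. This is only an illustration, not Eucalypt's implementation: it assumes the usual four event types (cospeciation, duplication, host switch, loss), represents each reconciliation as a hypothetical flat list of event labels, and uses made-up toy solutions with an assumed cost vector in which losses cost 0.

```python
from collections import Counter, defaultdict

# Assumed event types and their order in the cost vector (illustrative only).
EVENTS = ("cospeciation", "duplication", "host_switch", "loss")


def event_vector(reconciliation):
    """Count how many times each event type occurs in one reconciliation."""
    counts = Counter(reconciliation)
    return tuple(counts[event] for event in EVENTS)


def cost(vector, cost_vector):
    """Total cost: sum over event types of (number of events) * (unit cost)."""
    return sum(n * c for n, c in zip(vector, cost_vector))


def classify(reconciliations):
    """Group reconciliation indices by their event vector."""
    classes = defaultdict(list)
    for index, reconciliation in enumerate(reconciliations):
        classes[event_vector(reconciliation)].append(index)
    return dict(classes)


# Hypothetical toy solutions; real ones would come from an enumerator.
solutions = [
    ["cospeciation", "cospeciation", "host_switch", "loss"],
    ["cospeciation", "host_switch", "cospeciation", "loss"],
    ["cospeciation", "duplication", "loss", "loss", "loss"],
]
classes = classify(solutions)
loss_free_costs = (0, 1, 1, 0)  # assumed cost vector with loss cost 0
for vector, members in classes.items():
    print(vector, members, cost(vector, loss_free_costs))
```

In this toy example the three solutions fall into two classes with the same total cost, which mirrors the observation that the number of classes can be smaller than the number of solutions.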
Conclusions:
We present Eucalypt, a polynomial-delay algorithm for enumerating all optimal reconciliations which is freely available at http://eucalypt.gforge.inria.fr/.
http://www.almob.org/content/10/1/3
Beatrice Donati, Christian Baudet, Blerina Sinaimeri, Pierluigi Crescenzi, Marie-France Sagot
Algorithms for Molecular Biology (ISSN 1748-7188) 2015, 10:3; 2015-01-23T12:00:00Z; doi:10.1186/s13015-014-0031-3
A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads
Background:
Metagenomics is the study of genetic material derived directly from complex microbial samples rather than from cultures. One of the crucial steps in metagenomic analysis, referred to as “binning”, is to separate reads into clusters that represent genomes from closely related organisms. Among the existing binning methods, unsupervised methods base the classification on features extracted from the reads themselves, which is especially advantageous when reference databases are limited. However, their performance is still being investigated from various angles by recent theoretical and empirical studies. This paper is among those efforts to enhance the accuracy of the classification.
Results:
This paper presents an unsupervised algorithm, called BiMeta, for binning of reads from different species in a metagenomic dataset. The algorithm consists of two phases. In the first phase, reads are clustered into groups based on overlap information between the reads. The second phase merges the groups by using an observation on the l-mer frequency distribution of sets of non-overlapping reads. The experimental results on simulated and real datasets show that BiMeta outperforms three state-of-the-art binning algorithms on both short-read and long-read (≥700 bp) datasets.
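As an illustration of the feature the second phase relies on, the sketch below computes a normalized l-mer frequency vector for a set of reads and compares two such vectors. It is not BiMeta's code: the function names, the default l = 4, the DNA alphabet, and the Euclidean distance are all assumptions made for this example, and the abstract does not specify the actual merging criterion.

```python
from collections import Counter
from itertools import product
from math import sqrt


def lmer_frequencies(reads, l=4, alphabet="ACGT"):
    """Return the normalized frequency of every l-mer over a set of reads.

    The vector is ordered like itertools.product(alphabet, repeat=l).
    Windows containing characters outside the alphabet (e.g. N) are skipped.
    """
    allowed = set(alphabet)
    counts = Counter()
    for read in reads:
        for start in range(len(read) - l + 1):
            lmer = read[start:start + l]
            if set(lmer) <= allowed:
                counts[lmer] += 1
    all_lmers = ["".join(letters) for letters in product(alphabet, repeat=l)]
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_lmers)
    return [counts[lmer] / total for lmer in all_lmers]


def euclidean(u, v):
    """Euclidean distance between two frequency vectors (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical groups of reads, such as a first phase might produce.
group_a = ["ACGTACGT", "CGTACGTA"]
group_b = ["ACGTACGA", "GTACGTAC"]
print(euclidean(lmer_frequencies(group_a, l=2), lmer_frequencies(group_b, l=2)))
```

Groups whose frequency vectors are close under such a distance would be natural candidates for merging.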
Conclusions:
This paper presents a novel and efficient algorithm for binning of metagenomic reads, which does not require any reference database. The software implementing the algorithm and all test datasets mentioned in this paper can be downloaded at http://it.hcmute.edu.vn/bioinfo/bimeta/index.htm.
http://www.almob.org/content/10/1/2
Le Vinh, Tran Lang, Le Binh, Tran Hoai
Algorithms for Molecular Biology 2015, 10:2; 2015-01-16T12:00:00Z; doi:10.1186/s13015-014-0030-4
Double and multiple knockout simulations for genome-scale metabolic network reconstructions
Background:
Constraint-based modeling of genome-scale metabolic network reconstructions has become a widely used approach in computational biology. Flux coupling analysis is a constraint-based method that analyses the impact of single reaction knockouts on other reactions in the network.
Results:
We present an extension of flux coupling analysis for double and multiple gene or reaction knockouts, and develop corresponding algorithms for an in silico simulation. To evaluate our method, we perform a full single and double knockout analysis on a selection of genome-scale metabolic network reconstructions and compare the results.
Software:
A prototype implementation of double knockout simulation is available at http://hoverboard.io/L4FC.
http://www.almob.org/content/10/1/1
Yaron Goldstein, Alexander Bockmayr
Algorithms for Molecular Biology 2015, 10:1; 2015-01-09T12:00:00Z; doi:10.1186/s13015-014-0028-y
On the number of genomic pacemakers: a geometric approach
The universal pacemaker (UPM) model extends the classical molecular clock (MC) model by allowing each gene, in addition to its individual intrinsic rate as in the MC, to accelerate or decelerate according to the universal pacemaker. Under UPM, the relative evolutionary rates of all genes remain nearly constant whereas the absolute rates can change arbitrarily. It was shown on several taxa groups spanning the entire tree of life that the UPM model describes the evolutionary process better than the MC model. In this work we provide a natural generalization of the UPM model that we denote multiple pacemakers (MPM). Under the MPM model every gene is still affected by a single pacemaker; however, the number of pacemakers is not confined to one. Such a model induces a partition over the gene set where all the genes in one part are affected by the same pacemaker, and the task is to identify the pacemaker partition, or in other words, to find for each gene its associated pacemaker. We devise a novel heuristic procedure, relying on statistical and geometrical tools, to solve the problem and demonstrate by simulation that this approach can cope satisfactorily with considerable noise and realistic problem sizes. We applied this procedure to a set of over 2000 genes in 100 prokaryotes and demonstrated the significant existence of two pacemakers.
http://www.almob.org/content/9/1/26
Sagi Snir
Algorithms for Molecular Biology 2014, 9:26; 2014-12-31T12:00:00Z; doi:10.1186/s13015-014-0026-0
BicPAM: Pattern-based biclustering for biomedical data analysis
Background:
Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task for studying a wide set of biomedical problems, where molecular units or patients are meaningfully related with a set of properties. The challenging combinatorial nature of this task led to the development of approaches with restrictions on the allowed type, number and quality of biclusters. In contrast, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters and their underlying contributions remain dispersed.
Methods:
The proposed BicPAM biclustering approach integrates existing principles made available by state-of-the-art pattern-based approaches with two new contributions. First, BicPAM is the first efficient attempt to exhaustively mine non-constant types of biclusters, including additive and multiplicative coherencies in the presence or absence of symmetries. Second, BicPAM provides strategies to effectively compose different biclustering structures and to handle arbitrary levels of noise inherent to data and with discretization procedures.
Results:
Results show BicPAM’s superiority against its peers and its ability to retrieve unique types of biclusters of interest, to efficiently deliver exhaustive solutions and to successfully recover planted biclusters in datasets with varying levels of missing values and noise. Its application over gene expression data leads to unique solutions with heightened biological relevance.
Conclusions:
BicPAM integrates existing dispersed efforts towards pattern-based biclustering and provides the first critical strategies to efficiently discover exhaustive solutions of biclusters with shifting, scaling and symmetric assumptions with varying quality and underlying structures. Additionally, BicPAM dynamically adapts its behavior to mine data with different levels of missing values and noise.
http://www.almob.org/content/9/1/27
Rui Henriques, Sara Madeira
Algorithms for Molecular Biology 2014, 9:27; 2014-12-16T12:00:00Z; doi:10.1186/s13015-014-0027-z
Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models
Background:
Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability of finding at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are assumed to have the same length.
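To make the definition concrete, the sketch below computes this P-value by brute force for a Bernoulli (independent letters) model. It is not the SufPref algorithm: it enumerates every text of length N, so it is exponential in N and only usable on toy inputs, and it counts overlapping occurrences, which is an assumption of this example. The function names and the toy parameters are hypothetical.

```python
from itertools import product


def count_occurrences(text, words):
    """Count occurrences (overlaps included) of any pattern word in text."""
    return sum(text.startswith(word, i)
               for i in range(len(text))
               for word in words)


def exact_p_value(words, n, s, letter_probs):
    """P(at least s occurrences of pattern words in a random text of length n).

    letter_probs maps each letter to its probability under a Bernoulli model.
    Brute force: sums the probability of every qualifying text.
    """
    alphabet = sorted(letter_probs)
    p_value = 0.0
    for letters in product(alphabet, repeat=n):
        if count_occurrences("".join(letters), words) >= s:
            probability = 1.0
            for letter in letters:
                probability *= letter_probs[letter]
            p_value += probability
    return p_value


# Toy example: pattern {AA}, texts of length 3 over a uniform two-letter alphabet.
print(exact_p_value({"AA"}, n=3, s=1, letter_probs={"A": 0.5, "C": 0.5}))
```

In the toy example the qualifying texts are AAA, AAC and CAA, each with probability 1/8.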
Results:
We present a novel algorithm, SufPref, that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from the pattern. The edges are associated with the prefix and suffix relations between overlaps. An original feature of our data structure is that the pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of the SufPref data structure leads to significant improvements in space and time complexity compared to existing algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as a Web server and as a stand-alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities.
http://www.almob.org/content/9/1/25
Mireille Régnier, Evgenia Furletova, Victor Yakovlev, Mikhail Roytberg
Algorithms for Molecular Biology 2014, 9:25; 2014-12-16T12:00:00Z; doi:10.1186/s13015-014-0025-1
A constraint solving approach to model reduction by tropical equilibration
Model reduction is a central topic in systems biology and dynamical systems theory, for reducing the complexity of detailed models, finding important parameters, and developing multi-scale models for instance. While singular perturbation theory is a standard mathematical tool to analyze the different time scales of a dynamical system and decompose the system accordingly, tropical methods provide a simple algebraic framework to perform these analyses systematically in polynomial systems. The crux of these methods is in the computation of tropical equilibrations. In this paper we show that constraint-based methods, using reified constraints for expressing the equilibration conditions, make it possible to numerically solve non-linear tropical equilibration problems, out of reach of standard computation methods. We illustrate this approach first with the detailed reduction of a simple biochemical mechanism, the Michaelis-Menten enzymatic reaction model, and second, with large-scale performance figures obtained on the http://biomodels.net repository.
http://www.almob.org/content/9/1/24
Sylvain Soliman, François Fages, Ovidiu Radulescu
Algorithms for Molecular Biology 2014, 9:24; 2014-12-04T12:00:00Z; doi:10.1186/s13015-014-0024-2
Atom mapping with constraint programming
Chemical reactions are rearrangements of chemical bonds. Each atom in an educt molecule thus appears again in a specific position of one of the reaction products. This bijection between educt and product atoms is not reported by chemical reaction databases, however, so that the “Atom Mapping Problem” of finding this bijection is left as an important computational task for many practical applications in computational chemistry and systems biology. Elementary chemical reactions feature a cyclic imaginary transition state (ITS) that imposes additional restrictions on the bijection between educt and product atoms that are not taken into account by previous approaches. We demonstrate that Constraint Programming is well-suited to solving the Atom Mapping Problem in this setting. The performance of our approach is evaluated for a manually curated subset of chemical reactions from the KEGG database featuring various ITS cycle layouts and reaction mechanisms.
http://www.almob.org/content/9/1/23
Martin Mann, Feras Nahar, Norah Schnorr, Rolf Backofen, Peter Stadler, Christoph Flamm
Algorithms for Molecular Biology 2014, 9:23; 2014-11-29T00:00:00Z; doi:10.1186/s13015-014-0023-3
A priori assessment of data quality in molecular phylogenetics
Sets of sequence data used in phylogenetic analysis are often plagued by both random noise and systematic biases. Since the commonly used methods of phylogenetic reconstruction are designed to produce trees it is an important task to evaluate these trees a posteriori. Preferably, however, one would like to assess the suitability of the input data for phylogenetic analysis a priori and, if possible, obtain information on how to prune the data sets to improve the quality of phylogenetic reconstruction without introducing unwarranted biases. In the last few years several different approaches, algorithms, and software tools have been proposed for this purpose. Here we provide an overview of the state of the art and briefly discuss the most pressing open problems.
http://www.almob.org/content/9/1/22
Bernhard Misof, Karen Meusemann, Björn von Reumont, Patrick Kück, Sonja Prohaska, Peter Stadler
Algorithms for Molecular Biology 2014, 9:22; 2014-09-12T00:00:00Z; doi:10.1186/s13015-014-0022-4
Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures
Background:
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Results:
Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of O(n^6 D^5) for sequence length n and D distinct distance values, it is possible to reduce this to O(n^4) for practical applications in which predominantly small distances are of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.
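For a single secondary structure, the graph-distance between two nucleotides can be sketched with a breadth-first search, as below. This is only an illustration of the distance itself, not of the paper's dynamic programming over the Boltzmann ensemble or of its sampling approaches. It assumes a balanced, pseudoknot-free dot-bracket string as input, 0-based positions, and unit weight for both backbone and base-pair edges; the function names are hypothetical.

```python
from collections import deque


def pair_table(dot_bracket):
    """Map each paired position to its partner (balanced input assumed)."""
    stack, pairs = [], {}
    for position, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            partner = stack.pop()
            pairs[position] = partner
            pairs[partner] = position
    return pairs


def graph_distance(dot_bracket, source, target):
    """Shortest path length between two positions in the structure graph.

    Nodes are nucleotides; edges join backbone neighbors and base pairs.
    Every edge has weight 1 (an assumption of this sketch).
    """
    n = len(dot_bracket)
    pairs = pair_table(dot_bracket)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return distance[u]
        neighbors = [u - 1, u + 1]
        if u in pairs:
            neighbors.append(pairs[u])
        for v in neighbors:
            if 0 <= v < n and v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return None  # unreachable only if target lies outside the sequence


# A small hairpin: positions 0-6 and 1-5 are paired.
print(graph_distance("((...))", 0, 6))
```

A sampling approach in the spirit of the abstract would apply such a routine to many structures drawn from the ensemble and tabulate the resulting distances.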
Conclusions:
The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.uni-jena.de/RNAgraphdist.html.
http://www.almob.org/content/9/1/19
Jing Qin, Markus Fricke, Manja Marz, Peter Stadler, Rolf Backofen
Algorithms for Molecular Biology 2014, 9:19; 2014-09-11T00:00:00Z; doi:10.1186/1748-7188-9-19