Algorithms for Molecular Biology - Latest Articles
http://www.almob.org
The latest research articles published by Algorithms for Molecular Biology
2014-09-12T00:00:00Z

A priori assessment of data quality in molecular phylogenetics
Sets of sequence data used in phylogenetic analysis are often plagued by both random noise and systematic biases. Since the commonly used methods of phylogenetic reconstruction are designed to produce trees it is an important task to evaluate these trees a posteriori. Preferably, however, one would like to assess the suitability of the input data for phylogenetic analysis a priori and, if possible, obtain information on how to prune the data sets to improve the quality of phylogenetic reconstruction without introducing unwarranted biases. In the last few years several different approaches, algorithms, and software tools have been proposed for this purpose. Here we provide an overview of the state of the art and briefly discuss the most pressing open problems.
http://www.almob.org/content/9/1/22
Bernhard Misof, Karen Meusemann, Björn von Reumont, Patrick Kück, Sonja Prohaska, Peter Stadler
Algorithms for Molecular Biology 2014, 9:22 (ISSN 1748-7188), published 2014-09-12, doi:10.1186/s13015-014-0022-4

Graph-distance distribution of the Boltzmann ensemble of RNA secondary structures
Background:
Large RNA molecules are often composed of multiple functional domains whose spatial arrangement strongly influences their function. Pre-mRNA splicing, for instance, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium harbors useful information on the shape of the molecule that in turn can give insights into the interplay of its functional domains.
Results:
Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between a fixed pair of nucleotides can be computed in polynomial time by means of dynamic programming. While a naïve implementation would yield recursions with a very high time complexity of $O(n^{6} D^{5})$ for sequence length n and D distinct distance values, it is possible to reduce this to $O(n^{4})$ for practical applications in which predominantly small distances are of interest. Further reductions, however, seem to be difficult. Therefore, we introduced sampling approaches that are much easier to implement. They are also theoretically favorable for several real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules.
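To make the notion of graph-distance concrete, the following minimal sketch computes it for a single, fixed secondary structure given in dot-bracket notation. It is not the dynamic programming or sampling method of the paper, which works over the whole Boltzmann ensemble. The function names and the choice of unit weight for both backbone and base-pair edges are assumptions made for illustration.

```python
from collections import deque


def pair_table(dot_bracket):
    """Map each paired position (0-based) to its partner; unpaired positions are absent."""
    stack, partner = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def graph_distance(dot_bracket, start, end):
    """Shortest path between two nucleotides in the secondary structure graph.

    Assumption: backbone edges (i, i+1) and base-pair edges all have weight 1,
    so a breadth-first search is sufficient.
    """
    n = len(dot_bracket)
    partner = pair_table(dot_bracket)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == end:
            return dist[i]
        neighbours = [i - 1, i + 1]
        if i in partner:
            neighbours.append(partner[i])
        for j in neighbours:
            if 0 <= j < n and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return None  # unreachable only if an index lies outside the sequence


# Hypothetical toy structure: two hairpins joined by a short linker.
structure = "((((....))))..((...))"
print(graph_distance(structure, 0, len(structure) - 1))  # prints 5
```

The two ends of this toy molecule are 20 backbone steps apart but only 5 steps apart in the graph, because the base pairs act as shortcuts. Averaging such distances over structures sampled from the ensemble would be one way to approximate the distribution discussed above.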
Conclusions:
The graph-distance distribution can be computed using a dynamic programming approach. Although a crude approximation of reality, our initial results indicate that the graph-distance can be related to the smFRET data. The additional file and the software of our paper are available from http://www.rna.uni-jena.de/RNAgraphdist.html.
http://www.almob.org/content/9/1/19
Jing Qin, Markus Fricke, Manja Marz, Peter Stadler, Rolf Backofen
Algorithms for Molecular Biology 2014, 9:19, published 2014-09-11, doi:10.1186/1748-7188-9-19

Optimal computation of all tandem repeats in a weighted sequence
Background:
Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment.
Results:
Crochemore’s repetitions algorithm, also referred to as Crochemore’s partitioning algorithm, was introduced in 1981, and was the first optimal $O(n \log n)$-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore’s partitioning algorithm for weighted sequences, which requires optimal $O(n \log n)$ time, thus improving on the best known $O(n^{2})$-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.
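As an illustration of what a tandem repeat in a weighted sequence means, here is a brute-force sketch that is far slower than the algorithms mentioned above and is not related to Crochemore's partitioning. It assumes that positions are independent, that a weighted sequence is a list of letter-to-probability dictionaries, and that a repeat counts only if its occurrence probability reaches a user-chosen threshold. These conventions and all names are assumptions, and only squares (two adjacent copies) are reported.

```python
def best_square_probability(wseq, start, period):
    """Highest probability of any factor u such that uu occurs at `start`.

    Because positions are assumed independent, the best letter can be
    chosen separately for each offset k within the period.
    """
    prob = 1.0
    for k in range(period):
        left = wseq[start + k]
        right = wseq[start + period + k]
        prob *= max(left[c] * right.get(c, 0.0) for c in left)
    return prob


def squares(wseq, threshold):
    """All (start, period) pairs whose best square reaches the threshold."""
    n = len(wseq)
    found = []
    for period in range(1, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            if best_square_probability(wseq, start, period) >= threshold:
                found.append((start, period))
    return found


# Hypothetical weighted sequence: position 1 is C or G with equal probability.
wseq = [
    {"A": 1.0},
    {"C": 0.5, "G": 0.5},
    {"A": 1.0},
    {"C": 1.0},
]
print(squares(wseq, 0.25))  # prints [(0, 2)]
```

Here the factor AC repeated twice occurs at position 0 with probability 0.5, so it is reported for a threshold of 0.25 but would be dropped for a threshold above 0.5.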
http://www.almob.org/content/9/1/21
Carl Barton, Costas Iliopoulos, Solon Pissis
Algorithms for Molecular Biology 2014, 9:21, published 2014-08-16, doi:10.1186/s13015-014-0021-5

Using the message passing algorithm on discrete data to detect faults in boolean regulatory networks
Background:
An important problem in systems biology is to model gene regulatory networks, which can then be utilized to develop novel therapeutic methods for cancer treatment. Knowledge about which proteins/genes are dysregulated in a regulatory network, such as the Mitogen Activated Protein Kinase (MAPK) network, can be used not only to decide which therapy to use for a particular case of cancer, but also to help discover effective targets for new drugs.
Results:
In this work we demonstrate how one can start from a model signal transduction network derived from prior knowledge, and infer from gene expression data the probable locations of dysregulations in the network. Our model is based on Boolean networks, and the inference problem is solved using a version of the message passing algorithm. We have done simulation experiments on synthetic data to verify the efficacy of the algorithm as compared to the results from the much more computationally intensive Markov Chain Monte-Carlo methods. We also applied the model to analyze data collected from fibroblasts, thereby demonstrating how this model can be used on real world data.
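The idea of locating a dysregulation in a Boolean network can be sketched on a toy example. The code below does not use message passing; it substitutes exhaustive enumeration of single stuck-at faults, which is only feasible for very small networks. The network, its node names, and the rules are hypothetical and carry no biological meaning.

```python
# Hypothetical toy network, listed in an order in which every rule only
# reads nodes that are inputs or have already been evaluated.
RULES = [
    ("b", lambda s: s["a"]),
    ("c", lambda s: s["b"] and not s["inhibitor"]),
    ("d", lambda s: s["c"] or s["b"]),
]


def simulate(inputs, stuck=None):
    """Evaluate the network; `stuck` maps a node to a forced Boolean value."""
    state = dict(inputs)
    for node, rule in RULES:
        state[node] = rule(state)
        if stuck and node in stuck:
            state[node] = stuck[node]  # the fault overrides the rule
    return state


def candidate_faults(experiments):
    """Single stuck-at faults consistent with every (inputs, observed) pair."""
    candidates = []
    for node, _ in RULES:
        for value in (False, True):
            consistent = all(
                all(simulate(inputs, {node: value})[k] == v for k, v in observed.items())
                for inputs, observed in experiments
            )
            if consistent:
                candidates.append((node, value))
    return candidates


# One experiment: with input a off, a healthy network would have d off,
# but d is observed on while c is observed off.
experiments = [({"a": False, "inhibitor": False}, {"c": False, "d": True})]
print(candidate_faults(experiments))  # prints [('d', True)]
```

Enumeration grows quickly with the number of nodes and simultaneous faults, which is one reason approximate inference schemes such as message passing are attractive for larger networks.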
http://www.almob.org/content/9/1/20
Anwoy Mohanty, Aniruddha Datta, Vijayanagaram Venkatraj
Algorithms for Molecular Biology 2014, 9:20, published 2014-08-16, doi:10.1186/s13015-014-0020-6

Not assessing the efficiency of multiple sequence alignment programs
One can search for messages in the digits of π or a Kazakhstan telephone book, but there may be hidden messages closer to home. A recent publication in this journal purportedly compared a set of multiple sequence alignment programs. The real purpose of the article may have been to remind readers how to present scientific data.
http://www.almob.org/content/9/1/18
Andrew Torda
Algorithms for Molecular Biology 2014, 9:18, published 2014-07-05, doi:10.1186/1748-7188-9-18

RNA-RNA interaction prediction using genetic algorithm
Background:
RNA-RNA interaction plays an important role in the regulation of gene expression and cell development. In this process, an RNA molecule prohibits the translation of another RNA molecule by establishing stable interactions with it. In the RNA-RNA interaction prediction problem, two RNA sequences are given as inputs and the goal is to find the optimal secondary structure of the two RNAs, including the interactions between them. Several algorithms have been proposed to predict RNA-RNA interaction structure. However, most of them suffer from high computational time.
Results:
In this paper, we introduce a novel genetic algorithm called GRNAs to predict RNA-RNA interaction. Applied to several standard datasets, the proposed algorithm achieves appropriate accuracy with lower time complexity than the other state-of-the-art algorithms. Each individual is a secondary structure of two interacting RNAs, and the minimum free energy serves as the fitness function for each individual. In each generation, crossover and mutation operations drive the population toward the optimal secondary structure (minimum free energy structure) of the two interacting RNAs.
Conclusions:
The algorithm is well suited to joint secondary structure prediction. The results achieved on a set of known interacting RNA pairs are compared with those of other related algorithms, and they demonstrate the effectiveness and validity of the proposed algorithm. The time complexity of the algorithm in each iteration is shown to be as efficient as that of the other approaches.
http://www.almob.org/content/9/1/17
Soheila Montaseri, Fatemeh Zare-Mirakabad, Nasrollah Moghadam-Charkari
Algorithms for Molecular Biology 2014, 9:17, published 2014-06-29, doi:10.1186/1748-7188-9-17

Enumerating all maximal frequent subtrees in collections of phylogenetic trees
Background:
A common problem in phylogenetic analysis is to identify frequent patterns in a collection of phylogenetic trees. The goal is, roughly, to find a subset of the species (taxa) on which all or some significant subset of the trees agree. One popular method to do so is through maximum agreement subtrees (MASTs). MASTs are also used, among other things, as a metric for comparing phylogenetic trees, for computing congruence indices, and for identifying horizontal gene transfer events.
Results:
We give algorithms and experimental results for two approaches to identify common patterns in a collection of phylogenetic trees, one based on agreement subtrees, called maximal agreement subtrees, the other on frequent subtrees, called maximal frequent subtrees. These approaches can return subtrees on larger sets of taxa than MASTs, and can reveal new common phylogenetic relationships not present in either MASTs or the majority rule tree (a popular consensus method). Our current implementation is available on the web at https://code.google.com/p/mfst-miner/.
Conclusions:
Our computational results confirm that maximal agreement subtrees and all maximal frequent subtrees can reveal a more complete phylogenetic picture of the common patterns in collections of phylogenetic trees than maximum agreement subtrees; they are also often more resolved than the majority rule tree. Further, our experiments show that enumerating maximal frequent subtrees is considerably more practical than enumerating ordinary (not necessarily maximal) frequent subtrees.
http://www.almob.org/content/9/1/16
Akshay Deepak, David Fernández-Baca
Algorithms for Molecular Biology 2014, 9:16, published 2014-06-18, doi:10.1186/1748-7188-9-16

Computing the skewness of the phylogenetic mean pairwise distance in linear time
Background:
The phylogenetic Mean Pairwise Distance (MPD) is one of the most popular measures for computing the phylogenetic distance between a given group of species. More specifically, for a phylogenetic tree $\mathcal{T}$ and for a set of species R represented by a subset of the leaf nodes of $\mathcal{T}$, the MPD of R is equal to the average cost of all possible simple paths in $\mathcal{T}$ that connect pairs of nodes in R.
Among other phylogenetic measures, the MPD is used as a tool for deciding if the species of a given group R are closely related. To do this, it is important to compute not only the value of the MPD for this group but also the expectation, the variance, and the skewness of this metric. Although efficient algorithms have been developed for computing the expectation and the variance of the MPD, there has been no approach so far for computing the skewness of this measure.
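The definition of the MPD given above can be written out directly. The sketch below computes only the MPD value itself for a given set R, using a straightforward traversal from each species; it does not compute the expectation, variance, or skewness, and it is not the linear-time method of the paper. The tree representation, the function names, and the example tree are assumptions.

```python
from itertools import combinations


def build_tree(edges):
    """Adjacency lists for an undirected weighted tree from (u, v, cost) edges."""
    tree = {}
    for u, v, cost in edges:
        tree.setdefault(u, []).append((v, cost))
        tree.setdefault(v, []).append((u, cost))
    return tree


def path_costs_from(tree, source):
    """Cost of the unique simple path from `source` to every other node."""
    cost = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in tree[u]:
            if v not in cost:
                cost[v] = cost[u] + w
                stack.append(v)
    return cost


def mean_pairwise_distance(tree, species):
    """Average path cost over all unordered pairs of distinct species."""
    costs = {s: path_costs_from(tree, s) for s in species}
    pairs = list(combinations(species, 2))
    return sum(costs[a][b] for a, b in pairs) / len(pairs)


# Hypothetical tree with leaves A, B, C and edge costs chosen for illustration.
tree = build_tree([
    ("root", "x", 1.0),
    ("x", "A", 1.0),
    ("x", "B", 2.0),
    ("root", "C", 3.0),
])
print(mean_pairwise_distance(tree, ["A", "B", "C"]))
```

For this tree the three pairwise path costs are 3, 5, and 6, so the MPD of {A, B, C} is 14/3. The traversal from every species makes the cost grow with the product of the group size and the tree size, which is acceptable for a demonstration but not for large trees.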
Results:
In the present work we describe how to compute the skewness of the MPD on a tree $\mathcal{T}$ optimally, in Θ(n) time; here n is the size of the tree $\mathcal{T}$. So far this is the first result that leads to an exact, let alone efficient, computation of the skewness for any popular phylogenetic distance measure. Moreover, we show how to compute in Θ(n) time several interesting quantities in $\mathcal{T}$ that could serve as building blocks for efficiently computing the skewness of other phylogenetic measures.
Conclusions:
The optimal computation of the skewness of the MPD that is outlined in this work provides one more tool for studying the phylogenetic relatedness of species in large phylogenetic trees. Until now this has been infeasible, given that traditional techniques for computing the skewness are inefficient and based on inexact resampling.
http://www.almob.org/content/9/1/15
Constantinos Tsirogiannis, Brody Sandel
Algorithms for Molecular Biology 2014, 9:15, published 2014-06-14, doi:10.1186/1748-7188-9-15

Identification of alternative topological domains in chromatin
Chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally-relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The ensemble of domains we identify allows us to quantify the degree to which the domain structure is hierarchical as opposed to overlapping, and our analysis reveals a pronounced hierarchical structure in which larger stable domains tend to completely contain smaller domains. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone marks at the boundaries.
http://www.almob.org/content/9/1/14
Darya Filippova, Rob Patro, Geet Duggal, Carl Kingsford
Algorithms for Molecular Biology 2014, 9:14, published 2014-05-03, doi:10.1186/1748-7188-9-14

Characterizing compatibility and agreement of unrooted trees via cuts in graphs
Background:
Deciding whether there is a single tree (a supertree) that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees.
Testing for compatibility is an NP-complete problem; however, the problem is solvable in polynomial time when the number of input trees is fixed. Testing for agreement is also NP-complete, but it is not known whether it is fixed-parameter tractable. Compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. No characterization of agreement was known.
Results:
We present a simple and natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement.
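The compatibility of splits mentioned above has a compact classical test: two splits A|B and C|D of the same taxon set are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The sketch below implements only this pairwise test on splits, not the minimal-cut characterization on the display graph; the function names and the example taxa are assumptions.

```python
def compatible(split1, split2):
    """True if two splits of the same taxon set can appear in one tree.

    Each split is a pair of disjoint sets whose union is the taxon set.
    """
    (a, b), (c, d) = split1, split2
    return any(not (x & y) for x in (a, b) for y in (c, d))


def pairwise_compatible(splits):
    """True if every pair of splits in the collection is compatible."""
    return all(
        compatible(s, t)
        for i, s in enumerate(splits)
        for t in splits[i + 1:]
    )


# Hypothetical splits on the taxa 1..5.
s1 = (frozenset({1, 2}), frozenset({3, 4, 5}))
s2 = (frozenset({1, 2, 3}), frozenset({4, 5}))
s3 = (frozenset({1, 3}), frozenset({2, 4, 5}))

print(compatible(s1, s2))  # prints True
print(compatible(s1, s3))  # prints False
```

The splits s1 and s2 are compatible because {1, 2} and {4, 5} do not intersect, whereas s1 and s3 overlap in all four combinations and therefore cannot both be displayed by a single tree.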
Conclusions:
Explicit characterizations of tree compatibility and agreement are essential to finding practical algorithms for these problems. The simplicity of the characterizations presented here could help to achieve this goal.
http://www.almob.org/content/9/1/13
Sudheer Vakati, David Fernández-Baca
Algorithms for Molecular Biology 2014, 9:13, published 2014-04-17, doi:10.1186/1748-7188-9-13