Algorithms for Molecular Biology - Latest Articles
http://www.almob.org
The latest research articles published by Algorithms for Molecular Biology (last updated 2015-06-27)

A polynomial delay algorithm for the enumeration of bubbles with length constraints in directed graphs

Background:
The problem of enumerating bubbles with length constraints in directed graphs arises in transcriptomics where the question is to identify all alternative splicing events present in a sample of mRNAs sequenced by RNA-seq.
Results:
We present a new algorithm for enumerating bubbles with length constraints in weighted directed graphs. This is the first polynomial delay algorithm for this problem and we show that in practice, it is faster than previous approaches.
Conclusion:
This settles one of the main open questions from Sacomoto et al. (BMC Bioinform 13:5, 2012). Moreover, the new algorithm allows us to deal with larger instances and possibly detect longer alternative splicing events.
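As a toy illustration of the objects being enumerated, the sketch below finds bubbles (pairs of internally vertex-disjoint s-t paths) with a length bound in a small directed graph. This is a brute-force sketch with hypothetical data: the published algorithm achieves polynomial delay, which this naive version does not.

```python
from itertools import combinations

def simple_paths(graph, s, t, path=None):
    # Enumerate all simple s-t paths by DFS (exponential in general;
    # fine for this tiny example, but not polynomial delay).
    path = path or [s]
    if s == t:
        yield path
        return
    for v in graph.get(s, []):
        if v not in path:
            yield from simple_paths(graph, v, t, path + [v])

def bubbles(graph, s, t, max_len):
    # A bubble: two s-t paths sharing no internal vertex, here with each
    # path's length (edge count) bounded by max_len.
    paths = [p for p in simple_paths(graph, s, t) if len(p) - 1 <= max_len]
    for p, q in combinations(paths, 2):
        if set(p[1:-1]).isdisjoint(q[1:-1]):
            yield p, q

# Toy graph mimicking an alternative splicing event: two variants from s to t.
g = {"s": ["a", "b"], "a": ["t"], "b": ["c"], "c": ["t"]}
found = list(bubbles(g, "s", "t", max_len=3))
```

In the RNA-seq setting, the two vertex-disjoint paths of a bubble correspond to the two variants of a splicing event in the de Bruijn graph of the reads.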
http://www.almob.org/content/10/1/20
Gustavo Sacomoto, Vincent Lacroix, Marie-France Sagot. Algorithms for Molecular Biology (ISSN 1748-7188) 2015, 10:20 (2015-06-27). doi:10.1186/s13015-015-0046-4

Fair evaluation of global network aligners

Background:
Analogous to genomic sequence alignment, biological network alignment identifies conserved regions between networks of different species. Function can then be transferred between aligned network regions, from well- to poorly-annotated species. Network alignment typically encompasses two algorithmic components: a node cost function (NCF), which measures similarities between nodes in different networks, and an alignment strategy (AS), which uses these similarities to rapidly identify high-scoring alignments. Different methods use different NCFs and different ASs. Thus, it is unclear whether the superiority of a method comes from its NCF, its AS, or both. We previously showed, using the state-of-the-art methods MI-GRAAL and IsoRankN, that combining the NCF of one method and the AS of another can yield a new, superior method. Here, we evaluate MI-GRAAL against a newer approach, GHOST, by mixing and matching the methods' NCFs and ASs to potentially further improve alignment quality. While doing so, we address important questions that have not been asked systematically thus far. First, we ask how much of the NCF information should come from protein sequence data compared to network topology data. Existing methods determine this parameter more or less arbitrarily, which could affect alignment quality. Second, when topological information is used in NCF, we ask how large the neighborhoods of the compared nodes should be. Existing methods assume that the larger the neighborhood size, the better.
Results:
Our findings are as follows. MI-GRAAL's NCF is superior to GHOST's NCF, while the performance of the methods' ASs is data-dependent. Thus, for data on which GHOST's AS is superior to MI-GRAAL's AS, the combination of MI-GRAAL's NCF and GHOST's AS represents a new superior method. Also, the amount of sequence information used within NCF does not affect alignment quality, while the inclusion of topological information is crucial for producing good alignments. Finally, larger neighborhood sizes are preferred, but often it is the second-largest size that is superior; using this size instead of the largest one would decrease computational complexity.
Conclusion:
Taken together, our results represent general recommendations for a fair evaluation of network alignment methods and in particular of two-stage NCF-AS approaches.
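The two-stage NCF-AS decomposition discussed above can be illustrated with a minimal sketch: a hypothetical NCF blending sequence similarity with a degree-based topological similarity (weighted by a parameter alpha), plugged into a hypothetical greedy AS. All names, data, and scoring choices here are illustrative assumptions, not the actual MI-GRAAL or GHOST components.

```python
def ncf(seq_sim, deg1, deg2, u, v, alpha=0.5):
    # Hypothetical node cost function: blend a degree-based topological
    # similarity with sequence similarity; alpha weights topology.
    topo = 1 - abs(deg1[u] - deg2[v]) / max(deg1[u], deg2[v], 1)
    return alpha * topo + (1 - alpha) * seq_sim[(u, v)]

def greedy_as(nodes1, nodes2, score):
    # Hypothetical alignment strategy: repeatedly match the highest-scoring
    # still-unmatched node pair.
    pairs = sorted(((score(u, v), u, v) for u in nodes1 for v in nodes2),
                   reverse=True)
    used1, used2, alignment = set(), set(), {}
    for _, u, v in pairs:
        if u not in used1 and v not in used2:
            alignment[u] = v
            used1.add(u)
            used2.add(v)
    return alignment

# Two tiny networks: node degrees plus all-against-all sequence similarities.
deg1, deg2 = {"a": 3, "b": 1}, {"x": 3, "y": 1}
seq_sim = {(u, v): 0.5 for u in "ab" for v in "xy"}
alignment = greedy_as(["a", "b"], ["x", "y"],
                      lambda u, v: ncf(seq_sim, deg1, deg2, u, v))
```

Because the NCF and AS are separate callables, either can be swapped out independently, which is exactly the mix-and-match evaluation the article performs.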
http://www.almob.org/content/10/1/19
Joseph Crawford, Yihan Sun, Tijana Milenković. Algorithms for Molecular Biology 2015, 10:19 (2015-06-09). doi:10.1186/s13015-015-0050-8

GAML: genome assembly by maximum likelihood

Background:
Resolution of repeats and scaffolding of shorter contigs are critical parts of genome assembly. Modern assemblers usually perform such steps by heuristics, often tailored to a particular technology for producing paired or long reads.
Results:
We propose a new framework that allows systematic combination of diverse sequencing datasets into a single assembly. We achieve this by searching for the assembly with maximum likelihood in a probabilistic model capturing the error rate, insert lengths, and other characteristics of the sequencing technology used to produce each dataset. We have implemented a prototype genome assembler, GAML, that can use any combination of insert sizes with Illumina or 454 reads, as well as PacBio reads. Our experiments show that we can assemble short genomes with N50 sizes and error rates comparable to those of ALLPATHS-LG or Cerulean. While ALLPATHS-LG and Cerulean each require a specific combination of datasets, GAML works on any combination.
Conclusions:
We have introduced a new probabilistic approach to genome assembly and demonstrated that this approach can lead to superior results when used to combine a diverse set of datasets from different sequencing technologies. Data and software are available at http://compbio.fmph.uniba.sk/gaml.
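The maximum-likelihood objective can be sketched with a deliberately simplified model: a substitution-only error model and ungapped read placements, under which an error-free assembly scores higher than one containing a substitution. GAML's real model also accounts for insert lengths and technology-specific error profiles; everything below is an illustrative assumption.

```python
from math import log

def read_log_likelihood(read, assembly, err=0.01):
    # Best ungapped placement of the read on the assembly under a
    # substitution-only error model (mismatch probability err, split
    # uniformly over the three alternative bases).
    best = float("-inf")
    for i in range(len(assembly) - len(read) + 1):
        ll = sum(log(1 - err) if read[j] == assembly[i + j] else log(err / 3)
                 for j in range(len(read)))
        best = max(best, ll)
    return best

def assembly_log_likelihood(reads, assembly, err=0.01):
    # Reads are treated as independent, so log-likelihoods add up.
    return sum(read_log_likelihood(r, assembly, err) for r in reads)

reads = ["ACGT", "CGTA"]
good = assembly_log_likelihood(reads, "ACGTA")   # error-free assembly
bad = assembly_log_likelihood(reads, "ACCTA")    # one substitution error
```

Maximizing this score over candidate assemblies is what turns assembly into a search problem in the framework above.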
http://www.almob.org/content/10/1/18
Vladimír Boža, Broňa Brejová, Tomáš Vinař. Algorithms for Molecular Biology 2015, 10:18 (2015-06-03). doi:10.1186/s13015-015-0052-6

An online peak extraction algorithm for ion mobility spectrometry data

Ion mobility (IM) spectrometry (IMS), coupled with multi-capillary columns (MCCs), has been gaining importance for biotechnological and medical applications because of its ability to detect and quantify volatile organic compounds (VOCs) at low concentrations in the air or in exhaled breath at ambient pressure and temperature. Ongoing miniaturization of spectrometers creates the need for reliable on-the-fly data analysis in small embedded low-power devices. We present the first fully automated online peak extraction method for MCC/IMS measurements consisting of several thousand individual spectra. Each individual spectrum is processed as it arrives, removing the need to store the measurement before starting the analysis, as is currently the state of the art. Thus the analysis device can be an inexpensive low-power system such as the Raspberry Pi. The key idea is to extract one-dimensional peak models (with four parameters) from each spectrum and then merge these into peak chains and finally two-dimensional peak models. We describe the different algorithmic steps in detail and evaluate the online method against state-of-the-art peak extraction methods.
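As a minimal illustration of online processing, the sketch below picks local maxima above a threshold in a single constant-memory pass over a stream of intensities. This is only an assumed stand-in for the first step: the published method fits four-parameter peak models per spectrum and then merges them into chains.

```python
def online_peaks(stream, threshold=1.0):
    # Single-pass peak picking: report (index, height) whenever a run of
    # increases above the threshold turns into a decrease. Constant memory,
    # so the full measurement never needs to be stored.
    peaks, prev, rising = [], None, False
    for i, x in enumerate(stream):
        if prev is not None:
            if x > prev:
                rising = True
            elif x < prev:
                if rising and prev >= threshold:
                    peaks.append((i - 1, prev))
                rising = False
        prev = x
    return peaks

# One simulated spectrum arriving as a stream of intensities.
spectrum = [0.1, 0.3, 2.0, 0.4, 0.2, 1.5, 3.1, 0.9]
peaks = online_peaks(spectrum, threshold=1.0)
```

The constant-memory property is what makes this style of processing feasible on an embedded device such as the Raspberry Pi mentioned above.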
http://www.almob.org/content/10/1/17
Dominik Kopczynski, Sven Rahmann. Algorithms for Molecular Biology 2015, 10:17 (2015-05-13). doi:10.1186/s13015-015-0045-5

A graph modification approach for finding core–periphery structures in protein interaction networks

The core–periphery model for protein–protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global core–periphery structure for a given PPI network. We propose two exact graph-theoretic formulations for this task, which aim to fit the input network to a hypothetical ground-truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NP-hardness. Then we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network.
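The edge-modification objective can be illustrated for a fixed partition: given a candidate core and periphery, count the edge insertions and deletions needed so the core becomes a clique and the periphery an independent set. This is one plausible reading of a ground-truth model, sketched as an assumption; the article's exact formulations differ in how peripheries attach to clusters.

```python
from itertools import combinations

def cp_cost(edges, core, periphery):
    # Edits needed to fit the graph to the hypothetical ground truth:
    # insert missing core-core edges (core must be a clique) and delete
    # periphery-periphery edges (periphery must be an independent set).
    # Core-periphery edges are left unconstrained in this sketch.
    present = {frozenset(e) for e in edges}
    insertions = sum(1 for u, v in combinations(sorted(core), 2)
                     if frozenset((u, v)) not in present)
    deletions = sum(1 for u, v in combinations(sorted(periphery), 2)
                    if frozenset((u, v)) in present)
    return insertions + deletions

# Triangle core {1,2,3}; the single periphery edge (4,5) must be deleted.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
cost = cp_cost(edges, core={1, 2, 3}, periphery={4, 5})
```

The hard part, of course, is the part this sketch assumes away: choosing the partition that minimizes this cost, which is what makes the problems NP-hard.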
http://www.almob.org/content/10/1/16
Sharon Bruckner, Falk Hüffner, Christian Komusiewicz. Algorithms for Molecular Biology 2015, 10:16 (2015-05-02). doi:10.1186/s13015-015-0043-7

A novel method for identifying disease-associated protein complexes based on functional similarity protein complex networks

Background:
Protein complexes formed by non-covalent interactions among proteins play important roles in cellular functions. Computational and purification methods have been used to identify many protein complexes and their cellular functions. However, their roles in disease have not yet been well characterized. Only a few studies exist on the identification of disease-associated protein complexes, and they mostly rely on complicated heterogeneous networks constructed from an out-of-date phenotype similarity network collected from the literature. In addition, they apply only to diseases for which tissue-specific data exist.
Methods:
In this study, we propose a method to identify novel disease-protein complex associations. First, we introduce a framework to construct functional similarity protein complex networks, in which two protein complexes are functionally connected by shared protein elements, by shared annotating GO terms, or by protein interactions between elements of the two complexes. Second, we propose a simple but effective neighborhood-based algorithm, which yields a local similarity measure, to rank disease candidate protein complexes.
Results:
Comparing the predictive performance of our proposed algorithm with that of two state-of-the-art network propagation algorithms, including one we used in our previous study, we found that it performed statistically significantly better than both algorithms for all the constructed functional similarity protein complex networks. In addition, it ran about 32 times faster than these two algorithms. Moreover, our proposed method always achieved high performance in terms of AUC values, irrespective of the way the functional similarity protein complex networks were constructed and the algorithm used. The performance of our method was also higher than that reported for some existing methods based on complicated heterogeneous networks. Finally, we tested our method on prostate cancer and selected the top 100 ranked candidate protein complexes. Interestingly, 69 of them were supported by evidence, since at least one of their protein elements is known to be associated with prostate cancer.
Conclusions:
Our proposed method, including the framework to construct functional similarity protein complex networks and the neighborhood-based algorithm on these networks, could be used for identification of novel disease-protein complex associations.
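A neighborhood-based local scoring of the kind described above can be sketched with illustrative data: each candidate complex is scored by its summed similarity to complexes already known to be disease-associated, and candidates are ranked by that score. The similarity values and names here are assumptions; the article's actual measure may differ.

```python
def rank_candidates(sim, known_disease, candidates):
    # Score a candidate complex by its summed similarity to complexes
    # already known to be disease-associated, then rank by score.
    def score(c):
        return sum(w for (a, b), w in sim.items()
                   if (a == c and b in known_disease)
                   or (b == c and a in known_disease))
    return sorted(candidates, key=score, reverse=True)

# Illustrative similarity network: candidates c1, c2; known complexes d1, d2.
sim = {("c1", "d1"): 0.9, ("c2", "d1"): 0.2, ("c1", "d2"): 0.3}
ranking = rank_candidates(sim, known_disease={"d1", "d2"},
                          candidates=["c1", "c2"])
```

Such a local score only inspects direct neighbors, which is one intuition for why it can run much faster than whole-network propagation.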
http://www.almob.org/content/10/1/14
Duc-Hau Le. Algorithms for Molecular Biology 2015, 10:14 (2015-04-28). doi:10.1186/s13015-015-0044-6

Algorithms for detecting and analysing autocatalytic sets

Background:
Autocatalytic sets are considered to be fundamental to the origin of life. Prior theoretical and computational work on the existence and properties of these sets has relied on a fast algorithm for detecting self-sustaining autocatalytic sets in chemical reaction systems. Here, we introduce and apply a modified version and several extensions of the basic algorithm: (i) a modification aimed at reducing the number of calls to the computationally most expensive part of the algorithm, (ii) the application of a previously introduced extension of the basic algorithm to sample the smallest possible autocatalytic sets within a reaction network, together with a statistical test that provides a probable lower bound on the number of such smallest sets, (iii) the introduction and application of another extension of the basic algorithm to detect autocatalytic sets in a reaction system where molecules can also inhibit (as well as catalyse) reactions, and (iv) a further, more abstract, extension of the theory behind searching for autocatalytic sets.
Results:
(i) The modified algorithm outperforms the original one in the number of calls to the computationally most expensive procedure, which in some cases also leads to a significant improvement in overall running time; (ii) our statistical test provides strong support for the existence of very large numbers (even millions) of minimal autocatalytic sets in a well-studied polymer model, where these minimal sets share about half of their reactions on average; (iii) “uninhibited” autocatalytic sets can be found in reaction systems that allow inhibition, but their number and sizes depend on the level of inhibition relative to the level of catalysis.
Conclusions:
(i) Improvements in the overall running time when searching for autocatalytic sets can potentially be obtained by using a modified version of the algorithm, (ii) the existence of large numbers of minimal autocatalytic sets can have important consequences for the possible evolvability of autocatalytic sets, (iii) inhibition can be efficiently dealt with as long as the total number of inhibitors is small.
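The basic detection algorithm referred to above (without the modifications or inhibition extension) can be sketched as an iterative reduction: repeatedly discard reactions that lack an available catalyst or whose reactants are not producible from the food set, until a fixed point is reached. The data layout and example reactions below are assumptions for illustration.

```python
def max_raf(reactions, food):
    # reactions: name -> (reactants, products, catalysts). Iteratively
    # discard reactions lacking a producible catalyst or producible
    # reactants; what survives is the maximal self-sustaining subset.
    active = set(reactions)
    while True:
        # closure of molecules producible from the food set
        produced = set(food)
        changed = True
        while changed:
            changed = False
            for r in active:
                reac, prod, _ = reactions[r]
                if set(reac) <= produced and not set(prod) <= produced:
                    produced |= set(prod)
                    changed = True
        keep = {r for r in active
                if set(reactions[r][0]) <= produced
                and set(reactions[r][2]) & produced}
        if keep == active:
            return active
        active = keep

reactions = {
    "r1": (["a", "b"], ["c"], ["c"]),  # catalysed by its own product
    "r2": (["c"], ["d"], ["z"]),       # catalyst z is never produced
}
raf = max_raf(reactions, food={"a", "b"})
```

Note that r1 survives even though its catalyst is its own product: producibility is checked without requiring catalysis, which is what makes such self-sustaining sets detectable.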
http://www.almob.org/content/10/1/15
Wim Hordijk, Joshua Smith, Mike Steel. Algorithms for Molecular Biology 2015, 10:15 (2015-04-28). doi:10.1186/s13015-015-0042-8

On the family-free DCJ distance and similarity

Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so-called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearrangement operations that transform one given genome into another. In a family-based setting, genes are grouped into gene families, and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program for its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
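For contrast with the family-free setting, the classical family-based DCJ distance is easy to sketch for circular unichromosomal genomes with one gene per family: d = n - c, where c is the number of cycles in the adjacency graph. The representation below is a simplified assumption covering only that case.

```python
def adjacencies(genome):
    # Circular genome as a list of signed genes; each gene has a tail (t)
    # and head (h) extremity, swapped when the gene is inverted.
    def ends(g):
        return (((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")))
    n = len(genome)
    return [frozenset((ends(genome[i])[1], ends(genome[(i + 1) % n])[0]))
            for i in range(n)]

def dcj_distance(g1, g2):
    # d = n - c: walk the adjacency graph, alternating between the two
    # genomes' adjacencies, and count the cycles.
    def neighbours(adj):
        nbr = {}
        for s in adj:
            x, y = tuple(s)
            nbr[x], nbr[y] = y, x
        return nbr
    nbr1, nbr2 = neighbours(adjacencies(g1)), neighbours(adjacencies(g2))
    seen, cycles = set(), 0
    for start in nbr1:
        if start in seen:
            continue
        cycles += 1
        cur, use_first = start, True
        while cur not in seen:
            seen.add(cur)
            cur = (nbr1 if use_first else nbr2)[cur]
            use_first = not use_first
    return len(g1) - cycles

d_same = dcj_distance([1, 2, 3], [1, 2, 3])   # identical genomes
d_inv = dcj_distance([1, 2, 3], [1, -2, 3])   # one inversion away
```

The family-free problem drops the one-to-one gene correspondence this sketch relies on, which is precisely where the APX-hardness comes from.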
http://www.almob.org/content/10/1/13
Fábio Martinez, Pedro Feijão, Marília Braga, Jens Stoye. Algorithms for Molecular Biology 2015, 10:13 (2015-04-01). doi:10.1186/s13015-015-0041-9

Sorting signed permutations by short operations

Background:
During evolution, global mutations may alter the order and the orientation of the genes in a genome. Such mutations are referred to as rearrangement events, or simply operations. In unichromosomal genomes, the most common operations are reversals, which are responsible for reversing the order and orientation of a sequence of genes, and transpositions, which are responsible for switching the location of two contiguous portions of a genome. The problem of computing the minimum sequence of operations that transforms one genome into another – which is equivalent to the problem of sorting a permutation into the identity permutation – is a well-studied problem that finds application in comparative genomics. There are a number of works concerning this problem in the literature, but they generally do not take into account the length of the operations (i.e. the number of genes affected by the operations). Since it has been observed that short operations are prevalent in the evolution of some species, algorithms that efficiently solve this problem in the special case of short operations are of interest.
Results:
In this paper, we investigate the problem of sorting a signed permutation by short operations. More precisely, we study four flavors of this problem: (i) the problem of sorting a signed permutation by reversals of length at most 2; (ii) the problem of sorting a signed permutation by reversals of length at most 3; (iii) the problem of sorting a signed permutation by reversals and transpositions of length at most 2; and (iv) the problem of sorting a signed permutation by reversals and transpositions of length at most 3. We present polynomial-time solutions for problems (i) and (iii), a 5-approximation for problem (ii), and a 3-approximation for problem (iv). Moreover, we show that the expected approximation ratio of the 5-approximation algorithm is not greater than 3 for random signed permutations with more than 12 elements. Finally, we present experimental results that show that the approximation ratios of the approximation algorithms cannot be smaller than 3. In particular, this means that the approximation ratio of the 3-approximation algorithm is tight.
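The short-reversal setting of problems (i) and (ii) can be sketched directly: a reversal of length 2 swaps adjacent elements and flips both signs, and a reversal of length 1 flips a single sign, so a bubble-sort pass followed by sign fixes always sorts. This sketch produces a valid sorting sequence but not necessarily a minimum-length one, which is what the paper's polynomial-time algorithm for problem (i) guarantees.

```python
def sort_by_short_reversals(perm):
    # Sort a signed permutation to the identity using only reversals of
    # length 2 (swap adjacent elements, flipping both signs) and length 1
    # (flip one sign). Bubble-sort style: a valid sorting sequence, not
    # necessarily one of minimum length.
    p, ops = list(perm), []
    n = len(p)
    for _ in range(n):
        for j in range(n - 1):
            if abs(p[j]) > abs(p[j + 1]):
                p[j], p[j + 1] = -p[j + 1], -p[j]   # length-2 reversal
                ops.append(("rev2", j))
    for j in range(n):
        if p[j] < 0:
            p[j] = -p[j]                            # length-1 reversal
            ops.append(("rev1", j))
    return p, ops

sorted_p, ops = sort_by_short_reversals([3, -1, 2])
```

Each recorded operation affects at most two genes, matching the biological observation that short operations dominate in some lineages.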
http://www.almob.org/content/10/1/12
Gustavo Galvão, Orlando Lee, Zanoni Dias. Algorithms for Molecular Biology 2015, 10:12 (2015-03-25). doi:10.1186/s13015-015-0040-x

A virtual pebble game to ensemble average graph rigidity

Background:
The body-bar Pebble Game (PG) algorithm is commonly used to calculate network rigidity properties in proteins and polymeric materials. To account for fluctuating interactions such as hydrogen bonds, an ensemble of constraint topologies is sampled, and average network properties are obtained by averaging the PG characterizations. At a simpler level of sophistication, Maxwell constraint counting (MCC) provides a rigorous lower bound on the number of internal degrees of freedom (DOF) within a body-bar network, and it is commonly employed to test whether a molecular structure is globally under-constrained or over-constrained. MCC is a mean field approximation (MFA) that ignores spatial fluctuations of distance constraints by replacing the actual molecular structure with an effective medium in which distance constraints are distributed with perfectly uniform density.
Results:
The Virtual Pebble Game (VPG) algorithm is an MFA that retains spatial inhomogeneity in the density of constraints on all length scales. Network fluctuations due to distance constraints that may be present or absent, governed by binary random dynamic variables, are suppressed by replacing all possible constraint topology realizations with the probabilities that the distance constraints are present. The VPG algorithm is isomorphic to the PG algorithm: the integers counting “pebbles” placed on vertices or edges in the PG map to real numbers representing the probability of finding a pebble. In the VPG, edges are assigned pebble capacities, and pebble movements become a continuous flow of probability within the network. Comparisons between VPG and averaged PG results over a test set of proteins and disordered lattices demonstrate that the VPG estimates the ensemble-averaged PG results well.
Conclusions:
The VPG runs about 20% faster than a single PG run, and it provides a pragmatic alternative to averaging PG rigidity characteristics over an ensemble of constraint topologies. The utility of the VPG falls between the most accurate but slowest method, ensemble averaging over hundreds to thousands of independent PG runs, and the fastest but least accurate MCC.
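The MCC baseline mentioned above is simple enough to state in code: in a body-bar network each rigid body contributes 6 DOF and each bar removes at most one, with 6 DOF always remaining for global rigid-body motion. The exact clamping and threshold conventions below are assumptions for illustration.

```python
def maxwell_dof(n_bodies, n_bars):
    # Maxwell constraint counting for a body-bar network: each body has
    # 6 DOF, each bar removes at most one; 6 DOF always remain for the
    # global rigid-body motion, so the count is clamped from below.
    return max(6, 6 * n_bodies - n_bars)

def globally_overconstrained(n_bodies, n_bars):
    # More bars than the 6*N - 6 needed to rigidify the whole network.
    return n_bars > 6 * n_bodies - 6

dof = maxwell_dof(2, 5)  # two bodies joined by 5 bars
```

Because MCC spreads constraints uniformly, it is a lower bound only: the PG and VPG can report more internal DOF when constraints are spatially clustered.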
http://www.almob.org/content/10/1/11
Luis González, Hui Wang, Dennis Livesay, Donald Jacobs. Algorithms for Molecular Biology 2015, 10:11 (2015-03-18). doi:10.1186/s13015-015-0039-3