Algorithms for Molecular Biology - Latest Articles
http://www.almob.org
The latest research articles published by Algorithms for Molecular Biology (ISSN 1748-7188; updated 2015-05-13).

An online peak extraction algorithm for ion mobility spectrometry data
Ion mobility (IM) spectrometry (IMS), coupled with multi-capillary columns (MCCs), has been gaining importance for biotechnological and medical applications because of its ability to detect and quantify volatile organic compounds (VOCs) at low concentrations in the air or in exhaled breath at ambient pressure and temperature. Ongoing miniaturization of spectrometers creates the need for reliable on-the-fly data analysis in small embedded low-power devices. We present the first fully automated online peak extraction method for MCC/IMS measurements consisting of several thousand individual spectra. Each individual spectrum is processed as it arrives, removing the need to store the measurement before starting the analysis, as is currently the state of the art. Thus the analysis device can be an inexpensive low-power system such as the Raspberry Pi. The key idea is to extract one-dimensional peak models (with four parameters) from each spectrum and then merge these into peak chains and finally two-dimensional peak models. We describe the different algorithmic steps in detail and evaluate the online method against state-of-the-art peak extraction methods.
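A minimal Python sketch of the online principle (not the authors' four-parameter peak model; the threshold and matching tolerance are invented): each arriving spectrum is reduced to peak descriptors, which are matched to open peak chains from previous spectra, so the full measurement never has to be stored.

```python
# Minimal sketch of the online idea; thresholds and tolerances are invented.

def extract_peaks(spectrum, threshold=5.0):
    """(index, height) pairs of local maxima above a noise threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        if spectrum[i] >= threshold and spectrum[i - 1] < spectrum[i] >= spectrum[i + 1]:
            peaks.append((i, spectrum[i]))
    return peaks

def merge_into_chains(chains, peaks, tol=2):
    """Attach each new peak to a chain whose last peak lies within `tol`
    positions, or start a new chain; only the chains are kept in memory."""
    for pos, height in peaks:
        for chain in chains:
            if abs(chain[-1][0] - pos) <= tol:
                chain.append((pos, height))
                break
        else:
            chains.append([(pos, height)])
    return chains
```

Chains that persist over several consecutive spectra would then be fitted with a two-dimensional peak model.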
http://www.almob.org/content/10/1/17
Dominik Kopczynski, Sven Rahmann. Algorithms for Molecular Biology 2015, 10:17 (2015-05-13). doi:10.1186/s13015-015-0045-5

A graph modification approach for finding core–periphery structures in protein interaction networks
The core–periphery model for protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global core–periphery structure for a given PPI network. We propose two exact graph-theoretic formulations for this task, which aim to fit the input network to a hypothetical ground truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NP-hardness. Then, we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network.
http://www.almob.org/content/10/1/16
Sharon Bruckner, Falk Hüffner, Christian Komusiewicz. Algorithms for Molecular Biology 2015, 10:16 (2015-05-02). doi:10.1186/s13015-015-0043-7

A novel method for identifying disease-associated protein complexes based on functional similarity protein complex networks
Background:
Protein complexes formed by non-covalent interactions among proteins play important roles in cellular functions. Computational and purification methods have been used to identify many protein complexes and their cellular functions. However, their roles in disease have not yet been well characterized, and only a few studies have addressed the identification of disease-associated protein complexes. Moreover, these studies mostly rely on complicated heterogeneous networks built from an out-of-date phenotype similarity database collected from the literature, and they apply only to diseases for which tissue-specific data exist.
Methods:
In this study, we propose a method to identify novel disease-protein complex associations. First, we introduce a framework to construct functional similarity protein complex networks where two protein complexes are functionally connected by either shared protein elements, shared annotating GO terms or based on protein interactions between elements in each protein complex. Second, we propose a simple but effective neighborhood-based algorithm, which yields a local similarity measure, to rank disease candidate protein complexes.
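A rough sketch of how such a neighborhood-based ranking could look (the graph structure and weights below are invented for illustration, not the paper's exact local similarity measure): a candidate's score is the similarity-weighted count of its neighbors that are already known to be disease-associated.

```python
# Hedged sketch of neighborhood-based ranking on a functional similarity
# network of protein complexes; data structures are invented.

def rank_candidates(graph, known_disease):
    """graph: {complex: {neighbor: similarity}}; known_disease: set of
    complexes already associated with the disease. Returns candidate
    complexes sorted by descending neighborhood score."""
    scores = {}
    for c, neighbors in graph.items():
        if c in known_disease:
            continue
        scores[c] = sum(w for nb, w in neighbors.items() if nb in known_disease)
    return sorted(scores, key=scores.get, reverse=True)
```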
Results:
Comparing the predictive performance of our algorithm with that of two state-of-the-art network propagation algorithms, including one used in our previous study, we found that it performed statistically significantly better than both algorithms on all the constructed functional similarity protein complex networks. In addition, it ran about 32 times faster than these two algorithms. Moreover, our method consistently achieved high AUC values irrespective of how the functional similarity protein complex networks were constructed and which algorithms were used. Its performance was also higher than that reported for some existing methods based on complicated heterogeneous networks. Finally, we tested our method on prostate cancer and selected the top 100 ranked candidate protein complexes. Interestingly, 69 of them were supported by evidence, since at least one of their protein members is known to be associated with prostate cancer.
Conclusions:
Our proposed method, including the framework to construct functional similarity protein complex networks and the neighborhood-based algorithm on these networks, could be used for identification of novel disease-protein complex associations.
http://www.almob.org/content/10/1/14
Duc-Hau Le. Algorithms for Molecular Biology 2015, 10:14 (2015-04-28). doi:10.1186/s13015-015-0044-6

Algorithms for detecting and analysing autocatalytic sets
Background:
Autocatalytic sets are considered to be fundamental to the origin of life. Prior theoretical and computational work on the existence and properties of these sets has relied on a fast algorithm for detecting self-sustaining autocatalytic sets in chemical reaction systems. Here, we introduce and apply a modified version and several extensions of the basic algorithm: (i) a modification aimed at reducing the number of calls to the computationally most expensive part of the algorithm, (ii) the application of a previously introduced extension of the basic algorithm to sample the smallest possible autocatalytic sets within a reaction network, together with a statistical test which provides a probable lower bound on the number of such smallest sets, (iii) the introduction and application of another extension of the basic algorithm to detect autocatalytic sets in a reaction system where molecules can also inhibit (as well as catalyse) reactions, and (iv) a further, more abstract, extension of the theory behind searching for autocatalytic sets.
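The basic detection algorithm referred to above works by iterated reduction; a minimal sketch, with reactions represented as invented (reactants, products, catalysts) triples, could look like this: repeatedly discard reactions whose reactants are not producible from the food set, or that are not catalysed by any producible molecule, until a fixed point is reached.

```python
# Hedged sketch of the reduction behind detecting self-sustaining
# autocatalytic (RAF) sets; the data structures are invented.

def closure(food, reactions):
    """All molecules producible from the food set with the given reactions."""
    mols = set(food)
    changed = True
    while changed:
        changed = False
        for reactants, products, _ in reactions:
            if set(reactants) <= mols and not set(products) <= mols:
                mols |= set(products)
                changed = True
    return mols

def max_raf(food, reactions):
    """Largest sub-network surviving the reduction (empty if none exists)."""
    current = list(reactions)
    while True:
        mols = closure(food, current)
        kept = [r for r in current
                if set(r[0]) <= mols and any(c in mols for c in r[2])]
        if len(kept) == len(current):
            return kept
        current = kept
```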
Results:
(i) The modified algorithm outperforms the original one in the number of calls to the computationally most expensive procedure, which, in some cases, also leads to a significant improvement in overall running time; (ii) our statistical test provides strong support for the existence of very large numbers (even millions) of minimal autocatalytic sets in a well-studied polymer model, where these minimal sets share about half of their reactions on average; (iii) “uninhibited” autocatalytic sets can be found in reaction systems that allow inhibition, but their number and sizes depend on the level of inhibition relative to the level of catalysis.
Conclusions:
(i) Improvements in the overall running time when searching for autocatalytic sets can potentially be obtained by using a modified version of the algorithm, (ii) the existence of large numbers of minimal autocatalytic sets can have important consequences for the possible evolvability of autocatalytic sets, (iii) inhibition can be efficiently dealt with as long as the total number of inhibitors is small.
http://www.almob.org/content/10/1/15
Wim Hordijk, Joshua Smith, Mike Steel. Algorithms for Molecular Biology 2015, 10:15 (2015-04-28). doi:10.1186/s13015-015-0042-8

On the family-free DCJ distance and similarity
Structural variation in genomes can be revealed by many (dis)similarity measures. Rearrangement operations, such as the so-called double-cut-and-join (DCJ), are large-scale mutations that can create complex changes and produce such variations in genomes. A basic task in comparative genomics is to find the rearrangement distance between two given genomes, i.e., the minimum number of rearrangement operations that transform one given genome into the other. In a family-based setting, genes are grouped into gene families, and efficient algorithms have already been presented to compute the DCJ distance between two given genomes. In this work we propose the problem of computing the DCJ distance of two given genomes without prior gene family assignment, directly using the pairwise similarities between genes. We prove that this new family-free DCJ distance problem is APX-hard and provide an integer linear program for its solution. We also study a family-free DCJ similarity and prove that its computation is NP-hard.
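For context, in the family-based setting the DCJ distance of two circular genomes over the same n genes equals n minus the number of cycles in their adjacency graph. A compact sketch of that classical computation (not this paper's family-free ILP; helper names are invented):

```python
# Family-based DCJ distance d = n - c for circular genomes given as
# signed gene orders; c is the number of adjacency-graph cycles.

def adjacencies(genome):
    """Adjacencies as frozensets of two gene extremities, where
    (g, 'h') / (g, 't') denote the head / tail of gene g."""
    adj = set()
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        right = (abs(a), 'h') if a > 0 else (abs(a), 't')
        left = (abs(b), 't') if b > 0 else (abs(b), 'h')
        adj.add(frozenset([right, left]))
    return adj

def dcj_distance(genome_a, genome_b):
    adj_a, adj_b = adjacencies(genome_a), adjacencies(genome_b)
    ext_to_a = {e: adj for adj in adj_a for e in adj}
    ext_to_b = {e: adj for adj in adj_b for e in adj}
    seen, cycles = set(), 0
    for start in ext_to_a:           # walk each adjacency-graph cycle,
        if start in seen:            # alternating A- and B-adjacencies
            continue
        cycles += 1
        e, use_a = start, True
        while e not in seen:
            seen.add(e)
            adj = ext_to_a[e] if use_a else ext_to_b[e]
            (e,) = set(adj) - {e}
            use_a = not use_a
    return len(genome_a) - cycles
```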
http://www.almob.org/content/10/1/13
Fábio Martinez, Pedro Feijão, Marília Braga, Jens Stoye. Algorithms for Molecular Biology 2015, 10:13 (2015-04-01). doi:10.1186/s13015-015-0041-9

Sorting signed permutations by short operations
Background:
During evolution, global mutations may alter the order and the orientation of the genes in a genome. Such mutations are referred to as rearrangement events, or simply operations. In unichromosomal genomes, the most common operations are reversals, which are responsible for reversing the order and orientation of a sequence of genes, and transpositions, which are responsible for switching the location of two contiguous portions of a genome. The problem of computing the minimum sequence of operations that transforms one genome into another – which is equivalent to the problem of sorting a permutation into the identity permutation – is a well-studied problem that finds application in comparative genomics. There are a number of works concerning this problem in the literature, but they generally do not take into account the length of the operations (i.e. the number of genes affected by the operations). Since it has been observed that short operations are prevalent in the evolution of some species, algorithms that efficiently solve this problem in the special case of short operations are of interest.
Results:
In this paper, we investigate the problem of sorting a signed permutation by short operations. More precisely, we study four flavors of this problem: (i) the problem of sorting a signed permutation by reversals of length at most 2; (ii) the problem of sorting a signed permutation by reversals of length at most 3; (iii) the problem of sorting a signed permutation by reversals and transpositions of length at most 2; and (iv) the problem of sorting a signed permutation by reversals and transpositions of length at most 3. We present polynomial-time solutions for problems (i) and (iii), a 5-approximation for problem (ii), and a 3-approximation for problem (iv). Moreover, we show that the expected approximation ratio of the 5-approximation algorithm is not greater than 3 for random signed permutations with more than 12 elements. Finally, we present experimental results that show that the approximation ratios of the approximation algorithms cannot be smaller than 3. In particular, this means that the approximation ratio of the 3-approximation algorithm is tight.
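To make the operation model concrete, here is a hedged sketch that sorts a signed permutation using only reversals of length at most 2: a length-2 reversal swaps two adjacent elements and flips both signs, while a length-1 reversal flips a single sign. This is a simple bubble-sort-style routine with no optimality guarantee, not the paper's polynomial-time algorithm.

```python
# Illustration of sorting with short (length <= 2) reversals; not optimal.

def short_reversal_sort(perm):
    """Sort a signed permutation of 1..n using reversals of length <= 2.
    Returns the sorted permutation and the list of applied operations."""
    p = list(perm)
    ops = []
    n = len(p)
    for _ in range(n):                   # bubble passes with length-2 reversals
        for j in range(n - 1):
            if abs(p[j]) > abs(p[j + 1]):
                p[j], p[j + 1] = -p[j + 1], -p[j]   # swap and flip both signs
                ops.append(("rev2", j))
    for j in range(n):                   # fix remaining signs: length-1 reversals
        if p[j] < 0:
            p[j] = -p[j]
            ops.append(("rev1", j))
    return p, ops
```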
http://www.almob.org/content/10/1/12
Gustavo Galvão, Orlando Lee, Zanoni Dias. Algorithms for Molecular Biology 2015, 10:12 (2015-03-25). doi:10.1186/s13015-015-0040-x

A virtual pebble game to ensemble average graph rigidity
Background:
The body-bar Pebble Game (PG) algorithm is commonly used to calculate network rigidity properties in proteins and polymeric materials. To account for fluctuating interactions such as hydrogen bonds, an ensemble of constraint topologies is sampled, and average network properties are obtained by averaging PG characterizations. At a simpler level of sophistication, Maxwell constraint counting (MCC) provides a rigorous lower bound on the number of internal degrees of freedom (DOF) within a body-bar network, and it is commonly employed to test whether a molecular structure is globally under-constrained or over-constrained. MCC is a mean field approximation (MFA) that ignores spatial fluctuations of distance constraints by replacing the actual molecular structure with an effective medium in which distance constraints are distributed with perfect uniform density.
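As an illustration of MCC (a hedged sketch with an invented function name): in the 3D body-bar picture, each body carries 6 degrees of freedom, each bar removes at most one, and 6 global rigid-body motions are discounted.

```python
# Maxwell constraint counting sketch for a 3D body-bar network.

def maxwell_count(n_bodies, n_bars):
    """Mean-field lower bound on internal DOF of a 3D body-bar network."""
    return max(0, 6 * n_bodies - n_bars - 6)
```

Because MCC ignores where the bars are placed, it is only a lower bound; the PG resolves the spatial distribution of constraints exactly.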
Results:
The Virtual Pebble Game (VPG) algorithm is a MFA that retains spatial inhomogeneity in the density of constraints on all length scales. Network fluctuations due to distance constraints that may be present or absent, governed by binary random dynamic variables, are suppressed by replacing all possible constraint topology realizations with the probabilities that distance constraints are present. The VPG algorithm is isomorphic to the PG algorithm: the integers counting “pebbles” placed on vertices or edges in the PG map to real numbers representing the probability of finding a pebble. In the VPG, edges are assigned pebble capacities, and pebble movements become a continuous flow of probability within the network. Comparisons between the VPG and averaged PG results over a test set of proteins and disordered lattices demonstrate that the VPG estimates the ensemble-averaged PG results well.
Conclusions:
The VPG runs about 20% faster than a single PG, and it provides a pragmatic alternative to averaging PG rigidity characteristics over an ensemble of constraint topologies. Its utility falls between the most accurate but slowest approach of ensemble averaging over hundreds to thousands of independent PG runs and the fastest but least accurate MCC.
http://www.almob.org/content/10/1/11
Luis González, Hui Wang, Dennis Livesay, Donald Jacobs. Algorithms for Molecular Biology 2015, 10:11 (2015-03-18). doi:10.1186/s13015-015-0039-3

A simple data-adaptive probabilistic variant calling model
Background:
Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments.
Results:
We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates and used to estimate empirical log-likelihoods. The likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background.
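A toy sketch of the scoring idea, reduced to a single characteristic (the per-site mismatch rate); the names, the smoothing, and the one-feature reduction are invented for illustration and are not the authors' exact model.

```python
# Empirical-null scoring sketch: tabulate a characteristic at putatively
# non-variant (low-mismatch) sites, then score each site by how extreme it
# is under that empirical null.
import math

def fit_null(mismatch_rates):
    """Empirical null from rates observed at low-mismatch sites."""
    return sorted(mismatch_rates)

def site_score(rate, null_rates):
    """-log10 empirical tail probability of `rate` under the null;
    higher scores indicate potential variants."""
    exceed = sum(1 for r in null_rates if r >= rate)
    p = (exceed + 1) / (len(null_rates) + 1)   # add-one smoothing
    return -math.log10(p)
```

A decision threshold would then be chosen from the mixture shape of the resulting score distribution.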
Conclusions:
In simulations we show that our simple model is competitive, in terms of sensitivity and specificity, with frequently used and much more complex SNV calling algorithms. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.
http://www.almob.org/content/10/1/10
Steve Hoffmann, Peter Stadler, Korbinian Strimmer. Algorithms for Molecular Biology 2015, 10:10 (2015-03-04). doi:10.1186/s13015-015-0037-5

Protein docking with predicted constraints
This paper presents a constraint-based method for improving protein docking results. Efficient constraint propagation cuts over 95% of the search time for finding the configurations with the largest contact surface, provided a contact is specified between two amino acid residues. This makes it possible to scan a large number of potentially correct constraints, lowering the requirements for useful contact predictions. While other approaches are very dependent on accurate contact predictions, ours requires only that at least one correct contact be retained in a set of, for example, one hundred constraints to test. It is this feature that makes it feasible to use readily available sequence data to predict specific potential contacts. Although such prediction is too inaccurate for most purposes, we demonstrate with a Naïve Bayes Classifier that it is accurate enough to more than double the average number of acceptable models retained during the crucial filtering stage of protein docking when combined with our constrained docking algorithm. All software developed in this work is freely available as part of the Open Chemera Library.
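The core filtering idea can be sketched as follows (a hedged illustration with an invented pose representation, not the Open Chemera implementation): given a predicted contact between two residues, keep only poses that place those residues within a distance cutoff.

```python
# Contact-constraint filter over candidate docking poses; each pose maps a
# residue id to one representative coordinate. In the actual method,
# constraint propagation prunes configurations before they are generated.
import math

def satisfies_contact(pose, res_a, res_b, cutoff=5.0):
    """pose: {residue_id: (x, y, z)}; True if the residues are in contact."""
    return math.dist(pose[res_a], pose[res_b]) <= cutoff

def filter_poses(poses, res_a, res_b, cutoff=5.0):
    """Keep only the poses satisfying the predicted contact constraint."""
    return [p for p in poses if satisfies_contact(p, res_a, res_b, cutoff)]
```

With, say, one hundred candidate constraints, only one needs to be correct for the right models to survive some filter in the set.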
http://www.almob.org/content/10/1/9
Ludwig Krippahl, Pedro Barahona. Algorithms for Molecular Biology 2015, 10:9 (2015-02-20). doi:10.1186/s13015-015-0036-6

MoDock: A multi-objective strategy improves the accuracy for molecular docking
Background:
Molecular docking is the most widely used method of structure-based virtual screening in practice. However, the non-ideal efficacy of scoring functions is considered the biggest barrier to improving the molecular docking method.
Results:
A new multi-objective strategy for molecular docking, named MoDock, is presented to further improve docking accuracy with available scoring functions. Instead of simply combining multiple objectives with fixed weight factors, an aggregate function is adopted to approximate the real solution of the original multi-objective and multi-constraint problem, which simultaneously smooths the energy surface of the combined scoring functions. The method of centers and a genetic algorithm are then used to find the optimal solution. Tests of MoDock against the GOLD test data set reveal that the multi-objective strategy improves the docking accuracy over the individual scoring functions. Moreover, a 70% rate of good docking solutions (RMSD below 1.0 Å) outperforms six other commonly used docking programs, even including a flexible-receptor docking program.
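One standard choice of aggregate function with the smoothing property described above is the Kreisselmeier-Steinhauser (KS) function, a smooth approximation of the maximum of several objectives. Whether MoDock uses this exact form is not stated in the abstract, so the sketch below only illustrates the general idea of replacing fixed-weight combination by a smooth aggregate.

```python
# Kreisselmeier-Steinhauser aggregate: a smooth upper approximation of
# max(scores); larger rho hugs the max more tightly.
import math

def ks_aggregate(scores, rho=10.0):
    """Smoothly aggregate several objective values into one."""
    m = max(scores)   # shift by the max for numerical stability
    return m + math.log(sum(math.exp(rho * (s - m)) for s in scores)) / rho
```

For energy minimization one would aggregate the negated scoring functions, since KS approximates a maximum.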
Conclusions:
The results show that MoDock effectively overcomes the deviations introduced by individual scoring functions and improves the predictive power of molecular docking.
http://www.almob.org/content/10/1/8
Junfeng Gu, Xu Yang, Ling Kang, Jinying Wu, Xicheng Wang. Algorithms for Molecular Biology 2015, 10:8 (2015-02-18). doi:10.1186/s13015-015-0034-8