<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/rss.css" type="text/css"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/"
    xmlns:cc="http://web.resource.org/cc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:extra="http://www.w3.org/1999/xhtml"
    xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <channel rdf:about="http://www.almob.org/feeds/latestarticles/journal?quantity=&amp;format=rss&amp;version=">
        <title>Algorithms for Molecular Biology - Latest Articles</title>
        <link>http://www.almob.org</link>
        <description>The latest research articles published by Algorithms for Molecular Biology</description>
        <dc:date>2012-05-15T00:00:00Z</dc:date>
        <items>
            <rdf:Seq>
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/13" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/12" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/11" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/10" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/9" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/8" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/7" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/6" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/5" />
                                <rdf:li rdf:resource="http://www.almob.org/content/7/1/4" />
                            </rdf:Seq>
        </items>
                 <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </channel>
        <item rdf:about="http://www.almob.org/content/7/1/13">
        <title>Tree-average distances on certain phylogenetic networks have their weights uniquely determined</title>
        <description>A phylogenetic network N has vertices corresponding to species and arcs corresponding to direct genetic inheritance from the species at the tail to the species at the head. Measurements of DNA are often made on species in the leaf set, and one seeks to infer properties of the network, possibly including the graph itself. In the case of phylogenetic trees, distances between extant species are frequently used to infer the phylogenetic trees by methods such as neighbor-joining. This paper proposes a &quot;tree-average&quot; distance for networks more general than trees. The notion requires a &quot;weight&quot; on each arc measuring the genetic change along the arc. For each displayed tree the distance between two leaves is the sum of the weights along the path joining them. At a hybrid vertex, each character is inherited from one of its parents. We will assume that for each hybrid there is a probability that the inheritance of a character is from a specified parent. Assume that the inheritance events at different hybrids are independent. Then for each displayed tree there will be a probability that the inheritance of a given character follows the tree; this probability may be interpreted as the probability of the tree. The &quot;tree-average&quot; distance between the leaves is defined to be the expected value of their distance in the displayed trees. For a class of rooted networks that includes rooted trees, it is shown that the weights and the probabilities at each hybrid vertex can be calculated given the network and the tree-average distances between the leaves. Hence these weights and probabilities are uniquely determined. The hypotheses on the networks include that hybrid vertices have indegree exactly 2 and that vertices that are not leaves have a tree-child.</description>
        <link>http://www.almob.org/content/7/1/13</link>
                <dc:creator>Stephen Willson</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:13</dc:source>
        <dc:date>2012-05-15T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-13</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-13-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>13</prism:startingPage>
        <prism:publicationDate>2012-05-15T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/12">
        <title>Fractal MapReduce decomposition of sequence alignment</title>
        <description>The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, position the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required. Results - In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, which has been found to provide an &quot;alignment-free&quot; solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, with no resort to dynamic programming. Conclusions - The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser&apos;s emergence as an environment for high performance distributed computing.</description>
        <link>http://www.almob.org/content/7/1/12</link>
                <dc:creator>Jonas Almeida</dc:creator>
                <dc:creator>Alexander Gruneberg</dc:creator>
                <dc:creator>Wolfgang Maass</dc:creator>
                <dc:creator>Susana Vinga</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:12</dc:source>
        <dc:date>2012-05-02T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-12</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-12-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>12</prism:startingPage>
        <prism:publicationDate>2012-05-02T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/11">
        <title>Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations</title>
        <description>Background:
Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible.
Results:
We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space-efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension - UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS.
Conclusions:
Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.</description>
        <link>http://www.almob.org/content/7/1/11</link>
                <dc:creator>Tapio Pahikkala</dc:creator>
                <dc:creator>Sebastian Okser</dc:creator>
                <dc:creator>Antti Airola</dc:creator>
                <dc:creator>Tapio Salakoski</dc:creator>
                <dc:creator>Tero Aittokallio</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:11</dc:source>
        <dc:date>2012-05-02T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-11</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-11-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>11</prism:startingPage>
        <prism:publicationDate>2012-05-02T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/10">
        <title>Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis</title>
        <description>Background:
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
Results:
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
Conclusions:
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.</description>
        <link>http://www.almob.org/content/7/1/10</link>
                <dc:creator>Susana Vinga</dc:creator>
                <dc:creator>Alexandra Carvalho</dc:creator>
                <dc:creator>Alexandre Francisco</dc:creator>
                <dc:creator>Luis Russo</dc:creator>
                <dc:creator>Jonas Almeida</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:10</dc:source>
        <dc:date>2012-05-02T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-10</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-10-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>10</prism:startingPage>
        <prism:publicationDate>2012-05-02T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/9">
        <title>Maximum Parsimony on Phylogenetic Networks</title>
        <description>Background:
Phylogenetic networks are generalizations of phylogenetic trees that are used to model evolutionary events in various contexts. Several different methods and criteria have been introduced for reconstructing phylogenetic trees. Maximum Parsimony is a character-based approach that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data assigned on the leaves. Exact solutions for optimizing parsimony scores on phylogenetic trees have been introduced in the past.
Results:
In this paper, we define the parsimony score on networks as the sum of the substitution costs along all the edges of the network, and show that certain well-known algorithms that calculate the optimum parsimony score on trees, such as the Sankoff and Fitch algorithms, extend naturally to networks, barring conflicting assignments at the reticulate vertices. We provide heuristics for finding the optimum parsimony scores on networks. Our algorithms can be applied for any cost matrix that may contain unequal substitution costs of transforming between different characters along different edges of the network. We analyzed this for experimental data on 10 leaves or fewer with at most 2 reticulations and found that for almost all networks, the bounds returned by the heuristics matched the exhaustively determined optimum parsimony scores.
Conclusion:
The parsimony score we define here does not directly reflect the cost of the best tree in the network that displays the evolution of the character. However, when searching for the most parsimonious network that describes a collection of characters, it becomes necessary to add additional cost considerations to prefer simpler structures, such as trees over networks. The parsimony score on a network that we describe here takes into account the substitution costs along the additional edges incident on each reticulate vertex, in addition to the substitution costs along the other edges which are common to all the branching patterns introduced by the reticulate vertices. Thus the score contains an in-built cost for the number of reticulate vertices in the network, and would provide a criterion that is comparable among all networks. Although the problem of finding the parsimony score on the network is believed to be computationally hard to solve, heuristics such as the ones described here would be beneficial in our efforts to find a most parsimonious network.</description>
        <link>http://www.almob.org/content/7/1/9</link>
                <dc:creator>Lavanya Kannan</dc:creator>
                <dc:creator>Ward Wheeler</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:9</dc:source>
        <dc:date>2012-05-02T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-9</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-9-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>9</prism:startingPage>
        <prism:publicationDate>2012-05-02T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/8">
        <title>Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots</title>
        <description>Background:
Although taxonomy is often used informally to evaluate the results of phylogenetic inference and find the root of phylogenetic trees, algorithmic methods to do so are lacking.
Results:
In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a &quot;subcoloring&quot; problem to express the difference between a taxonomy and a phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. We also develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy.
Conclusions:
The algorithms in this paper, and the associated freely available software, will help biologists better use and understand taxonomically labeled phylogenetic trees.</description>
        <link>http://www.almob.org/content/7/1/8</link>
                <dc:creator>Frederick Matsen</dc:creator>
                <dc:creator>Aaron Gallagher</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:8</dc:source>
        <dc:date>2012-05-02T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-8</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-8-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>8</prism:startingPage>
        <prism:publicationDate>2012-05-02T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/7">
        <title>A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree</title>
        <description>Background:
The ancestries of genes form gene trees which do not necessarily have the same topology as the species tree due to incomplete lineage sorting. Available algorithms determining the probability of a gene tree given a species tree require exponential computational runtime.
Results:
In this paper, we provide a polynomial time algorithm to calculate the probability of a ranked gene tree topology for a given species tree, where a ranked tree topology is a tree topology with the internal vertices being ordered. The probability of a gene tree topology can thus be calculated in polynomial time if the number of orderings of the internal vertices is a polynomial number. However, the complexity of calculating the probability of a gene tree topology with an exponential number of rankings for a given species tree remains unknown.
Conclusions:
Polynomial algorithms for calculating ranked gene tree probabilities may become useful in developing methodology to infer species trees based on a collection of gene trees, leading to a more accurate reconstruction of ancestral species relationships.</description>
        <link>http://www.almob.org/content/7/1/7</link>
                <dc:creator>Tanja Stadler</dc:creator>
                <dc:creator>James Degnan</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:7</dc:source>
        <dc:date>2012-04-30T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-7</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-7-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>7</prism:startingPage>
        <prism:publicationDate>2012-04-30T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>PDF</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/6">
        <title>Computing evolutionary distinctiveness indices in large scale analysis</title>
        <description>We present optimal linear time algorithms for computing the Shapley values and &apos;heightened evolutionary distinctiveness&apos; (HED) scores for the set of taxa in a phylogenetic tree. We demonstrate the efficiency of these new algorithms by applying them to a set of 10,000 reasonable 5139-species mammal trees. This is the first time these indices have been computed on such a large taxon and we contrast our finding with an ad-hoc index for mammals, fair proportion (FP), used by the Zoological Society of London&apos;s EDGE programme. Our empirical results follow expectations. In particular, the Shapley values are very strongly correlated with the FP scores, but provide a higher weight to the few monotremes that comprise the sister to all other mammals. We also find that the HED score, which measures a species&apos; unique contribution to future subsets as a function of the probability that close relatives will go extinct, is very sensitive to the estimated probabilities. When they are low, HED scores are less than FP scores, and approach the simple measure of a species&apos; age. Deviations (like the Solenodon genus of the West Indies) occur when sister species are both at high risk of extinction and their clade roots deep in the tree. Conversely, when endangered species have higher probabilities of being lost, HED scores can be greater than FP scores and species like the African elephant Loxodonta africana, the two solenodons and the thumbless bat Furipterus horrens can move up the rankings. We suggest that conservation attention be applied to such species that carry genetic responsibility for imperiled close relatives. We also briefly discuss extensions of Shapley values and HED scores that are possible with the algorithms presented here.</description>
        <link>http://www.almob.org/content/7/1/6</link>
                <dc:creator>Iain Martyn</dc:creator>
                <dc:creator>Tyler Kuhn</dc:creator>
                <dc:creator>Arne Mooers</dc:creator>
                <dc:creator>Vincent Moulton</dc:creator>
                <dc:creator>Andreas Spillner</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:6</dc:source>
        <dc:date>2012-04-13T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-6</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-6-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>6</prism:startingPage>
        <prism:publicationDate>2012-04-13T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>XML</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/5">
        <title>A normalization strategy for comparing tag count data</title>
        <description>Background:
High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses, enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for a more accurate subsequent analysis of differential gene expression. Development of a more robust normalization method is desirable for identifying the true difference in tag count data.
Results:
We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq) with their default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC) as a measure for both sensitivity and specificity. We found that packages using our strategy in the data normalization step overall performed well. This result was also observed for a real experimental dataset.
Conclusion:
Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can be widely applied to other types of tag count data and to microarray data.</description>
        <link>http://www.almob.org/content/7/1/5</link>
                <dc:creator>Koji Kadota</dc:creator>
                <dc:creator>Tomoaki Nishiyama</dc:creator>
                <dc:creator>Kentaro Shimizu</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:5</dc:source>
        <dc:date>2012-04-05T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-5</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-5-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>5</prism:startingPage>
        <prism:publicationDate>2012-04-05T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>XML</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <item rdf:about="http://www.almob.org/content/7/1/4">
        <title>TS-AMIR: A Topology String Alignment Method for Intensive Rapid Protein Structure Comparison</title>
        <description>Background:
In structural biology, similarity analysis of protein structure is a crucial step in studying the relationship between proteins. Despite the considerable number of techniques that have been explored within the past two decades, the development of new alternative methods is still an active research area due to the need for high performance tools.
Results:
In this paper, we present TS-AMIR, a Topology String Alignment Method for Intensive Rapid comparison of protein structures. The proposed method works in two stages: in the first stage, the method generates a topology string based on the geometric details of secondary structure elements, and then utilizes an n-gram modelling technique over entropy concept to capture similarities in these strings. This initial correspondence map between secondary structure elements is submitted to the second stage in order to obtain the alignment at the residue level. Applying the Kabsch method, a heuristic step-by-step algorithm is adopted in the second stage to align the residues, resulting in an optimal rotation matrix and minimized RMSD. The performance of the method was assessed in different information retrieval tests and the results were compared with those of CE and TM-align, representing two geometrical tools, and YAKUSA, 3D-BLAST and SARST as three representatives of linear encoding schemes. It is shown that the method obtains a high running speed similar to that of the linear encoding schemes. In addition, the method runs about 800 and 7200 times faster than TM-align and CE respectively, while maintaining accuracy competitive with TM-align and CE.
Conclusions:
The experimental results demonstrate that linear encoding techniques are capable of reaching the same high degree of accuracy as that achieved by geometrical methods, while generally running hundreds of times faster than conventional programs.</description>
        <link>http://www.almob.org/content/7/1/4</link>
                <dc:creator>Jafar Razmara</dc:creator>
                <dc:creator>Safaai Deris</dc:creator>
                <dc:creator>Sepideh Parvizpour</dc:creator>
                <dc:source>Algorithms for Molecular Biology 2012, 7:4</dc:source>
        <dc:date>2012-02-15T00:00:00Z</dc:date>
        <dc:identifier>doi:10.1186/1748-7188-7-4</dc:identifier>
                                <prism:require>/content/figures/1748-7188-7-4-toc.gif</prism:require>
                <prism:publicationName>Algorithms for Molecular Biology</prism:publicationName>
        <prism:issn>1748-7188</prism:issn>
        <prism:volume>7</prism:volume>
        <prism:startingPage>4</prism:startingPage>
        <prism:publicationDate>2012-02-15T00:00:00Z</prism:publicationDate>
                <prism:versionidentifier>XML</prism:versionidentifier>
                <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
    </item>
        <cc:License rdf:about="http://creativecommons.org/licenses/by/2.0/">
        <cc:permits rdf:resource="http://creativecommons.org/ns#Reproduction" />
        <cc:permits rdf:resource="http://creativecommons.org/ns#Distribution" />
        <cc:permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks" />
    </cc:License>
</rdf:RDF>

