<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1748-7188-1-6</ui>
   <ji>1748-7188</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Multiple sequence alignment with user-defined anchor points</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Morgenstern</snm>
               <fnm>Burkhard</fnm>
               <insr iid="I1"/>
               <email>burkhard@gobics.de</email>
            </au>
            <au id="A2">
               <snm>Prohaska</snm>
               <mi>J</mi>
               <fnm>Sonja</fnm>
               <insr iid="I2"/>
               <email>sonja@bioinf.uni-leipzig.de</email>
            </au>
            <au id="A3">
               <snm>P&#246;hler</snm>
               <fnm>Dirk</fnm>
               <insr iid="I1"/>
               <email>dpoehler@math.uni-goettingen.de</email>
            </au>
            <au id="A4">
               <snm>Stadler</snm>
               <mi>F</mi>
               <fnm>Peter</fnm>
               <insr iid="I2"/>
               <email>Peter.Stadler@bioinf.uni-leipzig.de</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Universit&#228;t G&#246;ttingen, Institut f&#252;r Mikrobiologie und Genetik, Abteilung f&#252;r Bioinformatik, Goldschmidtstrasse 1, D-37077 G&#246;ttingen, Germany</p>
            </ins>
            <ins id="I2">
               <p>Universit&#228;t Leipzig, Institut f&#252;r Informatik und Interdisziplin&#228;res Zentrum f&#252;r Bioinformatik, Kreuzstrasse 7b, D-04103 Leipzig, Germany</p>
            </ins>
         </insg>
         <source>Algorithms for Molecular Biology</source>
         <issn>1748-7188</issn>
         <pubdate>2006</pubdate>
         <volume>1</volume>
         <issue>1</issue>
         <fpage>6</fpage>
         <url>http://www.almob.org/content/1/1/6</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">16722533</pubid>
               <pubid idtype="doi">10.1186/1748-7188-1-6</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>15</day>
               <month>2</month>
               <year>2006</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>19</day>
               <month>4</month>
               <year>2006</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>19</day>
               <month>4</month>
               <year>2006</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2006</year>
         <collab>Morgenstern et al; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Automated software tools for multiple alignment often fail to produce biologically meaningful results. In such situations, expert knowledge can help to improve the quality of alignments.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Herein, we describe a <it>semi-automatic </it>version of the alignment program <it>DIALIGN </it>that can take pre-defined constraints into account. It is possible for the user to specify parts of the sequences that are assumed to be homologous and should therefore be aligned to each other. Our software program can use these sites as <it>anchor points </it>by creating a multiple alignment respecting these constraints. This way, our alignment method can produce alignments that are biologically more meaningful than alignments produced by fully automated procedures. As a demonstration of how our method works, we apply our approach to genomic sequences around the <it>Hox </it>gene cluster and to a set of DNA-binding proteins. As a by-product, we obtain insights about the performance of the <it>greedy </it>algorithm that our program uses for multiple alignment and about the underlying objective function. This information will be useful for the further development of DIALIGN. The described alignment approach has been integrated into the TRACKER software system.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Multiple sequence alignment is a crucial prerequisite for biological sequence data analysis, and a large number of multi-alignment programs have been developed during the last twenty years. Standard methods for multiple DNA or protein alignment are, for example, <it>CLUSTAL W </it><abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, <it>DIALIGN </it><abbrgrp><abbr bid="B2">2</abbr></abbrgrp> and <it>T-COFFEE </it><abbrgrp><abbr bid="B3">3</abbr></abbrgrp>; an overview of these tools and other established methods is given in <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Recently, some new alignment approaches have been developed such as <it>POA </it><abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, <it>MUSCLE </it><abbrgrp><abbr bid="B6">6</abbr></abbrgrp> or <it>PROBCONS </it><abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. These programs are often superior to previously developed methods in terms of alignment quality and computational costs. The performance of multi-alignment tools has been studied extensively using various sets of real and simulated benchmark data <abbrgrp><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr></abbrgrp>.</p>
         <p>All of the above mentioned alignment methods are fully <it>automated</it>, i.e., they construct alignments following a fixed set of algorithmic rules. Most methods use a well-defined <it>objective function </it>that assigns a numerical quality score to every possible output alignment of an input sequence set and try to find an optimal or near-optimal alignment according to this objective function. In this process, a number of program parameters such as gap penalties can be adjusted. While the overall influence of these parameters is quite obvious, there is usually no <it>direct </it>way of influencing the outcome of an alignment program.</p>
         <p>Automated alignment methods are clearly necessary and useful where large amounts of data are to be processed or in situations where no additional expert information is available. However, if a researcher is familiar with a specific sequence family under study, he or she may already know certain parts of the sequences that are functionally, structurally or phylogenetically related and should therefore be aligned to each other. In situations where automated programs <it>fail </it>to align these regions correctly, it is desirable to have an alignment method that would accept such user-defined homology information and would then align the remainder of the sequences automatically, respecting these user-specified <it>constraints</it>.</p>
         <p>The interactive program <it>MACAW </it><abbrgrp><abbr bid="B11">11</abbr></abbrgrp> can be used for semi-automatic alignment with user-defined constraints; similarly the program <it>OWEN </it><abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> accepts anchor points for pairwise alignment. Multiple-alignment methods accepting pre-defined constraints have also been proposed by Myers <it>et al</it>. <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> and Sammeth <it>et al</it>. <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>. The multi-alignment program DIALIGN <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr></abbrgrp> has an option that can be used to calculate alignments under user-specified constraints. Originally, this program feature was introduced to reduce the alignment search space and program running time for large genomic sequences <abbrgrp><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>; see also <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. At <it>G&#246;ttingen Bioinformatics Compute Server (GOBICS)</it>, we provide a user-friendly web interface where anchor points can be used to guide the multiple alignment procedure <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Herein, we describe our anchored-alignment approach in detail using a previously introduced set-theoretical alignment concept. We apply our method to genomic sequences of the <it>Hox </it>gene clusters. For these sequences, the default version of DIALIGN produces serious mis-alignments where entire genes are incorrectly aligned, but meaningful alignments can be obtained if the known gene boundaries are used as anchor points.</p>
         <p>In addition, our anchoring procedure can be used to obtain information for the further development of alignment algorithms. To improve the performance of automatic alignment methods, it is important to know what exactly goes wrong in those situations where these methods fail to produce biologically reasonable alignments. In principle, there are two possible reasons for failures of alignment programs. It is possible that the underlying <it>objective function </it>is 'wrong' by assigning high numerical scores to biologically meaningless alignments. But it is also possible that the objective function is 'correct' &#8211; i.e. biologically correct alignments have numerically optimal scores &#8211; and the employed heuristic <it>optimisation algorithm </it>fails to return mathematically optimal or near-optimal alignments. The anchoring approach that we implemented can help to find out which component of our alignment program is to blame if automatically produced alignments are biologically incorrect.</p>
         <p>One result of our study is that anchor points can not only improve the <it>biological </it>quality of the output alignments but can in certain situations lead to alignments with significantly higher <it>numerical </it>scores. This demonstrates that the heuristic optimisation procedure used in DIALIGN may produce output alignments with scores far below the optimum for the respective data set. The latter result has important consequences for the further development of our alignment approach: it seems worthwhile to develop more efficient algorithms for the optimisation problem that arises in the context of the DIALIGN algorithm. In other situations, the numerical scores of biologically correct alignments turned out to be below the scores of biologically wrong alignments returned by the non-anchored version of our program. Here, improved optimisation algorithms will not lead to biologically more meaningful alignments. It is therefore also promising to develop improved objective functions for our alignment approach.</p>
      </sec>
      <sec>
         <st>
            <p>Alignment of tandem duplications</p>
         </st>
         <p>There are many situations where automated alignment procedures can produce biologically incorrect alignments. An obvious challenge is posed by <it>distantly </it>related input sequences where homologies at the primary sequence level may be obscured by spurious random similarities. Another notorious challenge for alignment programs is posed by <it>duplications </it>within the input sequences. Here, <it>tandem duplications </it>are particularly hard to align, see e.g. <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. Specialised software tools have been developed to cope with the problems caused by sequence duplications <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. For the segment-based alignment program DIALIGN, the situation is as follows. As described in previous publications, the program constructs pairwise and multiple alignments from pairwise local sequence similarities, so-called <it>fragment alignments or fragments </it><abbrgrp><abbr bid="B17">17</abbr><abbr bid="B16">16</abbr></abbrgrp>. A fragment is defined as an un-gapped pair of equal-length segments from two of the input sequences. Based on statistical considerations, the program assigns a <it>weight score </it>to each possible fragment and tries to find a consistent collection of fragments with maximum total score. For pairwise alignment, a <it>chain </it>of fragments with maximum score can be identified <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. For multiple sequence sets, all possible pairwise alignments are performed and fragments contained in these pairwise alignments are integrated <it>greedily </it>into a resulting multiple alignment.</p>
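         <p>The pairwise chaining step can be illustrated with a minimal sketch. The following Python code is a simple quadratic dynamic programme over a given list of weighted fragments; it is not the space-efficient algorithm of <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> and it does not compute the weight scores themselves. The names <it>Fragment </it>and <it>best_chain </it>are hypothetical and are not part of DIALIGN.</p>
         <p><![CDATA[
```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Un-gapped pair of equal-length segments (hypothetical record)."""
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # common length of both segments
    weight: float  # weight score, assumed to be given


def best_chain(fragments):
    """Return (total weight, chain) of a maximum-weight chain of fragments.

    A fragment f may precede g in a chain if f ends before g starts
    in both sequences.
    """
    frags = sorted(fragments, key=lambda f: (f.start1, f.start2))
    if not frags:
        return 0.0, []
    best = [f.weight for f in frags]   # best chain weight ending at each fragment
    prev = [None] * len(frags)         # predecessor index for traceback
    for j, g in enumerate(frags):
        for i in range(j):
            f = frags[i]
            fits = (f.start1 + f.length <= g.start1
                    and f.start2 + f.length <= g.start2)
            if fits and best[i] + g.weight > best[j]:
                best[j] = best[i] + g.weight
                prev[j] = i
    j = max(range(len(frags)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j is not None:
        chain.append(frags[j])
        j = prev[j]
    return total, chain[::-1]


# Two motif instances in each sequence; the "crossed" fragment (10, 45)
# has the highest individual weight but cannot be chained with anything.
example = [
    Fragment(10, 12, 8, 5.0),
    Fragment(40, 45, 8, 5.0),
    Fragment(10, 45, 8, 6.0),
    Fragment(40, 12, 8, 5.5),
]
total, chain = best_chain(example)
```
]]></p>
         <p>In this toy input, the chain consisting of the fragments starting at (10, 12) and (40, 45) has total weight 10.0 and is preferred over the single fragment with weight 6.0, which is the behaviour that distinguishes a chaining approach from a strictly greedy one.</p>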
         <p>As indicated in Figure <figr fid="F1">1</figr>, tandem duplications can create various problems for the above outlined alignment approach. In the following, we discuss two simple examples where duplications can confuse the segment-based alignment algorithm. Let us consider a motif that is duplicated in one or several of the input sequences <it>S</it><sub>1</sub>,..., <it>S</it><sub><it>k</it></sub>. For simplicity, let us assume that our sequences do not share any significant similarity outside the motif. Moreover, we assume that the degree of similarity among all instances of the motif is roughly comparable. There are no difficulties if two sequences are to be aligned and the motif is duplicated in <it>both </it>sequences, i.e. if one has instances <graphic file="1748-7188-1-6-i1.gif"/> and <graphic file="1748-7188-1-6-i2.gif"/> of the motif in sequence <it>S</it><sub>1 </sub>and instances <graphic file="1748-7188-1-6-i3.gif"/> and <graphic file="1748-7188-1-6-i4.gif"/> of the same motif in sequence <it>S</it><sub>2</sub> as in Figure <figr fid="F1">1 (A)</figr>. In such a situation, our alignment approach will correctly align <graphic file="1748-7188-1-6-i1.gif"/> to <graphic file="1748-7188-1-6-i3.gif"/> and <graphic file="1748-7188-1-6-i2.gif"/> to <graphic file="1748-7188-1-6-i4.gif"/> since, for pairwise alignment, our algorithm returns a <it>chain </it>of fragments with maximum <it>total </it>score.</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Possible mis-alignments caused by tandem duplications in the segment-based alignment approach (DIALIGN)</p>
            </caption>
            <text>
               <p>Possible mis-alignments caused by tandem duplications in the segment-based alignment approach (DIALIGN). We assume that various instances of a motif are contained in the input sequence set and that the degree of similarity among the different instances is approximately equal. For simplicity, we also assume that the sequences do not share any similarity outside the conserved motif. Lines connecting the sequences denote fragments identified by DIALIGN in the respective pairwise alignment procedures. (<it>A</it>) If a tandem duplication occurs in two sequences, the correct alignment will be found since the algorithm identifies a <it>chain </it>of local alignments with maximum <it>total </it>score. (<it>B</it>) If a motif is duplicated in one sequence but only one instance <it>M</it><sub>2 </sub>is contained in the second sequence, it may happen that <it>M</it><sub>2 </sub>is split up and aligned to different instances of the motif in the first sequence. (<it>C</it>) If the motif is duplicated in the first sequence but only one instance of it is contained in sequences two and three, respectively, <it>consistency </it>conflicts can occur. In this case, local similarities identified in the respective pairwise alignments cannot be integrated into one single output alignment. To select a consistent subset of these pairwise similarities, DIALIGN uses a <it>greedy </it>heuristic. Depending on the degree of similarity among the instances of the motif, the greedy approach may lead to serious mis-alignments (<it>D</it>).</p>
            </text>
            <graphic file="1748-7188-1-6-1"/>
         </fig>
         <p>Note that a strictly greedy algorithm could be confused by this situation and could align, for example, <graphic file="1748-7188-1-6-i1.gif"/> to <graphic file="1748-7188-1-6-i4.gif"/> in Figure <figr fid="F1">1</figr> if the similarity among these two instances of the motif happens to be slightly stronger than the similarity among <graphic file="1748-7188-1-6-i1.gif"/> and <graphic file="1748-7188-1-6-i3.gif"/>, and among <graphic file="1748-7188-1-6-i2.gif"/> and <graphic file="1748-7188-1-6-i4.gif"/>, respectively. However, DIALIGN uses a greedy approach only for <it>multiple </it>alignment where an exact solution is not feasible, but for pairwise alignment, the program returns an <it>optimal </it>alignment with respect to the underlying objective function. Thus, under the above assumptions, a meaningful alignment will be produced even if <graphic file="1748-7188-1-6-i1.gif"/> exhibits stronger similarity to <graphic file="1748-7188-1-6-i4.gif"/> than to <graphic file="1748-7188-1-6-i3.gif"/>.</p>
         <p>The trouble starts if a tandem duplication <graphic file="1748-7188-1-6-i1.gif"/>, <graphic file="1748-7188-1-6-i2.gif"/> occurs in <it>S</it><sub>1 </sub>but only one instance of the motif, <it>M</it><sub>2</sub>, is present in <it>S</it><sub>2</sub>. Here, it can happen that the beginning of <it>M</it><sub>2 </sub>is aligned to the beginning of <graphic file="1748-7188-1-6-i1.gif"/> and the end of <it>M</it><sub>2 </sub>is aligned to the end of <graphic file="1748-7188-1-6-i2.gif"/> as in Figure <figr fid="F1">1 (B)</figr>. DIALIGN is particularly susceptible to this type of error since it does not use gap penalties. The situation is even more problematic for multiple alignment. Consider, for example, the three sequences <it>S</it><sub>1</sub>, <it>S</it><sub>2</sub>, <it>S</it><sub>3 </sub>in Figure <figr fid="F1">1 (C)</figr>, where two instances <graphic file="1748-7188-1-6-i1.gif"/>, <graphic file="1748-7188-1-6-i2.gif"/> of a motif occur in <it>S</it><sub>1 </sub>while <it>S</it><sub>2 </sub>and <it>S</it><sub>3 </sub>each contain only one instance of the motif, <it>M</it><sub>2 </sub>and <it>M</it><sub>3</sub>, respectively. Under the above assumptions, a <it>biologically </it>meaningful alignment of these sequences would certainly align <it>M</it><sub>2 </sub>to <it>M</it><sub>3</sub>, and both motifs would be aligned either to <graphic file="1748-7188-1-6-i1.gif"/> or to <graphic file="1748-7188-1-6-i2.gif"/> &#8211; depending on the degree of similarity of <it>M</it><sub>2 </sub>and <it>M</it><sub>3 </sub>to <graphic file="1748-7188-1-6-i1.gif"/> and <graphic file="1748-7188-1-6-i2.gif"/>, respectively. Note that such an alignment would also receive a high <it>numerical </it>score since it would involve <it>three </it>pairwise alignments of the conserved motif. However, since the pairwise alignments are carried out independently for each sequence pair, it may happen that the first instance of the motif in sequence <it>S</it><sub>1</sub>, <graphic file="1748-7188-1-6-i1.gif"/>, is aligned to <it>M</it><sub>2 </sub>but the second instance, <graphic file="1748-7188-1-6-i2.gif"/>, is aligned to <it>M</it><sub>3 </sub>in the respective pairwise alignments as in Figure <figr fid="F1">1 (C)</figr>. Thus, the correct alignment of <it>M</it><sub>2 </sub>and <it>M</it><sub>3 </sub>will be <it>inconsistent </it>with the first two pairwise alignments. Depending on the degree of similarity among the motifs, alignment of <graphic file="1748-7188-1-6-i2.gif"/> and <it>M</it><sub>3 </sub>may be rejected in the greedy algorithm, so these motifs may not be aligned in the resulting multiple alignment. It is easy to see that the resulting multiple alignment would not only be biologically questionable, but it would also obtain a numerically lower score as it would involve only <it>two </it>pairwise alignments of the motif.</p>
      </sec>
      <sec>
         <st>
            <p>Multiple alignment with user-defined anchor points</p>
         </st>
         <p>To overcome the above mentioned difficulties, and to deal with other situations that cause problems for alignment programs, we implemented a semi-automatic <it>anchored </it>alignment approach. Here, the user can specify an arbitrary number of <it>anchoring points </it>in order to guide the alignment procedure. Each anchor point consists of a pair of equal-length segments of two of the input sequences. An anchor point is therefore characterised by five coordinates: the two <it>sequences </it>involved, the <it>starting positions </it>in these sequences and the <it>length </it>of the anchored segments. As a sixth parameter, our method requires a <it>score </it>that determines the <it>priority </it>of an anchor point. The latter parameter is necessary, since it is in general not meaningful to use <it>all </it>anchors proposed by the user. It is possible that the selected anchor points are <it>inconsistent </it>with each other in the sense that they cannot be included in one single multiple output alignment, see <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> for our concept of consistency. Thus, it may be necessary for the algorithm to select a suitable <it>subset </it>of the proposed anchor points.</p>
         <p>Our software provides two slightly different options for using anchor points. There is a <it>strong </it>anchoring option, where the specified anchor positions are necessarily aligned to each other, consistency provided. The remainder of the sequences is then aligned based on the consistency constraints given by these pre-aligned positions. This option can be used to enforce correct alignment of those parts of the sequences for which additional expert information is available. For example, we are planning to align RNA sequences by using both primary and secondary structure information. Here, locally conserved secondary structures could be used as 'strong' anchor points to make sure that these structures are properly aligned, even if they share no similarity at the primary-structure level.</p>
         <p>In addition, we have a <it>weak </it>anchoring option, where consistent anchor points are only used to constrain the output alignment, but are not necessarily aligned to each other. More precisely, if a position <it>x </it>in sequence <it>S</it><sub><it>i </it></sub>is <it>anchored </it>with a position <it>y </it>in sequence <it>S</it><sub><it>j </it></sub>through one of the anchor points, this means that <it>y </it>is the <it>only </it>position from <it>S</it><sub><it>j </it></sub>that can be aligned to <it>x</it>. Whether or not <it>x </it>and <it>y </it>will actually appear in the same column of the output alignment depends on the degree of local similarity among the sequences around positions <it>x </it>and <it>y</it>. If no statistically significant similarity can be detected, <it>x </it>and <it>y </it>may remain un-aligned. Moreover, anchoring <it>x </it>and <it>y </it>means that positions strictly to the left (or strictly to the right) of <it>x </it>in <it>S</it><sub><it>i </it></sub>can be aligned only to positions strictly to the left (or strictly to the right) of <it>y </it>in <it>S</it><sub><it>j </it></sub>&#8211; and vice versa. Obviously, these relations are <it>transitive</it>, so if position <it>x </it>is anchored with position <it>y</it><sub>1</sub>, <it>y</it><sub>1 </sub>is to the left of another position <it>y</it><sub>2 </sub>in the same sequence, and <it>y</it><sub>2 </sub>in turn, is aligned to a position <it>z</it>, then positions to the left of <it>x </it>can be aligned only to positions to the left of <it>z </it>etc. The 'weak' option may be useful if anchor points are used to reduce the program running time.</p>
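         <p>For a single anchored pair of positions, the constraint described above reduces to a simple test. The following Python sketch covers only this direct, two-sequence case and ignores the transitive consequences across further sequences; the function name <it>may_align </it>is hypothetical.</p>
         <p><![CDATA[
```python
def may_align(p, q, x, y):
    """Check whether position p in S_i may be aligned to position q in S_j,
    given that x in S_i is anchored with y in S_j.

    Sketch for one anchored position pair only; transitivity over
    several anchors or sequences is not handled here.
    """
    if p == x or q == y:
        # An anchored position can only be aligned to its anchor partner.
        return p == x and q == y
    # Otherwise both positions must lie on the same side of the anchor.
    return (p < x) == (q < y)
```
]]></p>
         <p>With <it>x </it>= 10 and <it>y </it>= 20, for example, the pair (3, 4) is permitted since both positions lie to the left of the anchor, whereas the pair (3, 25) is excluded.</p>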
         <p>Algorithmically, strong or weak anchor points are treated by DIALIGN in the same way as <it>fragments </it>( = segment pairs) in the greedy procedure for multi-alignment. By transitivity, a set <it>Anc</it> of anchor points defines a <it>quasi partial order relation </it>&#8804;<sub><it>Anc </it></sub>on the set <it>X </it>of all positions of the input sequences &#8211; in exactly the same way as an alignment <it>Ali</it> induces a quasi partial order relation &#8804;<sub><it>Ali </it></sub>on <it>X </it>as described in <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B25">25</abbr></abbrgrp>. Formally, we consider an alignment <it>Ali</it> as well as a set of anchor points <it>Anc</it> as an <it>equivalence relation </it>defined on the set <it>X </it>of all positions of the input sequences. Next, we consider the partial order relation &#8804; on <it>X </it>that is given by the 'natural' ordering of positions within the sequences. In order-theoretical terms, &#8804; is the <it>direct sum </it>of the <it>linear </it>order relations defined on the individual sequences. The partial order relation &#8804;<sub><it>Anc </it></sub>is then defined as the <it>transitive closure </it>of the union &#8804; &#8746; <it>Anc</it>. In other words, we have <it>x </it>&#8804;<sub><it>Anc </it></sub><it>y </it>if and only if there is a chain <it>x</it><sub>0</sub>, ..., <it>x</it><sub><it>k </it></sub>of positions with <it>x</it><sub>0 </sub>= <it>x </it>and <it>x</it><sub><it>k </it></sub>= <it>y </it>such that for every <it>i </it>&#8712; {1,..., <it>k</it>}, position <it>x</it><sub><it>i</it>-1 </sub>is either anchored with <it>x</it><sub><it>i </it></sub>or <it>x</it><sub><it>i</it>-1 </sub>and <it>x</it><sub><it>i </it></sub>belong to the same sequence, and <it>x</it><sub><it>i</it>-1 </sub>is on the left-hand side of <it>x</it><sub><it>i </it></sub>in that sequence.</p>
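         <p>The chain characterisation of &#8804;<sub><it>Anc </it></sub>can be read as a reachability question on a graph whose nodes are sequence positions. The following Python sketch tests this relation by a depth-first search. It assumes that positions are represented as pairs (sequence index, position) and that anchors are given as a list of anchored position pairs; this representation and the function name <it>precedes </it>are assumptions made for illustration and do not describe the DIALIGN implementation.</p>
         <p><![CDATA[
```python
from collections import defaultdict


def precedes(x, y, anchored_pairs):
    """Return True if x <=_Anc y, i.e. if y can be reached from x by steps
    that either follow an anchor or move to the right within a sequence.

    x, y: positions as (sequence index, position) tuples.
    anchored_pairs: iterable of (position, position) tuples.
    """
    partners = defaultdict(set)
    for a, b in anchored_pairs:
        partners[a].add(b)   # anchoring is symmetric
        partners[b].add(a)
    anchored_in_seq = defaultdict(list)
    for seq, pos in partners:
        anchored_in_seq[seq].append(pos)

    stack, seen = [x], {x}
    while stack:
        seq, pos = stack.pop()
        if seq == y[0] and pos <= y[1]:
            return True      # y lies at or to the right of the current position
        # Possible next chain elements: anchor partners, and anchored
        # positions further to the right in the same sequence.
        steps = set(partners.get((seq, pos), ()))
        steps.update((seq, q) for q in anchored_in_seq.get(seq, ()) if q > pos)
        for node in steps - seen:
            seen.add(node)
            stack.append(node)
    return False
```
]]></p>
         <p>Only anchored positions need to be visited as intermediate chain elements, since a step to the right within a sequence is useful only if it leads to an anchored position or directly to the target.</p>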
         <p>In our set-theoretical setting, a relation <it>R </it>on <it>X </it>is called consistent if all restrictions of the transitive closure of the union &#8804; &#8746; <it>R </it>to the individual sequences <it>coincide </it>with their respective 'natural' linear orderings. With the <it>weak </it>version of our anchored-alignment approach, we are looking for an alignment <it>Ali</it> with maximum score such that the union <it>Ali</it> &#8746; <it>Anc</it> is consistent. With the <it>strong </it>option, we are looking for a maximum-scoring alignment <it>Ali</it> that is a superset of <it>Anc</it>. With both program options, our optimisation problem is to find an alignment <it>Ali</it> with maximum score &#8211; under the additional constraint that the set-theoretical union <it>Ali</it> &#8746; <it>Anc</it> is consistent. In the weak anchoring approach, the output alignment is <it>Ali</it> while with the strong option, the program returns the transitive closure of the union <it>Ali</it> &#8746; <it>Anc</it>.</p>
         <p>The above optimisation problem makes sense only if the set <it>Anc</it> of anchor points is itself consistent. Since a user-defined set of anchor points cannot be expected to be consistent, the first step in our anchoring procedure is to select a consistent <it>subset </it>of the anchor points proposed by the user. To this end, the program uses the same greedy approach that it applies in the optimisation procedure for multiple alignment. That is, each anchor point is associated with some user-defined score, and the program accepts input anchor points in order of decreasing scores &#8211; provided they are consistent with the previously accepted anchors.</p>
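         <p>The greedy selection of a consistent subset of anchor points can be sketched as follows. The Python code below is an illustration under stated assumptions and not the DIALIGN implementation: the record <it>Anchor </it>and the functions <it>consistent </it>and <it>select_anchors </it>are hypothetical names, and consistency is re-checked from scratch for every candidate, which is simple but not efficient. The consistency test merges anchored positions into classes and then checks that the left-to-right order within the sequences induces no cycle among these classes.</p>
         <p><![CDATA[
```python
from collections import defaultdict
from typing import NamedTuple


class Anchor(NamedTuple):
    """Anchor point: two sequences, two start positions, length, score."""
    seq1: int
    seq2: int
    start1: int
    start2: int
    length: int
    score: float


def consistent(anchors):
    """True if the anchors can all be part of one multiple alignment."""
    parent = {}

    def find(node):                       # union-find with path halving
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a in anchors:                     # merge anchored positions
        for k in range(a.length):
            left = find((a.seq1, a.start1 + k))
            parent[left] = find((a.seq2, a.start2 + k))

    by_seq = defaultdict(list)
    for seq, pos in list(parent):
        by_seq[seq].append(pos)
    succ = defaultdict(set)               # order edges between classes
    for seq, positions in by_seq.items():
        positions.sort()
        for lo, hi in zip(positions, positions[1:]):
            succ[find((seq, lo))].add(find((seq, hi)))

    # Kahn's algorithm: the class graph must be acyclic.
    classes = {find(node) for node in list(parent)}
    indegree = {c: 0 for c in classes}
    for u in list(succ):
        for v in succ[u]:
            indegree[v] += 1
    stack = [c for c in classes if indegree[c] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in succ.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                stack.append(v)
    return removed == len(classes)


def select_anchors(proposed):
    """Accept anchors by decreasing score if consistent with earlier ones."""
    accepted = []
    for anchor in sorted(proposed, key=lambda a: a.score, reverse=True):
        if consistent(accepted + [anchor]):
            accepted.append(anchor)
    return accepted
```
]]></p>
         <p>Applied to a situation such as the one in Figure 1 (C), with one anchor between the first motif instance in the first sequence and the motif in the second sequence, a second anchor between the second instance and the motif in the third sequence, and a lower-scoring third anchor between the motifs in the second and third sequences, this sketch accepts the first two anchors and rejects the third.</p>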
         <p>The greedy selection of anchor points makes it possible for the user to <it>prioritise </it>potential anchor points according to arbitrary user-defined criteria. For example, one may use known gene boundaries in genomic sequences to define anchor points as we did in the <it>Hox </it>gene example described below. In addition, one may want to use <it>automatically </it>produced local alignments as anchor points to speed up the alignment procedure as outlined in <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Note that the set of gene boundaries will be necessarily consistent as long as the relative ordering among the genes is conserved. However, the automatically created anchor points may well be <it>inconsistent </it>with those 'biologically defined' anchors or inconsistent with each other. Since anchor points derived from expert knowledge should be more reliable than anchor points identified by some software program, it would make sense to first accept the known gene boundaries as anchors and then to use the automatically created local alignments, under the condition that they are consistent with the known gene boundaries. So in this case, one could use local alignment scores as scores for the <it>automatically </it>created anchor points, while one would assign arbitrarily defined higher scores to the <it>biologically </it>verified gene boundaries.</p>
      </sec>
      <sec>
         <st>
            <p>Applications to <it>Hox </it>gene clusters</p>
         </st>
         <p>As explained above, tandem duplications pose a hard problem for automatic alignment algorithms. Clusters of such paralogous genes are therefore particularly hard to align. As a real-life example we consider here the <it>Hox </it>gene clusters of vertebrates. <it>Hox </it>genes code for homeodomain transcription factors that regulate the anterior/posterior patterning in most bilaterian animals <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>. This group of genes, together with the so-called <it>ParaHox </it>genes, arose early in metazoan history from a single ancestral <it>"UrHox </it>gene" <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. Their early evolution was dominated by a series of tandem duplications. As a consequence, most bilaterians share at least eight distinct types of <it>Hox </it>genes (eight in arthropods, and 13 or 14 in chordates), usually referred to as paralogy classes. These <it>Hox </it>genes are usually organised in tightly linked clusters such that the genes at the 5' end (paralogy groups 9&#8211;13) determine features at the posterior part of the animal while the genes at the 3' end (paralogy groups 1&#8211;3) determine the anterior patterns.</p>
         <p>In contrast to all known invertebrates, all vertebrate lineages investigated so far exhibit multiple copies of <it>Hox </it>clusters that presumably arose through genome duplications in early vertebrate evolution and later in the actinopterygian (ray finned fish) lineage <abbrgrp><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr><abbr bid="B33">33</abbr></abbrgrp>. These duplication events were followed by massive loss of the duplicated genes in different lineages, see e.g. <abbrgrp><abbr bid="B34">34</abbr></abbrgrp> for a recent review on the situation in teleost fishes. The individual <it>Hox </it>clusters of gnathostomes have a length of some 100,000 nt and share, besides a set of homologous genes, a substantial amount of conserved non-coding DNA <abbrgrp><abbr bid="B35">35</abbr></abbrgrp> that predominantly consists of transcription factor binding sites. Most recently, however, some of these "phylogenetic footprints" were identified as microRNAs <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>.</p>
         <p>Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr> show four of the seven <it>Hox </it>clusters of the pufferfish <it>Takifugu rubripes</it>. Despite the fact that the <it>Hox </it>genes within a paralogy group are significantly more similar to each other than to members of other paralogy groups, there are several features that make this dataset particularly difficult and tend to mislead automatic alignment procedures: (1) Neither one of the 13 <it>Hox </it>paralogy groups nor the <it>Evx </it>gene is present in all four sequences. (2) Two genes, <it>HoxC8a </it>and <it>HoxA2a</it>, are present in only a single sequence. (3) The clusters have different sizes and numbers of genes (33481 nt to 125385 nt, 4 to 10 genes).</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>The pufferfish <it>Takifugu rubripes </it>has seven <it>Hox </it>clusters of which we use four in our computational example</p>
            </caption>
            <text>
               <p>The pufferfish <it>Takifugu rubripes </it>has seven <it>Hox </it>clusters of which we use four in our computational example. The <it>Evx </it>gene, another homeodomain transcription factor, is usually linked with the <it>Hox </it>genes and can be considered as part of the <it>Hox </it>cluster. The paralogy groups are indicated. Filled boxes indicate intact <it>Hox </it>genes, the open box indicates a <it>HoxA7a </it>pseudogene [45].</p>
            </text>
            <graphic file="1748-7188-1-6-2"/>
         </fig>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Result of a DIALIGN run on the <it>Hox </it>sequences from Figure 2 without anchoring</p>
            </caption>
            <text>
               <p>Result of a DIALIGN run on the <it>Hox </it>sequences from Figure 2 without anchoring. The diagram represents sequences and gene positions to scale. All incorrectly aligned segments (defined as parts of a gene that are aligned with parts of a gene from a different paralogy group) are indicated by lines between the sequences.</p>
            </text>
            <graphic file="1748-7188-1-6-3"/>
         </fig>
         <p>We observe that without anchoring DIALIGN mis-aligns many of the <it>Hox </it>genes in this example by matching blocks from one <it>Hox </it>gene with parts of a <it>Hox </it>gene from a different paralogy group. As a consequence, genes that should be aligned, such as <it>HoxA10a </it>and <it>HoxD10a</it>, are not aligned with each other.</p>
         <p>Anchoring the alignment, perhaps surprisingly, increases the number of columns that contain aligned sequence positions from 3870 to 4960, i.e., by about 28%, see Table <tblr tid="T2">2</tblr>. At the same time, the CPU time is reduced by a factor of more than 3.</p>
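         <p>As a minimal sketch, the figures quoted above can be re-derived from the rows "none" and "genes" of Table 2. The values are transcribed from that table; the dictionary layout and the helper name <it>to_seconds </it>are our own illustrative choices and not part of DIALIGN.</p>

```python
# Values transcribed from Table 2 (rows "none" and "genes").
# "cols" holds the number of columns with 2, 3 and 4 aligned residues.
rows = {
    "none": {"cols": (2958, 668, 244), "cpu": "4:22:07", "score": 1166},
    "genes": {"cols": (3674, 1091, 195), "cpu": "1:18:12", "score": 1007},
}


def to_seconds(hms):
    """Convert an h:mm:ss string into seconds (illustrative helper)."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# Total number of columns that contain aligned sequence positions.
aligned = {name: sum(row["cols"]) for name, row in rows.items()}

# Relative increase in aligned columns when anchoring at the genes.
increase = (aligned["genes"] - aligned["none"]) / aligned["none"]

# Ratio of CPU times (non-anchored run divided by anchored run).
speedup = to_seconds(rows["none"]["cpu"]) / to_seconds(rows["genes"]["cpu"])

# Relative drop of the DIALIGN score caused by anchoring.
score_drop = 1 - rows["genes"]["score"] / rows["none"]["score"]

print(aligned)
print(f"increase: {100 * increase:.1f}%")
print(f"speed-up: {speedup:.2f}")
print(f"score drop: {100 * score_drop:.1f}%")
```

         <p>With the table values as given, the column sums are 3870 and 4960, the increase is about 28%, the CPU-time ratio is roughly 3.4, and the anchored score is about 13.6% below the non-anchored one.</p>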
         <p>We investigated not only the <it>biological </it>quality of the anchored and non-anchored alignments but also looked at their <it>numerical </it>scores. Note that in DIALIGN, the score of an alignment is defined as the sum of weight scores of the fragments it is composed of <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. For some sequence sets we found that the score of the anchored alignment was above that of the non-anchored alignment, while for other sequences the non-anchored score exceeded the anchored one. For example, with the sequence set shown in Figure <figr fid="F2">2</figr>, the alignment score of the &#8211; biologically more meaningful &#8211; anchored alignment was more than 13% <it>below </it>that of the non-anchored alignment (see Table <tblr tid="T2">2</tblr>). In contrast, another sequence set with five <it>HoxA </it>cluster sequences (TrAa, TnAa, DrAb, TrAb, TnAb) from three teleost fishes (<it>Takifugu rubripes</it>, Tr; <it>Tetraodon nigroviridis</it>, Tn; <it>Danio rerio</it>, Dr) yields an anchored alignment score that is some 15% <it>above </it>the non-anchored score.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Effect of different anchors in the Fugu example of Figure 2. We consider aligned sequence positions in intergenic regions (i.e., <it>outside </it>the coding regions and introns) only. The column "expanding" gives the number of sequence positions for which DIALIGN added at least one additional sequence that was not represented in the original TRACKER footprint. The column "new" lists the total number of nucleotides in footprints that were not detected by TRACKER but were aligned by anchored DIALIGN.</p>
            </caption>
            <tblbdy cols="4">
               <r>
                  <c ca="left">
                     <p>anchor</p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>nt positions in footprints</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="right">
                     <p>total</p>
                  </c>
                  <c ca="right">
                     <p>expanding</p>
                  </c>
                  <c ca="right">
                     <p>new</p>
                  </c>
               </r>
               <r>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>none</p>
                  </c>
                  <c ca="right">
                     <p>1546</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>618</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>genes</p>
                  </c>
                  <c ca="right">
                     <p>1686</p>
                  </c>
                  <c ca="right">
                     <p>39</p>
                  </c>
                  <c ca="right">
                     <p>694</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>genes and BLASTZ hits</p>
                  </c>
                  <c ca="right">
                     <p>2433</p>
                  </c>
                  <c ca="right">
                     <p>39</p>
                  </c>
                  <c ca="right">
                     <p>841</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Aligned sequence positions that result from fragment alignments in the Fugu <it>Hox </it>cluster example. To compare these alignments, we counted the number of columns where two, three or four residues are aligned, respectively. Here, we counted only upper-case residues in the DIALIGN output since lower-case residues are not considered to be aligned by DIALIGN. The number of columns in which two or three residues are aligned increases when more anchors are used, while the number of columns in which all sequences are aligned decreases. This is because in our example no single <it>Hox </it>gene is contained in all four input sequences, see Figure 2. Therefore a biologically correct alignment of these sequences should not contain columns with four residues. CPU times are measured on a PC with two Intel Xeon 2.4 GHz processors and 1 Gbyte of RAM.</p>
            </caption>
            <tblbdy cols="7">
               <r>
                  <c ca="left">
                     <p>anchor</p>
                  </c>
                  <c ca="right">
                     <p>alignment length</p>
                  </c>
                  <c cspan="3" ca="center">
                     <p>aligned sequences</p>
                  </c>
                  <c ca="right">
                     <p>CPU time</p>
                  </c>
                  <c ca="right">
                     <p>score</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="right">
                     <p>2</p>
                  </c>
                  <c ca="right">
                     <p>3</p>
                  </c>
                  <c ca="right">
                     <p>4</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>none</p>
                  </c>
                  <c ca="right">
                     <p>281759</p>
                  </c>
                  <c ca="right">
                     <p>2958</p>
                  </c>
                  <c ca="right">
                     <p>668</p>
                  </c>
                  <c ca="right">
                     <p>244</p>
                  </c>
                  <c ca="right">
                     <p>4:22:07</p>
                  </c>
                  <c ca="right">
                     <p>1166</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>genes</p>
                  </c>
                  <c ca="right">
                     <p>252346</p>
                  </c>
                  <c ca="right">
                     <p>3674</p>
                  </c>
                  <c ca="right">
                     <p>1091</p>
                  </c>
                  <c ca="right">
                     <p>195</p>
                  </c>
                  <c ca="right">
                     <p>1:18:12</p>
                  </c>
                  <c ca="right">
                     <p>1007</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>BLASTZ hits</p>
                  </c>
                  <c ca="right">
                     <p>239326</p>
                  </c>
                  <c ca="right">
                     <p>4036</p>
                  </c>
                  <c ca="right">
                     <p>1139</p>
                  </c>
                  <c ca="right">
                     <p>33</p>
                  </c>
                  <c ca="right">
                     <p>0:19:32</p>
                  </c>
                  <c ca="right">
                     <p>742</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
      </sec>
      <sec>
         <st>
            <p>Anchored protein alignments</p>
         </st>
         <p>BAliBASE is a benchmark database to evaluate the performance of software programs for multiple protein alignment <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. The database consists of a large number of protein families with known 3D structure. These structures are used to define so-called <it>core blocks </it>for which 'biologically correct' alignments are known. There are two scoring systems to evaluate the accuracy of multiple alignments on BAliBASE protein families. The BAliBASE <it>sum-of-pairs </it>score measures the percentage of correctly aligned pairs of amino acid residues within the core blocks. By contrast, the <it>column score </it>measures the percentage of correctly aligned columns in the core blocks, see <abbrgrp><abbr bid="B38">38</abbr><abbr bid="B10">10</abbr></abbrgrp> for more details. These BAliBASE scoring functions are not to be confused with the objective functions used by different alignment algorithms.</p>
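         <p>The two BAliBASE measures described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the official BAliBASE scoring program: alignments are given as lists of equally long gapped strings, gaps are written as "-", and the core blocks are passed as a list of reference column indices. The function names and the toy sequences are hypothetical.</p>

```python
from itertools import combinations


def columns(alignment):
    """Convert gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            if row[j] == "-":
                col.append(None)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(col):
    """All residue pairs ((seq, pos), (seq, pos)) aligned in one column."""
    present = [(i, pos) for i, pos in enumerate(col) if pos is not None]
    return set(combinations(present, 2))


def balibase_like_scores(test, reference, core_columns):
    """Return (sum-of-pairs score, column score) restricted to the core columns.

    Assumes that the core columns contain at least one aligned residue pair.
    """
    all_ref_cols = columns(reference)
    ref_cols = [all_ref_cols[j] for j in core_columns]
    test_cols = columns(test)
    test_col_set = set(test_cols)

    ref_pairs = set().union(*(aligned_pairs(c) for c in ref_cols))
    test_pairs = set().union(*(aligned_pairs(c) for c in test_cols))

    # Fraction of core-block residue pairs that the test alignment reproduces.
    sp = len(ref_pairs.intersection(test_pairs)) / len(ref_pairs)
    # Fraction of core-block columns that the test alignment reproduces exactly.
    cs = sum(c in test_col_set for c in ref_cols) / len(ref_cols)
    return sp, cs


# Toy data (hypothetical): three sequences, core columns 0, 4 and 5.
reference = ["MKV-LA", "MKVQLA", "M-VQLA"]
test = ["MKVL-A", "MKVQLA", "MV-QLA"]
sp_score, column_score = balibase_like_scores(test, reference, [0, 4, 5])
print(f"sum-of-pairs: {sp_score:.3f}, column score: {column_score:.3f}")
```

         <p>In this toy case seven of the nine core-block residue pairs and two of the three core-block columns are recovered, which illustrates why the column score is the stricter of the two measures.</p>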
         <p>Thus, alignment programs can be evaluated by their ability to correctly align these core blocks. BAliBASE covers various alignment situations, e.g. protein families with global similarity or protein families with large internal or terminal insertions or deletions. However, it is important to mention that most sequences in the standard version of BAliBASE are <it>not </it>real-world sequences, but have been artificially truncated by the database authors who simply removed non-homologous C-terminal or N-terminal parts of the sequences. Only the most recent version of BAliBASE provides the original full-length sequence sets together with the previous truncated data. Therefore, most studies based on BAliBASE have a strong bias in favour of <it>global </it>alignment programs such as CLUSTAL W <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>; these programs perform much better on the BAliBASE data than they would perform on realistic full-length protein sequences. The performance of programs that are based on <it>local </it>sequence similarities, on the other hand, is systematically <it>underestimated </it>by BAliBASE. Despite this systematic error, test runs on BAliBASE can give a rough impression of the performance of multiple-alignment programs in different situations.</p>
         <p>DIALIGN has been shown to perform well on those data sets in BAliBASE that contain large insertions and deletions. On the other hand, it is often outperformed by global alignment methods on those data sets where homology extends over the entire sequence length but similarity is low at the primary-sequence level. For the further development and improvement of the program, it is crucial to find out which components of DIALIGN are to blame for the inferiority of the program on this type of sequence families. One possibility is that biologically meaningful alignments on BAliBASE would have high numerical scores, but the greedy heuristic used by DIALIGN is inefficient and returns low-scoring alignments that do not align the core blocks correctly. In this case, one would use more efficient optimisation strategies to improve the performance of DIALIGN on BAliBASE. On the other hand, it is possible that the scoring function used in DIALIGN assigns highest scores to biologically wrong alignments. In this case, an improved optimisation algorithm would not lead to any improvement in the biological quality of the output alignments and it would be necessary to improve the objective function used by the program.</p>
         <p>To find out which component of DIALIGN is to blame for its unsatisfactory performance on some of the BAliBASE data, we applied our program to BAliBASE (<it>a</it>) using the non-anchored default version of the program and (<it>b</it>) using the <it>core blocks </it>as anchor points in order to <it>enforce </it>biologically correct alignments of the sequences. We then compared the numerical DIALIGN scores of the anchored alignments to the non-anchored default alignments. The results of these program runs are summarised in Table <tblr tid="T3">3</tblr>. The numerical alignment scores of the (biologically correct) anchored alignments turned out to be slightly <it>below </it>the scores of the non-anchored default alignments.</p>
         <tbl id="T3">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>DIALIGN alignment scores for anchored and non-anchored alignment of five reference test sets from BAliBASE. As anchor points, we used the so-called <it>core-blocks </it>in BAliBASE, thereby enforcing biologically correct alignments of the input sequences. The figures in the first and second line refer to the sum of DIALIGN alignment scores of all protein families in the respective reference set. Line four contains the number of sequence sets where the anchoring <it>improved </it>the alignment score together with the total number of sequence sets in this reference set. Our test runs show that on these test data, biologically meaningful alignments do not have higher DIALIGN scores than alignments produced by the default version of our program.</p>
            </caption>
            <tblbdy cols="7">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="6" ca="center">
                     <p>Alignment scores</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="right">
                     <p>Ref1</p>
                  </c>
                  <c ca="right">
                     <p>Ref2</p>
                  </c>
                  <c ca="right">
                     <p>Ref3</p>
                  </c>
                  <c ca="right">
                     <p>Ref4</p>
                  </c>
                  <c ca="right">
                     <p>Ref5</p>
                  </c>
                  <c ca="right">
                     <p>Total</p>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>non-anchored</p>
                  </c>
                  <c ca="right">
                     <p>53,613</p>
                  </c>
                  <c ca="right">
                     <p>269,009</p>
                  </c>
                  <c ca="right">
                     <p>283,273</p>
                  </c>
                  <c ca="right">
                     <p>36,515</p>
                  </c>
                  <c ca="right">
                     <p>29,214</p>
                  </c>
                  <c ca="right">
                     <p>671,624</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>anchored</p>
                  </c>
                  <c ca="right">
                     <p>53,417</p>
                  </c>
                  <c ca="right">
                     <p>265,966</p>
                  </c>
                  <c ca="right">
                     <p>283,136</p>
                  </c>
                  <c ca="right">
                     <p>36,611</p>
                  </c>
                  <c ca="right">
                     <p>29,257</p>
                  </c>
                  <c ca="right">
                     <p>668,387</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ratio</p>
                  </c>
                  <c ca="right">
                     <p>0.996</p>
                  </c>
                  <c ca="right">
                     <p>0.988</p>
                  </c>
                  <c ca="right">
                     <p>0.999</p>
                  </c>
                  <c ca="right">
                     <p>1.002</p>
                  </c>
                  <c ca="right">
                     <p>1.001</p>
                  </c>
                  <c ca="right">
                     <p>0.995</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>score improved</p>
                  </c>
                  <c ca="right">
                     <p>23/82</p>
                  </c>
                  <c ca="right">
                     <p>13/23</p>
                  </c>
                  <c ca="right">
                     <p>4/23</p>
                  </c>
                  <c ca="right">
                     <p>6/16</p>
                  </c>
                  <c ca="right">
                     <p>4/12</p>
                  </c>
                  <c ca="right">
                     <p>50/156</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
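         <p>The "ratio" line of Table 3 can be re-derived from the two score lines with the short sketch below. The scores are transcribed from the table; that the printed ratios are truncated (not rounded) to three decimals is our assumption, made because truncation reproduces every tabulated value.</p>

```python
import math

# Summed DIALIGN scores per BAliBASE reference set, transcribed from Table 3.
non_anchored = {"Ref1": 53613, "Ref2": 269009, "Ref3": 283273,
                "Ref4": 36515, "Ref5": 29214}
anchored = {"Ref1": 53417, "Ref2": 265966, "Ref3": 283136,
            "Ref4": 36611, "Ref5": 29257}


def truncated_ratio(numerator, denominator, digits=3):
    """Ratio cut off (not rounded) after the given number of decimals."""
    factor = 10 ** digits
    return math.floor(numerator / denominator * factor) / factor


# Anchored score divided by non-anchored score for each reference set.
ratios = {ref: truncated_ratio(anchored[ref], non_anchored[ref])
          for ref in non_anchored}

# The same ratio for the scores summed over all five reference sets.
total_ratio = truncated_ratio(sum(anchored.values()),
                              sum(non_anchored.values()))

print(ratios)
print(total_ratio)
```

         <p>A ratio below 1 means that enforcing the biologically correct core blocks lowered the summed DIALIGN score for that reference set.</p>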
         <p>As an example, Figure <figr fid="F4">4</figr> shows an alignment calculated by the non-anchored default version of DIALIGN for BAliBASE reference set <it>1r69</it>. This sequence set consists of four DNA-binding proteins and is a challenging alignment example as there is only weak similarity at the primary sequence level. These proteins contain three <it>core blocks </it>for which a reliable multi-alignment is known based on 3D-structure information. As shown in Figure <figr fid="F4">4</figr>, most of the core blocks are misaligned by DIALIGN because of the low level of sequence similarity. With the BAliBASE scoring system for multiple alignments, the default alignment produced by DIALIGN has a <it>sum-of-pairs score </it>of only 33%, i.e. 33% of the amino-acid pairs in the core blocks are correctly aligned. The <it>column score </it>of this alignment is 0%, i.e. not a single column of the core blocks is correctly aligned.</p>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Anchored and non-anchored alignment of a set of protein sequences with known 3D structure (data set 1r69 from BAliBASE [38])</p>
            </caption>
            <text>
               <p>Anchored and non-anchored alignment of a set of protein sequences with known 3D structure (data set 1r69 from BAliBASE [38]). Three <it>core blocks </it>for which the 'correct' alignment is known are shown in red, blue and green. <b>(A) </b>Alignment calculated by DIALIGN with default options. Most of the core blocks are mis-aligned. <b>(B) </b>Alignment calculated by DIALIGN with <it>anchoring </it>option. The first position of the third block has been used as anchor point, i.e. the program has <it>been forced </it>to align this column correctly. The rest of the sequences is automatically aligned by DIALIGN given the constraints defined by this anchor point. Although only one single column has been used for anchoring, the three blocks are almost perfectly aligned.</p>
            </text>
            <graphic file="1748-7188-1-6-4"/>
         </fig>
         <p>We investigated how many anchor points were necessary to enforce a correct alignment of the three core blocks in this test example. As it turned out, it was sufficient to use one single column of the core blocks as anchor points, namely the first column of the third motif. Technically, this can be done by using three anchor points of length one each: one anchor point connecting the first position of this core block in sequence 1 with the corresponding position in sequence 2, another anchor connecting sequence 1 with sequence 3 and a third anchor connecting sequence 1 with sequence 4. Although our anchor points enforced the correct alignment only for a single column, most parts of the core blocks were correctly aligned as shown in Figure <figr fid="F4">4</figr>. The BAliBASE sum-of-pairs score of the resulting alignment was 91% while the column score was 90% as 18 out of 20 columns of the core blocks were correctly aligned. As was generally the case for BAliBASE, the <it>DIALIGN score </it>of the (biologically meaningful) anchored alignment was lower than the score of the (biologically wrong) default alignment. The DIALIGN score of the anchored alignment was 9.82 compared with 11.99 for the non-anchored alignment, so here the score of the anchored alignment was around 18 percent below the score of the non-anchored alignment.</p>
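         <p>The construction of the three length-one anchor points can be sketched as follows. The residue positions are hypothetical placeholders, and the six-field layout (two sequence numbers, two start positions, length, score) is our assumption about how an anchor would be written down; the anchor file format accepted by a given DIALIGN release should be checked in its documentation. The last lines simply recompute the column score and the relative score difference quoted above.</p>

```python
def single_column_anchors(positions, score=1.0):
    """Anchor one alignment column by linking sequence 1 to every other sequence.

    positions[k] is the 1-based position of the anchored residue in
    sequence k + 1. Each anchor is returned as the tuple
    (seq_a, seq_b, start_a, start_b, length, score) with length 1.
    """
    first = positions[0]
    return [(1, k, first, pos, 1, score)
            for k, pos in enumerate(positions[1:], start=2)]


# Hypothetical positions of the first residue of the third core block
# in the four sequences (placeholders, not taken from BAliBASE).
positions = [31, 29, 33, 30]
anchors = single_column_anchors(positions)
for anchor in anchors:
    print(" ".join(str(field) for field in anchor))

# Figures quoted in the text: 18 of 20 core-block columns correct,
# DIALIGN scores 9.82 (anchored) and 11.99 (non-anchored).
column_score = 18 / 20
relative_drop = (11.99 - 9.82) / 11.99
print(f"column score: {100 * column_score:.0f}%")
print(f"anchored score is {100 * relative_drop:.0f}% lower")
```

         <p>Linking every other sequence to sequence 1 is sufficient here because the three pairwise anchors together place all four residues in one column.</p>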
      </sec>
      <sec>
         <st>
            <p>Anchored alignments for phylogenetic footprinting</p>
         </st>
         <p>Evolutionarily conserved regions in non-coding sequences represent a potentially rich source for the discovery of gene regulatory regions. While functional elements are subject to stabilizing selection, the adjacent non-functional DNA evolves much faster. Therefore, blocks of conservation, so-called phylogenetic footprints, can be detected in orthologous non-coding sequences with low overall similarity by comparative genomics <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Alignment algorithms, including DIALIGN, were advocated for this task. As the <it>Hox </it>cluster example above shows, however, anchoring the alignments becomes a necessity in applications to large genomic regions and clusters of paralogous genes. While interspersed repeats are normally removed ("masked") using e.g. <it>RepeatMasker</it>, they need to be taken into account in the context of phylogenetic footprinting: if a sequence motif is conserved over hundreds of millions of years it may well have become a regulatory region even if it is (similar to) a repetitive sequence in some of the organisms under consideration <abbrgrp><abbr bid="B40">40</abbr></abbrgrp>.</p>
         <p>The phylogenetic footprinting program <it>TRACKER </it><abbrgrp><abbr bid="B41">41</abbr></abbrgrp> was designed specifically to search for conserved non-coding sequences in large gene clusters. It is based on a philosophy similar to that of segment-based alignment algorithms. The TRACKER program computes pairwise local alignments of all input sequences using BLASTZ <abbrgrp><abbr bid="B42">42</abbr></abbrgrp> with non-stringent settings. BLASTZ permits alignment of long genomic sequences with large proportions of neutrally evolving regions. A post-processing step aims to remove simple repeats, recognized by their low sequence complexity, and regions of low conservation. The resulting list of pairwise alignments is then assembled into clusters of partially overlapping regions. Here the approach suffers from the same problem as DIALIGN, which is, however, resolved in a different way: instead of producing a single locally optimal alignment, TRACKER lists all maximal compatible sets of pairwise alignments. For the case of Figure <figr fid="F1">1(C)</figr>, for instance, we obtain both <graphic file="1748-7188-1-6-i1.gif"/><it>M</it><sub>2</sub><it>M</it><sub>3 </sub>and <graphic file="1748-7188-1-6-i2.gif"/><it>M</it><sub>2</sub><it>M</it><sub>3</sub>. Since this step is performed based on the overlap of sequence intervals without explicitly considering the sequence information at all, TRACKER is very fast as long as the number of conflicting pairwise alignments remains small. In the final step DIALIGN is used to explicitly calculate the multiple sequence alignments from the subsequences that belong to individual clusters.</p>
         <p>For the initial pairwise local alignment step the search space is restricted to orthologous intergenic regions, parallel strands and chaining hits. Effectively, TRACKER thus computes alignments anchored at the genes from BLASTZ fragments.</p>
         <p>We have noticed <abbrgrp><abbr bid="B43">43</abbr></abbrgrp> that DIALIGN is more sensitive than TRACKER in general. This is due to the detection of smaller and less significant fragments with DIALIGN compared to the larger, contiguous fragments returned by BLASTZ. The combination of BLASTZ and an anchored version of DIALIGN appears to be a very promising approach for phylogenetic footprinting. It makes use of the alignment specificity of BLASTZ and the sensitivity of DIALIGN. A combination of anchoring at appropriate genes (with maximal weight) and BLASTZ hits (with smaller weights proportional e.g. to &#8211; log <it>E </it>values) reduces the CPU requirements for the DIALIGN alignment by more than an order of magnitude. While this is still much slower than TRACKER (20 min vs. 40 s), it increases the sensitivity of the approach by about 30 &#8211; 40% in the Fugu example, see Table <tblr tid="T1">1</tblr>. Work in progress aims at improving the significance measures for local multiple alignments. A more thorough discussion of the application of anchored segment-based alignments to phylogenetic footprinting will be published elsewhere.</p>
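         <p>The weighting scheme mentioned above (maximal weight for gene anchors, smaller weights proportional to &#8211; log <it>E </it>for BLASTZ hits) could be sketched as follows. The function name, the use of the base-10 logarithm, the scale factor and the numerical value chosen for the maximal weight are all illustrative assumptions; the text does not specify them.</p>

```python
import math


def anchor_weight(kind, e_value=None, max_weight=1000.0, scale=1.0):
    """Illustrative anchor weight: genes get max_weight, hits get -scale*log10(E).

    kind is "gene" or "hit"; for hits, e_value must be a non-negative number.
    Weights of hits are clamped to the interval [0, max_weight].
    """
    if kind == "gene":
        return max_weight
    if e_value is None:
        raise ValueError("a BLASTZ hit needs an E value")
    if e_value == 0:
        # An E value reported as zero is treated as maximally significant.
        return max_weight
    weight = -scale * math.log10(e_value)
    return min(max(weight, 0.0), max_weight)


gene_weight = anchor_weight("gene")
strong_hit = anchor_weight("hit", e_value=1e-20)
weak_hit = anchor_weight("hit", e_value=0.5)
print(gene_weight, strong_hit, weak_hit)
```

         <p>Clamping keeps every BLASTZ-derived anchor at or below the weight of a gene anchor, so that the known gene homologies take precedence whenever anchors conflict.</p>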
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Automated alignment procedures are based on simple algorithmic rules. For a given set of input sequences, they try to find an alignment with maximum score in the sense of some underlying objective function. The two basic questions in sequence alignment are therefore (<it>a</it>) to define a meaningful objective function and (<it>b</it>) to design an efficient optimisation algorithm that finds optimal or at least near-optimal alignments with respect to the chosen objective function. Most multi-alignment programs use <it>heuristic </it>optimisation algorithms, i.e. they are, in general, not able to find the mathematically optimal alignment with respect to the objective function. An objective function for sequence alignment should assign <it>numerically </it>high scores to <it>biologically </it>meaningful alignments. However, it is clearly not possible to find a <it>universally </it>applicable objective function that would give highest numerical scores to the biologically correct alignments in all possible situations. This is the main reason why alignment programs may fail to produce biologically reasonable output alignments. In fact, the impossibility of defining a universal objective function constitutes a fundamental limitation for <it>all </it>automated alignment algorithms.</p>
         <p>Often a user is already familiar with a sequence family that he or she wants to align, so some knowledge about existing sequence homologies may be available. Such expert knowledge can be used to direct an otherwise automated alignment procedure. To facilitate the use of expert knowledge for sequence alignment, we proposed an <it>anchored alignment </it>approach where known homologies can be used to restrict the alignment search space. This can clearly improve the quality of the produced output alignments in situations where automatic procedures are not able to produce meaningful alignments. In addition, alignment anchors can be used to reduce the program running time. For the <it>Hox </it>gene clusters that we analyzed, the non-anchored version of DIALIGN produced serious misalignments. We used the known gene boundaries as anchor points to guarantee a correct alignment of these genes to each other.</p>
         <p>There are two possible reasons why automated alignment procedures may fail to produce biologically correct alignments. (<it>a</it>) The chosen objective function may not be in accordance with biology, i.e., it may assign mathematically high scores to biologically wrong alignments. In this case, even efficient optimisation algorithms would lead to meaningless alignments. (<it>b</it>) The mathematically optimal alignment is biologically meaningful, but the employed heuristic optimisation procedure is not able to find the alignment with highest score. For the further development of alignment algorithms, it is crucial to find out which one of these reasons is to blame for mis-alignments produced by existing software programs. If (<it>a</it>) is often observed for an alignment program, efforts should be made to improve its underlying objective function. If (<it>b</it>) is the case, the biological quality of the output alignments can be improved by using a more efficient optimisation algorithm. For DIALIGN, it is unknown how close the produced alignments come to the numerically optimal alignment &#8211; in fact, it is possible to construct example sequences where DIALIGN's greedy heuristic produces alignments with arbitrarily low scores compared with the possible optimal alignment.</p>
         <p>In the Fugu example, Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr>, the <it>numerical </it>alignment score of the (anchored) correct alignment was 13% below the score of the non-anchored alignment. All sequences in Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr> contain only subsets of the 13 <it>Hox </it>paralogy groups, and different sequences contain different genes. For such an extreme data set, it is unlikely that any reasonable objective function would assign an optimal score to the biologically correct alignment. Here, the problem is that sequence similarity no longer coincides with biological homology. The only way of producing good alignments in such situations is <it>to force </it>a program to align certain known homologies to each other. With our anchoring approach we can do this, for example by using known gene boundaries as <it>anchor points</it>.</p>
         <p>For the BAliBASE benchmark database, the total score of the (biologically meaningful) anchored alignments was also below the score of the (biologically wrong) non-anchored default alignments.</p>
         <p>This implies that improved optimisation algorithms will not lead to biologically improved alignments for these sequences. In this case, however, there is some correspondence between sequence similarity and homology, so one should hope that the performance of DIALIGN on these data can be improved by designing better objective functions. An interesting example from BAliBASE is shown in Figure <figr fid="F4">4</figr>. Here, the non-anchored default version of our program produced a complete mis-alignment. However, it was sufficient to enforce the correct alignment of one <it>single </it>column using corresponding anchor points to obtain a meaningful alignment of the entire sequences, in which not only the one anchored column but most of the three core blocks are correctly aligned. This indicates that the correct alignment of the core blocks corresponds to a <it>local maximum </it>in the alignment landscape.</p>
         <p>In contrast, in the teleost <it>HoxA </it>cluster example the numerical score of the anchored alignment was around 15% <it>above </it>the score of the non-anchored alignment. This demonstrates that the greedy optimisation algorithm used by DIALIGN can lead to results with scores far below the optimal alignment. In such situations, improved optimisation algorithms may lead not only to mathematically higher-scoring alignments but also to alignments that are closer to the biologically correct alignment. We will use our anchored-alignment approach systematically to study the efficiency of objective functions and optimisation algorithms for our segment-based approach to multiple sequence alignment.</p>
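         <p>The reasoning used in this section can be summarised as a small decision rule. The following is a minimal sketch under stated assumptions: the function name <it>diagnose</it>, its <it>tolerance</it> parameter and the returned labels are hypothetical and not part of DIALIGN, and the anchored alignment is assumed to be the biologically correct one. The example values are the relative scores quoted above, normalised so that the non-anchored score equals 1.0.</p>

```python
def diagnose(anchored_score, default_score, tolerance=0.0):
    """Compare the score of an anchored (assumed correct) alignment with the default one.

    Returns "optimisation heuristic" if the anchored alignment scores higher, because
    the heuristic then demonstrably missed a better-scoring alignment (case b).
    Returns "objective function" if the anchored alignment scores lower, because the
    biologically correct alignment is then not the numerical optimum (case a).
    Hypothetical helper for illustration only.
    """
    if not default_score > 0:
        raise ValueError("default_score must be positive")
    relative_gain = (anchored_score - default_score) / default_score
    if relative_gain > tolerance:
        return "optimisation heuristic"
    if -relative_gain > tolerance:
        return "objective function"
    return "inconclusive"


if __name__ == "__main__":
    # Fugu example: anchored score 13% below the non-anchored score.
    print(diagnose(0.87, 1.0))
    # Teleost HoxA example: anchored score around 15% above the non-anchored score.
    print(diagnose(1.15, 1.0))
```

         <p>Note that the second outcome does not exclude an additional failure of the heuristic; it only shows that a better optimiser alone could not recover the biologically correct alignment.</p>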
      </sec>
      <sec>
         <st>
            <p>Program availability</p>
         </st>
         <p>The program is available online and as downloadable source code at the G&#246;ttingen Bioinformatics Compute Server (GOBICS) <abbrgrp><abbr bid="B44">44</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Competing interests</p>
         </st>
         <p>The author(s) declare that they have no competing interests.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We would like to thank Jan Weyer-Menkhoff, Isabelle Schneider, Rasmus Steinkamp and Amarendran Subramanian for their support in the software development and evaluation and Peter Meinicke for critically reading the manuscript. The work was supported by DFG grant MO 1048/1-1 to BM, by BMBF grant 01AK803G (Medigrid) to BM and by DFG Bioinformatics Initiative BIZ-6/1-2 to SJP and PFS.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Higgins</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Gibson</snm>
                  <fnm>TJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1994</pubdate>
            <volume>22</volume>
            <fpage>4673</fpage>
            <lpage>4680</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308517</pubid>
                  <pubid idtype="pmpid">7984417</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>DIALIGN: Multiple DNA and Protein Sequence Alignment at BiBiServ</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>W33</fpage>
            <lpage>W36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15215344</pubid>
                  <pubid idtype="doi">10.1093/nar/gnh029</pubid>
                  <pubid idtype="pmcid">441511</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>T-Coffee: a novel algorithm for multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Notredame</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Higgins</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2000</pubdate>
            <volume>302</volume>
            <fpage>205</fpage>
            <lpage>217</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10964570</pubid>
                  <pubid idtype="doi">10.1006/jmbi.2000.4042</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Recent progress in multiple sequence alignment: a survey</p>
            </title>
            <aug>
               <au>
                  <snm>Notredame</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Pharmacogenomics</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>131</fpage>
            <lpage>144</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11966409</pubid>
                  <pubid idtype="doi">10.1517/14622416.3.1.131</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Multiple sequence alignment using partial order graphs</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Grasso</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sharlow</snm>
                  <fnm>MF</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <issue>3</issue>
            <fpage>452</fpage>
            <lpage>464</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11934745</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/18.3.452</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>MUSCLE: multiple sequence alignment with high accuracy and high throughput</p>
            </title>
            <aug>
               <au>
                  <snm>Edgar</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>1792</fpage>
            <lpage>1797</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/nar/gkh340</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>ProbCons: Probabilistic consistency-based multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Do</snm>
                  <fnm>CB</fnm>
               </au>
               <au>
                  <snm>Mahabhashyam</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Brudno</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Batzoglou</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>330</fpage>
            <lpage>340</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15687296</pubid>
                  <pubid idtype="doi">10.1101/gr.2821705</pubid>
                  <pubid idtype="pmcid">546535</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Quality assessment of multiple alignment programs</p>
            </title>
            <aug>
               <au>
                  <snm>Lassmann</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>FEBS Letters</source>
            <pubdate>2002</pubdate>
            <volume>529</volume>
            <fpage>126</fpage>
            <lpage>130</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12354624</pubid>
                  <pubid idtype="doi">10.1016/S0014-5793(02)03189-7</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Benchmarking tools for the alignment of functional noncoding DNA</p>
            </title>
            <aug>
               <au>
                  <snm>Pollard</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Bergman</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>6</fpage>
            <url>http://www.biomedcentral.com/1471-2105/5/6</url>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14736341</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-5-6</pubid>
                  <pubid idtype="pmcid">344529</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A comprehensive comparison of protein sequence alignment programs</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Plewniak</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Poch</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Research</source>
            <pubdate>1999</pubdate>
            <volume>27</volume>
            <fpage>2682</fpage>
            <lpage>2690</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10373585</pubid>
                  <pubid idtype="doi">10.1093/nar/27.13.2682</pubid>
                  <pubid idtype="pmcid">148477</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>A Workbench for Multiple Alignment Construction and Analysis</p>
            </title>
            <aug>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>PROTEINS: Structure, Function and Genetics</source>
            <pubdate>1991</pubdate>
            <volume>9</volume>
            <fpage>180</fpage>
            <lpage>190</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/prot.340090304</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A hierarchical approach to aligning collinear regions of genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Roytberg</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ogurtsov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Shabalina</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>1673</fpage>
            <lpage>1680</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12490453</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/18.12.1673</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>OWEN: aligning long collinear regions of genomes</p>
            </title>
            <aug>
               <au>
                  <snm>Ogurtsov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Roytberg</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Shabalina</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kondrashov</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>1703</fpage>
            <lpage>1704</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12490463</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/18.12.1703</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Progressive Multiple Alignment with Constraints</p>
            </title>
            <aug>
               <au>
                  <snm>Myers</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Selznick</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>J Computational Biology</source>
            <pubdate>1996</pubdate>
            <volume>3</volume>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Divide-and-Conquer Alignment with segment-based constraints</p>
            </title>
            <aug>
               <au>
                  <snm>Sammeth</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Stoye</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics, ECCB special issue</source>
            <pubdate>2003</pubdate>
            <volume>19</volume>
            <fpage>ii189</fpage>
            <lpage>ii195</lpage>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Multiple DNA and protein sequence alignment based on segment-to-segment comparison</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Dress</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Werner</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1996</pubdate>
            <volume>93</volume>
            <fpage>12098</fpage>
            <lpage>12103</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">8901539</pubid>
                  <pubid idtype="doi">10.1073/pnas.93.22.12098</pubid>
                  <pubid idtype="pmcid">37949</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>211</fpage>
            <lpage>218</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10222408</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/15.3.211</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Fast and sensitive multiple alignment of large genomic sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Brudno</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chapman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>G&#246;ttgens</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Batzoglou</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>BMC Bioinformatics</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>66</fpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14693042</pubid>
                  <pubid idtype="doi">10.1186/1471-2105-4-66</pubid>
                  <pubid idtype="pmcid">521198</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Exon Discovery by Genomic Sequence Alignment</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Rinner</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Abdeddaim</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Haase</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Mayer</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Dress</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mewes</snm>
                  <fnm>HW</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>777</fpage>
            <lpage>787</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12075013</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/18.6.777</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Accurate anchoring alignment of divergent sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Huang</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Umbach</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2006</pubdate>
            <volume>22</volume>
            <fpage>29</fpage>
            <lpage>34</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16301203</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/bti772</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Multiple sequence alignment with user-defined constraints at GOBICS</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Werner</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Steinkamp</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Subramanian</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
               <au>
                  <snm>Weyer-Menkhoff</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>1271</fpage>
            <lpage>1273</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15546937</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/bti142</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Sequence alignment with tandem duplication</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>J Computational Biology</source>
            <pubdate>1997</pubdate>
            <volume>4</volume>
            <fpage>351</fpage>
            <lpage>367</lpage>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Detection of internal repeats: how common are they?</p>
            </title>
            <aug>
               <au>
                  <snm>Heringa</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Curr Opin Struc Biol</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>338</fpage>
            <lpage>345</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0959-440X(98)80068-7</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences</p>
            </title>
            <aug>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Applied Mathematics Letters</source>
            <pubdate>2002</pubdate>
            <volume>15</volume>
            <fpage>11</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0893-9659(01)00085-4</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Speeding up the DIALIGN multiple alignment program by using the 'Greedy Alignment of Biological Sequences LIBrary' (GABIOS-LIB)</p>
            </title>
            <aug>
               <au>
                  <snm>Abdedda&#239;m</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Morgenstern</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Lecture Notes in Computer Science</source>
            <pubdate>2001</pubdate>
            <volume>2066</volume>
            <fpage>1</fpage>
            <lpage>11</lpage>
         </bibl>
         <bibl id="B26">
            <title>
               <p>The structural and functional organization of the murine HOX gene family resembles that of <it>Drosophila </it>homeotic genes</p>
            </title>
            <aug>
               <au>
                  <snm>Duboule</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Doll&#233;</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>EMBO J</source>
            <volume>8</volume>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Homeobox genes and axial patterning</p>
            </title>
            <aug>
               <au>
                  <snm>McGinnis</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Krumlauf</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>1992</pubdate>
            <volume>68</volume>
            <fpage>283</fpage>
            <lpage>302</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">1346368</pubid>
                  <pubid idtype="doi">10.1016/0092-8674(92)90471-N</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Ancient Origin of the <it>Hox </it>gene cluster</p>
            </title>
            <aug>
               <au>
                  <snm>Ferrier</snm>
                  <fnm>DEK</fnm>
               </au>
               <au>
                  <snm>Holland</snm>
                  <fnm>PWH</fnm>
               </au>
            </aug>
            <source>Nat Rev Genet</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>33</fpage>
            <lpage>38</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11253066</pubid>
                  <pubid idtype="doi">10.1038/35047605</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Gene duplication and the origins of vertebrate development</p>
            </title>
            <aug>
               <au>
                  <snm>Holland</snm>
                  <fnm>PWH</fnm>
               </au>
               <au>
                  <snm>Garcia-Fern&#225;ndez</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>NA</fnm>
               </au>
               <au>
                  <snm>Sidow</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Development</source>
            <pubdate>1994</pubdate>
            <issue>Suppl</issue>
            <fpage>125</fpage>
            <lpage>133</lpage>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Archetypal organization of the amphioxus Hox gene cluster</p>
            </title>
            <aug>
               <au>
                  <snm>Garcia-Fern&#225;ndez</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Holland</snm>
                  <fnm>PW</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1994</pubdate>
            <volume>370</volume>
            <fpage>563</fpage>
            <lpage>566</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">7914353</pubid>
                  <pubid idtype="doi">10.1038/370563a0</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Zebrafish <it>Hox </it>clusters and vertebrate genome evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Amores</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Force</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Yan</snm>
                  <fnm>YL</fnm>
               </au>
               <au>
                  <snm>Joly</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Amemiya</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fritz</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Ho</snm>
                  <fnm>RK</fnm>
               </au>
               <au>
                  <snm>Langeland</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Prince</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>YL</fnm>
               </au>
               <au>
                  <snm>Westerfield</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ekker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Postlethwait</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>282</volume>
            <fpage>1711</fpage>
            <lpage>1714</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">9831563</pubid>
                  <pubid idtype="doi">10.1126/science.282.5394.1711</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Hox clusters as models for vertebrate genome evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Hoegg</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Meyer</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <issue>8</issue>
            <fpage>421</fpage>
            <lpage>424</lpage>
            <url>http://www.hubmed.org/display.cgi?uids=15967537</url>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15967537</pubid>
                  <pubid idtype="doi">10.1016/j.tig.2005.06.004</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>The fish-specific Hox cluster duplication is coincident with the origin of teleosts</p>
            </title>
            <aug>
               <au>
                  <snm>Crow</snm>
                  <fnm>KD</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
               <au>
                  <snm>Lynch</snm>
                  <fnm>VJ</fnm>
               </au>
               <au>
                  <snm>Amemiya</snm>
                  <fnm>CT</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>GP</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2006</pubdate>
            <volume>23</volume>
            <fpage>121</fpage>
            <lpage>136</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">16162861</pubid>
                  <pubid idtype="doi">10.1093/molbev/msj020</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>The Duplication of the <it>Hox </it>Gene Clusters in Teleost Fishes</p>
            </title>
            <aug>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
            </aug>
            <source>Theor Biosci</source>
            <pubdate>2004</pubdate>
            <volume>123</volume>
            <fpage>89</fpage>
            <lpage>110</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.thbio.2004.03.004</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Molecular evolution of the HoxA cluster in the three major gnathostome lineages</p>
            </title>
            <aug>
               <au>
                  <snm>Chiu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Amemiya</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Dewar</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kim</snm>
                  <fnm>CB</fnm>
               </au>
               <au>
                  <snm>Ruddle</snm>
                  <fnm>FH</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>GP</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>5492</fpage>
            <lpage>5497</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">11943847</pubid>
                  <pubid idtype="doi">10.1073/pnas.052709899</pubid>
                  <pubid idtype="pmcid">122797</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>MicroRNA-directed cleavage of <it>HoxB8 </it>mRNA</p>
            </title>
            <aug>
               <au>
                  <snm>Yekta</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Shih</snm>
                  <fnm>IH</fnm>
               </au>
               <au>
                  <snm>Bartel</snm>
                  <fnm>DP</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>2004</pubdate>
            <volume>304</volume>
            <fpage>594</fpage>
            <lpage>596</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">15105502</pubid>
                  <pubid idtype="doi">10.1126/science.1097434</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Koehl</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ripp</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Poch</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Proteins: Structure, Function, and Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>61</volume>
            <fpage>127</fpage>
            <lpage>136</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1002/prot.20527</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>BAliBASE: A benchmark alignment database for the evaluation of multiple sequence alignment programs</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Plewniak</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Poch</snm>
                  <fnm>O</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>87</fpage>
            <lpage>88</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">10068696</pubid>
                  <pubid idtype="doi">10.1093/bioinformatics/15.1.87</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus): nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints</p>
            </title>
            <aug>
               <au>
                  <snm>Tagle</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Koop</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Goodman</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Slightom</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hess</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of Molecular Biology</source>
            <pubdate>1988</pubdate>
            <volume>203</volume>
            <fpage>439</fpage>
            <lpage>455</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/0022-2836(88)90011-3</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>The consensus sequence of a major Alu subfamily contains a functional retinoic acid response element</p>
            </title>
            <aug>
               <au>
                  <snm>Vansant</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Reynolds</snm>
                  <fnm>WF</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1995</pubdate>
            <volume>92</volume>
            <fpage>8229</fpage>
            <lpage>8233</lpage>
            <url>http://www.hubmed.org/display.cgi?uids=7667273</url>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">41130</pubid>
                  <pubid idtype="pmpid" link="fulltext">7667273</pubid>
                  <pubid idtype="doi">10.1073/pnas.92.18.8229</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Surveying Phylogenetic Footprints in Large Gene Clusters: Applications to Hox Cluster Duplications</p>
            </title>
            <aug>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fried</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Flamm</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
            </aug>
            <source>Mol Phylogenet Evol</source>
            <pubdate>2004</pubdate>
            <volume>31</volume>
            <fpage>581</fpage>
            <lpage>604</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/j.ympev.2003.08.009</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Human-Mouse Alignments with BLASTZ</p>
            </title>
            <aug>
               <au>
                  <snm>Schwartz</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kent</snm>
                  <fnm>WJ</fnm>
               </au>
               <au>
                  <snm>Smit</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Baertsch</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Hardison</snm>
                  <fnm>RC</fnm>
               </au>
               <au>
                  <snm>Haussler</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Genome Research</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>103</fpage>
            <lpage>107</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12529312</pubid>
                  <pubid idtype="doi">10.1101/gr.809403</pubid>
                  <pubid idtype="pmcid">430961</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Phylogenetic Footprint Patterns in Large Gene Clusters</p>
            </title>
            <aug>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Fried</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Flamm</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
            </aug>
            <source>Tech. rep., University of Leipzig, Bioinformatics Group 2003. Extended Abstract: Proceedings of the German Conference on Bioinformatics</source>
            <editor>Mewes H-W, Heun V, Frishman D, Kramer S</editor>
            <pubdate>2003</pubdate>
            <volume>II</volume>
            <fpage>145</fpage>
            <lpage>147</lpage>
            <url>http://www.bioinf.uni-leipzig.de/Publications/POSTERS/P-005abs.pdf</url>
            <note>belleville Verlag Michael Farin, M&#252;nchen</note>
         </bibl>
         <bibl id="B44">
            <title>
               <p>G&#246;ttingen Bioinformatics Compute Server</p>
            </title>
            <url>http://gobics.de/</url>
         </bibl>
         <bibl id="B45">
            <title>
               <p>Bichir <it>HoxA </it>cluster sequence reveals surprising trends in ray-finned fish genomic evolution</p>
            </title>
            <aug>
               <au>
                  <snm>Chiu</snm>
                  <fnm>CH</fnm>
               </au>
               <au>
                  <snm>Dewar</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>GP</fnm>
               </au>
               <au>
                  <snm>Takahashi</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ruddle</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Ledje</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bartsch</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Scemama</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Stellwag</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Fried</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Prohaska</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Stadler</snm>
                  <fnm>PF</fnm>
               </au>
               <au>
                  <snm>Amemiya</snm>
                  <fnm>CT</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>11</fpage>
            <lpage>17</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14707166</pubid>
                  <pubid idtype="doi">10.1101/gr.1712904</pubid>
                  <pubid idtype="pmcid">314268</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
