<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1748-7188-5-11</ui><ji>1748-7188</ji><fm>
<dochead>Research</dochead>
<bibl>
<title>
<p>Efficient algorithms for analyzing segmental duplications with deletions and inversions in genomes</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Kahn</snm><mi>L</mi><fnm>Crystal</fnm><insr iid="I1"/><email>clkahn@cs.brown.edu</email></au>
<au ca="yes" id="A2"><snm>Mozes</snm><fnm>Shay</fnm><insr iid="I1"/><email>shay@cs.brown.edu</email></au>
<au ca="yes" id="A3"><snm>Raphael</snm><mi>J</mi><fnm>Benjamin</fnm><insr iid="I1"/><insr iid="I2"/><email>braphael@cs.brown.edu</email></au>
</aug>
<insg>
<ins id="I1"><p>Department of Computer Science, Brown University, Providence, RI 02912, USA</p></ins>
<ins id="I2"><p>Center for Computational Molecular Biology, Brown University, Providence, RI 02912, USA</p></ins>
</insg>
<source>Algorithms for Molecular Biology</source>
<issn>1748-7188</issn>
<pubdate>2010</pubdate>
<volume>5</volume>
<issue>1</issue>
<fpage>11</fpage>
<url>http://www.almob.org/content/5/1/11</url>
<xrefbib><pubidlist><pubid idtype="doi">10.1186/1748-7188-5-11</pubid><pubid idtype="pmpid">20047668</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>11</day><month>8</month><year>2009</year></date></rec><acc><date><day>4</day><month>1</month><year>2010</year></date></acc><pub><date><day>4</day><month>1</month><year>2010</year></date></pub></history>
<cpyrt><year>2010</year><collab>Kahn et al; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<sec>
<st>
<p>Background</p>
</st>
<p>Segmental duplications, or low-copy repeats, are common in mammalian genomes. In the human genome, most segmental duplications are mosaics composed of multiple duplicated fragments. This complex genomic organization complicates analysis of the evolutionary history of these sequences. One model proposed to explain these mosaic patterns is a model of repeated aggregation and subsequent duplication of genomic sequences.</p>
</sec>
<sec>
<st>
<p>Results</p>
</st>
<p>We describe a polynomial-time exact algorithm to compute duplication distance, a genomic distance defined as the most parsimonious way to build a target string by repeatedly copying substrings of a fixed source string. This distance models the process of repeated aggregation and duplication. We also describe extensions of this distance to include certain types of substring deletions and inversions. Finally, we provide a description of a sequence of duplication events as a context-free grammar (CFG).</p>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>These new genomic distances will permit more biologically realistic analyses of segmental duplications in genomes.</p>
</sec>
</sec>
</abs>
</fm><meta>
<classifications>
<classification id="wabi" subtype="theme_series_title" type="BMC">Selected papers from WABI 09</classification>
<classification id="wabi" subtype="theme_series_editor" type="BMC">Tandy Warnow and Steven Salzberg</classification>
</classifications>
</meta><bdy>
<sec>
<st>
<p>Introduction</p>
</st>
<p>Genomes evolve via many types of mutations ranging in scale from single nucleotide mutations to large genome rearrangements. Computational models of these mutational processes allow researchers to derive similarity measures between genome sequences and to reconstruct evolutionary relationships between genomes. For example, considering chromosomal inversions as the only type of mutation leads to the so-called reversal distance problem of finding the minimum number of inversions/reversals that transform one genome into another <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. Several elegant polynomial-time algorithms have been found to solve this problem (cf. <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp> and references therein). Developing genome rearrangement models that are both biologically realistic <it>and </it>computationally tractable remains an active area of research.</p>
<p>Duplicated sequences in genomes present a particular challenge for genome rearrangement analysis and often make the underlying computational problems more difficult. For instance, computing reversal distance in genomes with duplicated segments is NP-hard <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>. Models that include both duplications and other types of mutations - such as inversions - often result in similarity measures that cannot be computed efficiently. Thus, most current approaches for duplication analysis rely on heuristics, approximation algorithms, or restricted models of duplication <abbrgrp>
<abbr bid="B3">3</abbr>
<abbr bid="B4">4</abbr>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
</abbrgrp>. For example, there are efficient algorithms for computing tandem duplication histories <abbrgrp>
<abbr bid="B8">8</abbr>
<abbr bid="B9">9</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
</abbrgrp> and whole-genome duplication histories <abbrgrp>
<abbr bid="B12">12</abbr>
<abbr bid="B13">13</abbr>
</abbrgrp>. Here we consider another class of duplications: large segmental duplications (also known as low-copy repeats) that are common in many mammalian genomes <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp>. These segmental duplications can be quite large (up to hundreds of kilobases), but their evolutionary history remains poorly understood, particularly in primates. The mystery surrounding them is due in part to their complex organization; many segmental duplications are found within contiguous regions of the genome called <it>duplication blocks </it>that contain mosaic patterns of smaller repeated segments, or <it>duplicons </it>
<abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>. Duplication blocks that are located on different chromosomes, or that are separated by large physical distances on a chromosome, often share sequences of duplicons <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>. These conserved sequences suggest that these duplicons were copied together across large genomic distances. One hypothesis proposed to explain these conserved mosaic patterns is a two-step model of duplication <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp>. In this model, a first phase of duplications copies duplicons from the ancestral genome and aggregates these copies into primary duplication blocks. Then in a second phase, portions of these primary duplication blocks are copied and reinserted into the genome at disparate loci forming secondary duplication blocks.</p>
<p>In <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp>, we introduced a measure called <it>duplication distance </it>that models the duplication of contiguous substrings over large genomic distances. We used duplication distance in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp> to find the most parsimonious duplication scenario consistent with the two-step model of segmental duplication. The duplication distance from a source string <b>x </b>to a target string <b>y </b>is the minimum number of substrings of <b>x </b>that can be sequentially copied from <b>x </b>and pasted into an initially empty string in order to construct <b>y</b>. We derived an efficient exact algorithm for computing the duplication distance between a pair of strings. Note that the string <b>x </b>does <it>not </it>change during the sequence of duplication events. Moreover, duplication distance does not model local rearrangements, such as tandem duplications, deletions, or inversions, that occur within a duplication block during its construction. While such local rearrangements undoubtedly occur in genome evolution, the duplication distance model focuses on identifying the duplicate operations that account for the construction of repeated patterns within duplication blocks by aggregating substrings of other duplication blocks over large genomic distances. Thus, like nearly every other genome rearrangement model, the duplication distance model makes some simplifying assumptions about the underlying biology to achieve computational tractability. Here, we extend the duplication distance measure to include certain types of deletions and inversions. These extensions make our model less restrictive - although we still maintain the restriction that <b>x </b>is unchanged - and permit the construction of richer, and perhaps more biologically plausible, duplication scenarios. In particular, our contributions are the following.</p>
<sec>
<st>
<p>Summary of Contributions</p>
</st>
<p>Let <it>&#956;</it>(<b>x</b>) denote the maximum number of times any single character appears in the string <b>x</b>. Let |<b>x</b>| denote the length of <b>x</b>.</p>
<p>1. We provide an <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<it>&#956;</it>(<b>x</b>) <it>&#956;</it>(<b>y</b>))-time algorithm to compute the distance between (signed) strings <b>x </b>and <b>y </b>when duplication and certain types of deletion operations are permitted.</p>
<p>2. We provide an <it>O</it>(|<b>y</b>|<sup>2</sup>
<it>&#956;</it>(<b>x</b>) <it>&#956;</it>(<b>y</b>))-time algorithm to compute the distance between (signed) strings <b>x </b>and <b>y </b>when duplicated strings may be inverted before being inserted into the target string.</p>
<p>3. We provide an <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>))-time algorithm to compute the distance between signed strings <b>x </b>and <b>y </b>when duplicated strings may be inverted before being inserted into the target string, and deletion operations are also permitted.</p>
<p>4. We provide an <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<sup>3</sup>
<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>))-time algorithm to compute the distance between signed strings <b>x </b>and <b>y </b>when any substring of the duplicated string may be inverted before being inserted into the target string. Deletion operations are also permitted.</p>
<p>5. We provide a formal proof of correctness of the duplication distance recurrence presented in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>. No proof of correctness was previously given.</p>
<p>6. We show how a sequence of duplicate operations that generates a string can be described by a context-free grammar (CFG).</p>
</sec>
</sec>
<sec>
<st>
<p>Preliminaries</p>
</st>
<p>We begin by reviewing some definitions and notation that were introduced in <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp> and <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>. Let &#8709; denote the empty string. For a string <b>x </b>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>n</it>
</sub>, let <b>x</b>
<sub>
<it>i</it>, <it>j </it>
</sub>denote the substring <it>x</it>
<sub>
<it>i</it>
</sub>
<it>x</it>
<sub>
<it>i</it>+1 </sub>. . . <it>x</it>
<sub>
<it>j </it>
</sub>. We define a <it>subsequence S </it>of <b>x </b>to be a string <inline-formula>
<graphic file="1748-7188-5-11-i1.gif"/>
</inline-formula> with <it>i</it>
<sub>1 </sub>&lt;<it>i</it>
<sub>2 </sub>&lt; &#8943; &lt;<it>i</it>
<sub>
<it>k</it>
</sub>. We represent <it>S </it>by listing the indices at which the characters of <it>S </it>occur in <b>x</b>. For example, if <b>x </b>= <it>abcdef</it>, then the subsequence <it>S </it>= (1, 3, 5) is the string <it>ace</it>. Note that every substring is a subsequence, but a subsequence need not be a substring since the characters comprising a subsequence need not be contiguous. For a pair of subsequences <it>S</it>
<sub>1</sub>, <it>S</it>
<sub>2</sub>, denote by <it>S</it>
<sub>1 </sub>&#8745; <it>S</it>
<sub>2 </sub>the maximal subsequence common to both <it>S</it>
<sub>1 </sub>and <it>S</it>
<sub>2</sub>.</p>
<p>
<b>Definition 1</b>. <it>Subsequences S </it>= (<it>s</it>
<sub>1</sub>, <it>s</it>
<sub>2</sub>) <it>and T </it>= (<it>t</it>
<sub>1</sub>, <it>t</it>
<sub>2</sub>) <it>of a string <b>x </b>are <b>alternating </b>in <b>x </b>if either s</it>
<sub>1 </sub>&lt;<it>t</it>
<sub>1 </sub>&lt;<it>s</it>
<sub>2 </sub>&lt;<it>t</it>
<sub>2 </sub>
<it>or t</it>
<sub>1 </sub>&lt;<it>s</it>
<sub>1 </sub>&lt;<it>t</it>
<sub>2 </sub>&lt;<it>s</it>
<sub>2</sub>.</p>
<p>
<b>Definition 2</b>. <it>Subsequences S </it>= (<it>s</it>
<sub>1</sub>, . . ., <it>s</it>
<sub>
<it>k</it>
</sub>) <it>and T </it>= (<it>t</it>
<sub>1</sub>, . . ., <it>t</it>
<sub>
<it>l</it>
</sub>) <it>of a string <b>x </b>are <b>overlapping </b>in <b>x </b>if there exist indices i, i' and j, j' such that </it>1 &#8804; <it>i </it>&lt;<it>i' </it>&#8804; <it>k</it>, 1 &#8804; <it>j </it>&lt;<it>j' </it>&#8804; <it>l</it>, <it>and </it>(<it>s</it>
<sub>
<it>i</it>
</sub>, <it>s</it>
<sub>
<it>i</it>'</sub>) <it>and </it>(<it>t</it>
<sub>
<it>j</it>
</sub>, <it>t</it>
<sub>
<it>j</it>'</sub>) <it>are alternating in <b>x</b>. See Figure </it>
<figr fid="F1">1</figr>.</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Overlapping</p></caption><text>
   <p><b>Overlapping</b>. The red subsequence is overlapping with the blue subsequence in <b>x</b>. The indices (<it>s</it><sub><it>i</it></sub>, <it>s</it><sub><it>i</it>'</sub>) and (<it>t</it><sub><it>j</it></sub>, <it>t</it><sub><it>j</it>'</sub>) are alternating in <b>x</b>.</p>
</text><graphic file="1748-7188-5-11-1"/></fig>
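<p>Definitions 1 and 2 can be checked mechanically. The following Python sketch is illustrative only: the function names <it>alternating </it>and <it>overlapping </it>are hypothetical, and subsequences are assumed to be given as increasing tuples of indices, as in the example <it>S </it>= (1, 3, 5) above.</p>

```python
from itertools import combinations


def alternating(s, t):
    """Definition 1: the index pairs s = (s1, s2) and t = (t1, t2) alternate."""
    s1, s2 = s
    t1, t2 = t
    # Either s1, t1, s2, t2 or t1, s1, t2, s2 is strictly increasing.
    return (t2 > s2 > t1 > s1) or (s2 > t2 > s1 > t1)


def overlapping(S, T):
    """Definition 2: some pair of indices of S alternates with some pair of T.

    S and T are assumed to be increasing tuples of indices, so
    combinations(S, 2) yields exactly the pairs (s_i, s_i') with i before i'.
    """
    return any(
        alternating(s_pair, t_pair)
        for s_pair in combinations(S, 2)
        for t_pair in combinations(T, 2)
    )
```

<p>For instance, under these assumptions (1, 3) and (2, 4) overlap, whereas (1, 4) and (2, 3) do not, because the second pair lies entirely between the two indices of the first.</p>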
<p>
<b>Definition 3</b>. <it>Given subsequences S </it>= (<it>s</it>
<sub>1</sub>, . . ., <it>s</it>
<sub>
<it>k</it>
</sub>) <it>and T </it>= (<it>t</it>
<sub>1</sub>, . . ., <it>t</it>
<sub>
<it>l</it>
</sub>) <it>of a string <b>x</b>, S is <b>inside </b>of T if there exists an index i such that </it>1 &#8804; <it>i </it>&lt;<it>l and t</it>
<sub>
<it>i </it>
</sub>&lt;<it>s</it>
<sub>1 </sub>&lt;<it>s</it>
<sub>
<it>k </it>
</sub>&lt;<it>t</it>
<sub>
<it>i</it>+1</sub>. <it>That is, the entire subsequence S occurs in between successive characters of T. See Figure </it>
<figr fid="F2">2</figr>.</p>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Inside</p></caption><text>
   <p><b>Inside</b>. The red subsequence is inside the blue subsequence <it>T</it>. All the characters of the red subsequence occur between the indices <it>t</it><sub><it>i </it></sub>and <it>t</it><sub><it>i</it>+1 </sub>of <it>T</it>.</p>
</text><graphic file="1748-7188-5-11-2"/></fig>
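<p>A minimal Python sketch of the test in Definition 3 follows. The function name <it>inside </it>is hypothetical, and subsequences are again assumed to be non-empty increasing tuples of indices. As a further assumption, a subsequence with a single index is accepted whenever that index falls strictly between two successive indices of <it>T</it>.</p>

```python
def inside(S, T):
    """Definition 3: all of S lies strictly between two successive indices of T.

    S and T are assumed to be non-empty increasing tuples of indices.
    Assumption: a one-element S counts as inside when its single index
    falls strictly between two successive indices of T.
    """
    first, last = S[0], S[-1]
    # zip(T, T[1:]) enumerates the successive pairs (t_i, t_{i+1}).
    return any(first > lo and hi > last for lo, hi in zip(T, T[1:]))
```

<p>With these conventions, (2, 3) is inside (1, 4, 6), while (2, 5) is not, since no gap between successive indices of (1, 4, 6) contains both 2 and 5.</p>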
<p>
<b>Definition 4</b>. <it>A <b>duplicate operation </b>from <b>x</b>, &#948;<sub>
<it>x</it>
</sub>
</it>(<it>s, t, p</it>), <it>copies a substring x<sub>
<it>s </it>
</sub>
</it>. . . <it>x</it>
<sub>
<it>t </it>
</sub>
<it>of the source string <b>x </b>and pastes it into a target string at position p. Specifically, if <b>x </b>
</it>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>m </it>
</sub>
<it>and <b>z </b>
</it>= <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>n</it>
</sub>, <it>then <b>z </b>
</it>&#8728; <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s, t, p</it>) = <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>p</it>-1</sub>
<it>x</it>
<sub>
<it>s </it>
</sub>. . . <it>x</it>
<sub>
<it>t</it>
</sub>
<it>z</it>
<sub>
<it>p</it>
</sub>. . . <it>z</it>
<sub>
<it>n</it>
</sub>. <it>See Figure </it>
<figr fid="F3">3</figr>.</p>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>A duplicate operation</p></caption><text>
   <p><b>A duplicate operation</b>. In a duplicate operation, denoted <it>&#948;</it><sub><it>x</it></sub>(<it>s, t, p</it>), a substring <it>x</it><sub><it>s</it></sub><it>x</it><sub><it>s</it>+1 </sub>. . . <it>x</it><sub><it>t </it></sub>of the source string <b>x </b>is copied and inserted into the target string <b>z </b>at index <it>p</it>.</p>
</text><graphic file="1748-7188-5-11-3"/></fig>
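<p>The duplicate operation of Definition 4 can be sketched on Python strings as follows. The function name <it>duplicate </it>and the error handling are illustrative assumptions; the arguments <it>s</it>, <it>t </it>and <it>p </it>are 1-based, as in the definition.</p>

```python
def duplicate(x, z, s, t, p):
    """Apply the duplicate operation delta_x(s, t, p) of Definition 4 to z.

    Copies x_s ... x_t (1-based, inclusive) and pastes it into z so that
    the copy starts at position p of the result.
    """
    if s not in range(1, len(x) + 1) or t not in range(s, len(x) + 1):
        raise ValueError("s..t must describe a non-empty substring of x")
    if p not in range(1, len(z) + 2):
        raise ValueError("p must lie between 1 and |z| + 1")
    copied = x[s - 1:t]
    # z_1 ... z_{p-1}, then the copy, then z_p ... z_n.
    return z[:p - 1] + copied + z[p - 1:]
```

<p>For example, with <b>x </b>= <it>abcdef</it>, applying the operation with (<it>s, t, p</it>) = (1, 3, 1) to the empty string yields <it>abc</it>, and then applying (5, 6, 2) yields <it>aefbc</it>. The source string itself is never modified.</p>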
<p>
<b>Definition 5</b>. <it>The <b>duplication distance </b>from a source string <b>x </b>to a target string <b>y </b>is the minimum number of duplicate operations from <b>x </b>that generates <b>y </b>from an initially empty target string. That is, <b>y </b>
</it>= &#8709; &#8728; <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>1</sub>, <it>t</it>
<sub>1</sub>, <it>p</it>
<sub>1</sub>) &#8728; <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>2</sub>, <it>t</it>
<sub>2</sub>, <it>p</it>
<sub>2</sub>) &#8728; &#8943; &#8728; <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>
<it>l</it>
</sub>, <it>t</it>
<sub>
<it>l</it>
</sub>, <it>p</it>
<sub>
<it>l</it>
</sub>).</p>
<p>To compute the duplication distance from <b>x </b>to <b>y</b>, we assume that every character in <b>y </b>appears at least once in <b>x</b>. Otherwise, the duplication distance is undefined.</p>
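<p>On very small inputs, Definition 5 can be evaluated directly by exhaustive search. The Python sketch below is not the algorithm of this paper: it is a brute-force breadth-first search, with the hypothetical names <it>brute_force_distance </it>and <it>is_subsequence</it>, intended only as a reference against which an efficient implementation could be checked. It relies on the fact that duplicate operations never delete or reorder characters, so every intermediate target string must be a subsequence of <b>y</b>. It returns <it>None </it>when the distance is undefined.</p>

```python
from collections import deque


def is_subsequence(z, y):
    """True if the characters of z occur in y in the same order."""
    remaining = iter(y)
    return all(ch in remaining for ch in z)


def brute_force_distance(x, y):
    """Breadth-first search over target strings (exponential; tiny inputs only)."""
    substrings = {x[s:t] for s in range(len(x)) for t in range(s + 1, len(x) + 1)}
    seen = {""}
    queue = deque([("", 0)])
    while queue:
        z, steps = queue.popleft()
        if z == y:
            return steps
        for w in substrings:
            if len(z) + len(w) > len(y):
                continue
            for p in range(len(z) + 1):
                candidate = z[:p] + w + z[p:]
                # Characters are never removed, so prune non-subsequences of y.
                if candidate not in seen and is_subsequence(candidate, y):
                    seen.add(candidate)
                    queue.append((candidate, steps + 1))
    return None  # some character of y does not occur in x
```

<p>For example, with <b>x </b>= <it>ab </it>and <b>y </b>= <it>aabb</it>, the search finds two operations: copy <it>ab</it>, then insert a second copy of <it>ab </it>between its two characters.</p>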
</sec>
<sec>
<st>
<p>Duplication Distance</p>
</st>
<p>In this section we review the basic recurrence for computing duplication distance that was introduced in <abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp>. The recurrence examines the characters of the target string, <b>y</b>, and considers the sets of characters of <b>y </b>that could have been <it>generated</it>, or copied from the source string in a single duplicate operation. Such a set of characters of <b>y </b>necessarily corresponds to a substring of the source <b>x </b>(see Def. 4). Moreover, these characters must be a subsequence of <b>y</b>. This is because, in a sequence of duplicate operations, once a string is copied and inserted into the target string, subsequent duplicate operations do not affect the order of the characters in the previously inserted string. Because every character of <b>y </b>is generated by exactly one duplicate operation, a sequence of duplicate operations that generates <b>y </b>partitions the characters of <b>y </b>into disjoint subsequences, each of which is generated in a single duplicate operation. A more interesting observation is that these subsequences are mutually non-overlapping. We formalize this property as follows.</p>
<p>
<b>Lemma 1 (Non-overlapping Property)</b>. <it>Consider a source string <b>x </b>and a sequence of duplicate operations of the form &#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>
<it>i</it>
</sub>, <it>t</it>
<sub>
<it>i</it>
</sub>, <it>p</it>
<sub>
<it>i</it>
</sub>) <it>that generates the final target string <b>y </b>from an initially empty target string. The substrings </it>
<inline-formula>
<graphic file="1748-7188-5-11-i2.gif"/>
</inline-formula>
<it>of <b>x </b>that are duplicated during the construction of <b>y </b>appear as mutually non-overlapping subsequences of <b>y</b>
</it>.</p>
<p>
<it>Proof</it>. Consider a sequence of duplicate operations <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>1</sub>, <it>t</it>
<sub>1</sub>, <it>p</it>
<sub>1</sub>), . . ., <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>
<it>k</it>
</sub>, <it>t</it>
<sub>
<it>k</it>
</sub>, <it>p</it>
<sub>
<it>k</it>
</sub>) that generates <b>y </b>from an initially empty target string. For 1 &#8804; <it>i </it>&#8804; <it>k</it>, let <b>z</b>
<sup>
<it>i </it>
</sup>be the intermediate target string that results from <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>1</sub>, <it>t</it>
<sub>1</sub>, <it>p</it>
<sub>1</sub>) &#8728; &#8943; &#8728; <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s</it>
<sub>
<it>i</it>
</sub>, <it>t</it>
<sub>
<it>i</it>
</sub>, <it>p</it>
<sub>
<it>i</it>
</sub>). Note that <b>z</b>
<sup>
<it>k </it>
</sup>= <b>y</b>. For <it>j </it>&#8804; <it>i</it>, let <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> be the subsequence of <b>z</b>
<sup>
<it>i </it>
</sup>that corresponds to the characters duplicated by the <it>j</it>
<sup>
<it>th </it>
</sup>operation. We shall show by induction on the length <it>i </it>of the sequence that <inline-formula>
<graphic file="1748-7188-5-11-i4.gif"/>
</inline-formula> are pairwise non-overlapping subsequences of <b>z</b>
<sup>
<it>i</it>
</sup>. For the base case, when there is a single duplicate operation, the non-overlapping property holds trivially because there is only one subsequence. Assume now that <inline-formula>
<graphic file="1748-7188-5-11-i5.gif"/>
</inline-formula>, . . . <inline-formula>
<graphic file="1748-7188-5-11-i6.gif"/>
</inline-formula> are mutually non-overlapping subsequences in <b>z</b>
<sup>
<it>i </it>-1</sup>. For the induction step note that, by the definition of a duplicate operation, <inline-formula>
<graphic file="1748-7188-5-11-i7.gif"/>
</inline-formula> is inserted as a contiguous substring into <b>z</b>
<sup>
<it>i</it>-1 </sup>at location <it>p</it>
<sub>
<it>i </it>
</sub>to form <b>z</b>
<sup>
<it>i</it>
</sup>. Therefore, for any <it>j</it>, <it>j' </it>&lt;<it>i</it>, if <inline-formula>
<graphic file="1748-7188-5-11-i8.gif"/>
</inline-formula> and <inline-formula>
<graphic file="1748-7188-5-11-i9.gif"/>
</inline-formula> are non-overlapping in <b>z</b>
<sup>
<it>i</it>-1 </sup>then <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> and <inline-formula>
<graphic file="1748-7188-5-11-i10.gif"/>
</inline-formula> are non-overlapping in <b>z</b>
<sup>
<it>i</it>
</sup>. It remains to show that for any <it>j </it>&lt;<it>i</it>, <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> and <inline-formula>
<graphic file="1748-7188-5-11-i7.gif"/>
</inline-formula> are non-overlapping in <b>z</b>
<sup>
<it>i</it>
</sup>. There are two cases: (1) the elements of <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> are either all smaller or all greater than the elements of <inline-formula>
<graphic file="1748-7188-5-11-i7.gif"/>
</inline-formula> or (2) <inline-formula>
<graphic file="1748-7188-5-11-i7.gif"/>
</inline-formula> is inside of <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> in <b>z</b>
<sup>
<it>i </it>
</sup>(Definition 3). In either case, <inline-formula>
<graphic file="1748-7188-5-11-i3.gif"/>
</inline-formula> and <inline-formula>
<graphic file="1748-7188-5-11-i7.gif"/>
</inline-formula> are not overlapping in <b>z</b>
<sup>
<it>i </it>
</sup>as required.</p>
<p>The non-overlapping property leads to an efficient recurrence that computes duplication distance. When considering subsequences of the final target string <b>y </b>that might have been generated in a single duplicate operation, we rely on the non-overlapping property to identify substrings of <b>y </b>that can be treated as independent subproblems. If we assume that some subsequence <it>S </it>of <b>y </b>is produced in a single duplicate operation, then we know that all other subsequences of <b>y </b>that correspond to duplicate operations cannot overlap the characters in <it>S</it>. Therefore, the substrings of <b>y </b>in between successive characters of <it>S </it>define subproblems that are computed independently.</p>
<p>In order to find the optimal (i.e. minimum) sequence of duplicate operations that generate <b>y</b>, we must consider all subsequences of <b>y </b>that could have been generated by a single duplicate operation. The recurrence is based on the observation that <it>y</it>
<sub>1 </sub>must be the first (i.e. leftmost) character to be copied from <b>x </b>in some duplicate operation. There are then two cases to consider: either (1) <it>y</it>
<sub>1 </sub>was the last (or rightmost) character in the substring that was duplicated from <b>x </b>to generate <it>y</it>
<sub>1</sub>, or (2) <it>y</it>
<sub>1 </sub>was not the last character in the substring that was duplicated from <b>x </b>to generate <it>y</it>
<sub>1</sub>.</p>
<p>The recurrence defines two quantities: <it>d</it>(<b>x</b>, <b>y</b>) and <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>). We shall show, by induction, that for a pair of strings, <b>x </b>and <b>y</b>, the value <it>d</it>(<b>x</b>, <b>y</b>) is equal to the duplication distance from <b>x </b>to <b>y </b>and that <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) is equal to the duplication distance from <b>x </b>to <b>y </b>under the restriction that the character <it>y</it>
<sub>1 </sub>is copied from index <it>i </it>in <b>x</b>, i.e. <it>x</it>
<sub>
<it>i </it>
</sub>
<it>generates y</it>
<sub>1</sub>. The value <it>d</it>(<b>x</b>, <b>y</b>) is the minimum of <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) over all characters <it>x</it>
<sub>
<it>i </it>
</sub>of <b>x </b>that can generate <it>y</it>
<sub>1 </sub>(see Eq. 1).</p>
<p>As described above, we must consider two possibilities in order to compute <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>). Either:</p>
<p>Case 1: <it>y</it>
<sub>1 </sub>was the last (or rightmost) character in the substring of <b>x </b>that was copied to produce <it>y</it>
<sub>1 </sub>(see Fig. <figr fid="F4">4</figr>), or</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Recurrence: Case 1</p></caption><text>
   <p><b>Recurrence: Case 1</b>. <it>y</it><sub>1 </sub>is generated from <it>x</it><sub><it>i </it></sub>in a duplicate operation where <it>y</it><sub>1 </sub>is the last (rightmost) character in the copied substring (Case 1). The total duplication distance is one plus the duplication distance for the suffix <b>y</b><sub>2,|<b>y</b>|</sub>.</p>
</text><graphic file="1748-7188-5-11-4"/></fig>
<p>Case 2: <it>x</it>
<sub>
<it>i</it>+1 </sub>is also copied in the same duplicate operation as <it>x</it>
<sub>
<it>i</it>
</sub>, possibly along with other characters as well (see Fig. <figr fid="F5">5</figr>).</p>
<fig id="F5"><title><p>Figure 5</p></title><caption><p>Recurrence: Case 2</p></caption><text>
   <p><b>Recurrence: Case 2</b>. <it>y</it><sub>1 </sub>is generated from <it>x</it><sub><it>i </it></sub>in a duplicate operation where <it>y</it><sub>1 </sub>is not the last (rightmost) character in a copied substring (Case 2). In this case, <it>x</it><sub><it>i</it>+1 </sub>is also copied in the same duplicate operation (top). Thus, the duplication distance is the sum of <it>d</it>(<b>x</b>, <b>y</b><sub>2, <it>j</it>-1</sub>), the duplication distance for <b>y</b><sub>2, <it>j</it>-1 </sub>(bottom left), and <it>d</it><sub><it>i</it>+1</sub>(<b>x</b>, <b>y</b><sub><it>j</it>, |<b>y</b>|</sub>), the minimum number of duplicate operations to generate <b>y</b><sub><it>j</it>, |<b>y</b>| </sub>given that <it>x</it><sub><it>i</it>+1 </sub>generates <it>y</it><sub><it>j </it></sub>(bottom right).</p>
</text><graphic file="1748-7188-5-11-5"/></fig>
<p>For case one, the minimum number of duplicate operations is one - for the duplicate that generates <it>y</it>
<sub>1 </sub>- plus the minimum number of duplicate operations to generate the suffix of <b>y</b>, giving a total of 1 + <it>d</it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>) (Fig. <figr fid="F4">4</figr>). For case two, Lemma 1 implies that the minimum number of duplicate operations is the sum of the optimal numbers of operations for two independent subproblems. Specifically, for each <it>j </it>&gt; 1 such that <it>x</it>
<sub>
<it>i</it>+1 </sub>= <it>y</it>
<sub>
<it>j </it>
</sub>we compute: (i) the minimum number of duplicate operations needed to build the substring <b>y</b>
<sub>2, <it>j</it>-1</sub>, namely <it>d</it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>), and (ii) the minimum number of duplicate operations needed to build the string <it>y</it>
<sub>1</sub>
<b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>, given that <it>y</it>
<sub>1 </sub>is generated by <it>x</it>
<sub>
<it>i </it>
</sub>and <it>y</it>
<sub>
<it>j </it>
</sub>is generated by <it>x</it>
<sub>
<it>i</it>+1</sub>. To compute the latter, recall that since <it>x</it>
<sub>
<it>i</it>
</sub>and <it>x</it>
<sub>
<it>i</it>+1 </sub>are copied in the same duplicate operation, the number of duplicates necessary to generate <it>y</it>
<sub>1</sub>
<b>y</b>
<sub>
<it>j</it>,|<b>y</b>| </sub>using <it>x</it>
<sub>
<it>i </it>
</sub>and <it>x</it>
<sub>
<it>i</it>+1 </sub>is equal to the number of duplicates necessary to generate <b>y</b>
<sub>
<it>j</it>,|<b>y</b>| </sub>using <it>x</it>
<sub>
<it>i</it>+1</sub>, namely <it>d</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) (see Fig. <figr fid="F5">5</figr> and Eq. 2).</p>
<p>The recurrence is, therefore:</p>
<p>
<display-formula id="M1">
<graphic file="1748-7188-5-11-i11.gif"/>
</display-formula>
</p>
<p>
<display-formula id="M2">
<graphic file="1748-7188-5-11-i12.gif"/>
</display-formula>
</p>
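<p>The recurrence described above (Cases 1 and 2) can be sketched as a memoized Python function. This is an illustrative reconstruction from the prose rather than the authors' implementation: the name <it>duplication_distance</it>, the 0-based half-open indexing of substrings of <b>y</b>, and the use of <it>lru_cache </it>for memoization are all assumptions. The inner function <it>d </it>plays the role of Eq. 1 and <it>d_i </it>the role of Eq. 2.</p>

```python
from functools import lru_cache


def duplication_distance(x, y):
    """Sketch of the duplication distance recurrence (no deletions or inversions)."""
    if not set(y).issubset(x):
        raise ValueError("undefined: some character of y does not occur in x")

    @lru_cache(maxsize=None)
    def d(a, b):
        """Distance for the substring y[a:b] (0-based, half-open)."""
        if a >= b:
            return 0  # the empty string needs no operations
        # Minimum over every x[i] that can generate the first character.
        return min(d_i(i, a, b) for i in range(len(x)) if x[i] == y[a])

    @lru_cache(maxsize=None)
    def d_i(i, a, b):
        """Distance for y[a:b] given that x[i] generates y[a]."""
        # Case 1: x[i] is the last character copied in its operation.
        best = 1 + d(a + 1, b)
        # Case 2: x[i + 1] is copied in the same operation and generates y[j].
        if i + 1 != len(x):
            for j in range(a + 1, b):
                if y[j] == x[i + 1]:
                    best = min(best, d(a + 1, j) + d_i(i + 1, j, b))
        return best

    return d(0, len(y))
```

<p>For example, under these assumptions the call with <b>x </b>= <it>abc </it>and <b>y </b>= <it>abcabc </it>returns 2, and the call with <b>x </b>= <it>abc </it>and <b>y </b>= <it>cba </it>returns 3, since no two characters of <it>cba </it>appear in the order required for a single copy. The sketch recurses to a depth proportional to the length of <b>y</b>, so long targets would need an iterative table instead.</p>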
<p>
<b>Theorem 1</b>. <it>d</it>(<b>
<it>x, y</it>
</b>) <it>is the minimum number of duplicate operations that generate <b>y </b>from <b>x</b>
</it>. <it>For </it>{<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>}, <it>d</it>
<sub>
<it>i</it>
</sub>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the minimum number of duplicate operations that generate <b>y </b>from <b>x </b>such that y</it>
<sub>1 </sub>
<it>is generated by x</it>
<sub>
<it>i</it>
</sub>.</p>
<p>
<it>Proof</it>. Let <it>OPT</it>(<b>x</b>, <b>y</b>) denote the minimum length of a sequence of duplicate operations that generate <b>y </b>from <b>x</b>. Let <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) denote the minimum length of a sequence of operations that generate <b>y </b>from <b>x </b>such that <it>y</it>
<sub>1 </sub>is generated by <it>x</it>
<sub>
<it>i</it>
</sub>. We prove by induction on |<b>y</b>| that <it>d</it>(<b>x</b>, <b>y</b>) = <it>OPT</it>(<b>x</b>, <b>y</b>) and <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) = <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>).</p>
<p>For |<b>y</b>| = 1, since we assume there is at least one <it>i </it>for which <it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>, <it>OPT </it>(<b>x</b>, <b>y</b>) = <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) = 1. By definition, the recurrence also evaluates to 1. For the inductive step, assume that <it>OPT </it>(<b>x</b>, <b>y</b>') = <it>d</it>(<b>x</b>, <b>y</b>') and <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>') = <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y'</b>) for any string <b>y' </b>shorter than <b>y</b>. We first show that <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) &#8804; <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>). Since <it>OPT </it>(<b>x</b>, <b>y</b>) = min<sub>
<it>i </it>
</sub>
<it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>), this also implies <it>OPT </it>(<b>x</b>, <b>y</b>) &#8804; <it>d</it>(<b>x</b>, <b>y</b>). We describe different sequences of duplicate operations that generate <b>y </b>from <b>x</b>, using <it>x</it>
<sub>
<it>i </it>
</sub>to generate <it>y</it>
<sub>1</sub>:</p>
<p indent="1">&#8226; Consider a minimum-length sequence of duplicates that generates <b>y</b>
<sub>2,|<b>y</b>|</sub>. By the inductive hypothesis its length is <it>d</it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>). By duplicating <it>y</it>
<sub>1 </sub>separately using <it>x</it>
<sub>
<it>i </it>
</sub>we obtain a sequence of duplicates that generates <b>y </b>whose length is 1 + <it>d</it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>).</p>
<p indent="1">&#8226; For every {<it>j </it>: <it>y</it>
<sub>
<it>j </it>
</sub>= <it>x</it>
<sub>
<it>i</it>+1</sub>, <it>j </it>&gt; 1} consider a minimum-length sequence of duplicates that generates <b>y</b>
<sub>
<it>j</it>,|<b>y</b>| </sub>using <it>x</it>
<sub>
<it>i</it>+1 </sub>to produce <it>y</it>
<sub>
<it>j</it>
</sub>, and a minimum-length sequence of duplicates that generates <b>y</b>
<sub>2, <it>j</it>-1</sub>.</p>
<p>By the inductive hypothesis their lengths are <it>d</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) and <it>d</it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>) respectively. By extending the start index <it>s </it>of the duplicate operation that starts with <it>x</it>
<sub>
<it>i</it>+1 </sub>to produce <it>y</it>
<sub>
<it>j </it>
</sub>to start with <it>x</it>
<sub>
<it>i </it>
</sub>and produce <it>y</it>
<sub>1 </sub>as well, we produce <b>y </b>with the same number of duplicate operations.</p>
<p>Since <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) is at most the length of any of these options, it is also at most their minimum. Hence,</p>
<p>
<display-formula>
<graphic file="1748-7188-5-11-i13.gif"/>
</display-formula>
</p>
<p>To show the other direction (i.e., that <it>d</it>(<b>x</b>, <b>y</b>) &#8804; <it>OPT </it>(<b>x</b>, <b>y</b>) and <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) &#8804; <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>)), consider a minimum-length sequence of duplicate operations that generates <b>y </b>from <b>x</b>, using <it>x</it>
<sub>
<it>i </it>
</sub>to generate <it>y</it>
<sub>1</sub>. There are a few cases:</p>
<p indent="1">&#8226; If <it>y</it>
<sub>1 </sub>is generated by a duplicate operation that only duplicates <it>x</it>
<sub>
<it>i</it>
</sub>, then <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) = 1 + <it>OPT </it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>). By the inductive hypothesis this equals 1 + <it>d</it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>) which is at least <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>).</p>
<p indent="1">&#8226; Otherwise, <it>y</it>
<sub>1 </sub>is generated by a duplicate operation that copies <it>x</it>
<sub>
<it>i </it>
</sub>and also duplicates <it>x</it>
<sub>
<it>i</it>+1 </sub>to generate some character <it>y</it>
<sub>
<it>j </it>
</sub>. In this case the sequence &#916; of duplicates that generates <b>y</b>
<sub>2, <it>j</it>-1 </sub>must appear after the duplicate operation that generates <it>y</it>
<sub>1 </sub>and <it>y</it>
<sub>
<it>j </it>
</sub>because <b>y</b>
<sub>2, <it>j</it>-1 </sub>is inside (Definition 3) of (<it>y</it>
<sub>1</sub>, <it>y</it>
<sub>
<it>j</it>
</sub>). Without loss of generality, suppose &#916; is ordered after all the other duplicates so that first <it>y</it>
<sub>1</sub>
<it>y</it>
<sub>
<it>j </it>
</sub>. . . <it>y</it>
<sub>|<b>y</b>| </sub>is generated, and then &#916; generates <it>y</it>
<sub>2 </sub>. . . <it>y</it>
<sub>
<it>j</it>-1 </sub>between <it>y</it>
<sub>1 </sub>and <it>y</it>
<sub>
<it>j </it>
</sub>. Hence, <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) = <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <it>y</it>
<sub>1</sub>
<b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) + <it>OPT </it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>). Since in the optimal sequence <it>x</it>
<sub>
<it>i </it>
</sub>generates <it>y</it>
<sub>1 </sub>in the same duplicate operation that generates <it>y</it>
<sub>
<it>j </it>
</sub>from <it>x</it>
<sub>
<it>i</it>+1</sub>, we have <it>OPT</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <it>y</it>
<sub>1</sub>
<b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) = <it>OPT</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>). By the inductive hypothesis, <it>OPT </it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>) + <it>OPT</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) = <it>d</it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>) + <it>d</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) which is at least <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>).&#160;&#160;&#160;&#9633;</p>
<p>This recurrence naturally translates into a dynamic programming algorithm that computes the values of <it>d</it>(<b>x</b>, <it>&#183;</it>) and <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <it>&#183;</it>) for various target strings. To analyze the running time of this algorithm, note that both <b>y</b>
<sub>2, <it>j </it>
</sub>and <b>y</b>
<sub>
<it>j</it>,|<b>y</b>| </sub>are substrings of <b>y</b>. Since the set of substrings of <b>y </b>is closed under taking substrings, we only encounter substrings of <b>y</b>. Also note that since <it>i </it>is chosen from the set {<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>}, there are <it>O</it>(<it>&#956;</it>(<b>x</b>)) choices for <it>i</it>, where <it>&#956;</it>(<b>x</b>) is the maximal multiplicity of a character in <b>x</b>. Thus, there are <it>O</it>(<it>&#956;</it>(<b>x</b>)|<b>y</b>|<sup>2</sup>) different values to compute. Each value is computed by considering the minimization over at most <it>&#956;</it>(<b>y</b>) previously computed values, so the total running time is bounded by <it>O</it>(|<b>y</b>|<sup>2</sup>
<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>)), which is <it>O</it>(|<b>y</b>|<sup>3</sup>|<b>x</b>|) in the worst case. As with most dynamic programming approaches, this algorithm (and all others presented in subsequent sections) can be extended through trace-back to reconstruct the optimal sequence of operations needed to build <b>y</b>. We omit the details.</p>
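<p>As an illustration only, the recurrence can be sketched in Python with memoization. The sketch is reconstructed from the two cases used in the proof of Theorem 1: either <it>x</it><sub><it>i</it></sub> is the last character of its duplicate, or the same duplicate continues with <it>x</it><sub><it>i</it>+1</sub>. The function name duplication_distance, the 0-based half-open indexing of substrings of <b>y</b>, and the convention of returning infinity when some character of <b>y</b> does not occur in <b>x</b> are assumptions of this sketch.</p>

```python
from functools import lru_cache

INF = float("inf")


def duplication_distance(x: str, y: str) -> float:
    """Unit-cost duplication distance from source x to target y (sketch)."""

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        # Distance to the substring y[a:b]; the empty string costs nothing.
        if a == b:
            return 0
        # Minimize over every source position i whose character matches y[a].
        return min(
            (d_i(i, a, b) for i in range(len(x)) if x[i] == y[a]),
            default=INF,
        )

    @lru_cache(maxsize=None)
    def d_i(i: int, a: int, b: int) -> float:
        # Distance to y[a:b] given that y[a] is generated by x[i].
        # Case 1: the duplicate that produces y[a] ends at x[i].
        best = 1 + d(a + 1, b)
        # Case 2: the same duplicate continues with x[i+1], which produces
        # some later y[j]; y[a+1:j] is generated separately in between.
        if i + 1 != len(x):
            for j in range(a + 1, b):
                if y[j] == x[i + 1]:
                    best = min(best, d(a + 1, j) + d_i(i + 1, j, b))
        return best

    return d(0, len(y))
```

<p>Under these assumptions, duplication_distance("ab", "aabb") evaluates to 2: one duplicate produces "ab", and a second duplicate pastes "ab" between its two characters.</p>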
<p>
<b>Extending to Affine Duplication Cost</b>
</p>
<p>It is easy to extend the recurrence relations in Eqs. (1), (2) to handle costs for duplicate operations. In the above discussion, the cost of each duplicate operation is 1, so the sum of costs of the operations in a sequence that generates a string <b>y </b>is just the length of that sequence. We next consider a more general cost model for duplication in which the cost of a duplicate operation <it>&#948;</it>
<sub>
<it>x</it>
</sub>(<it>s, t, p</it>) is &#916;<sub>1 </sub>+ (<it>t - s </it>+ 1) &#916;<sub>2 </sub>(i.e., the cost is affine in the number of duplicated characters). Here &#916;<sub>1</sub>, &#916;<sub>2 </sub>are some non-negative constants. This extension is obtained by assigning a cost of &#916;<sub>2 </sub>to each duplicated character, except for the last character in the duplicated string, which is assigned a cost of &#916;<sub>1 </sub>+ &#916;<sub>2</sub>. We do that by adding a cost term to each of the cases in Eq. 2. If <it>x</it>
<sub>
<it>i</it>
</sub>is the last character in the duplicated string (case 1), we add &#916;<sub>1 </sub>+ &#916;<sub>2 </sub>to the cost. Otherwise <it>x</it>
<sub>
<it>i </it>
</sub>is not the last duplicated character (case 2), so we add just &#916;<sub>2 </sub>to the cost. Eq. (2) thus becomes</p>
<p>
<display-formula id="M3">
<graphic file="1748-7188-5-11-i14.gif"/>
</display-formula>
</p>
<p>The running time analysis for this recurrence is the same as for the one with unit duplication cost.</p>
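<p>The per-character accounting described above can be checked with a small Python sketch. It compares the affine cost of one duplicate operation with the sum of the per-character charges (&#916;<sub>2 </sub>for every duplicated character except the last, &#916;<sub>1 </sub>+ &#916;<sub>2 </sub>for the last). The function names and the 1-based inclusive indices are assumptions made for illustration.</p>

```python
def duplicate_cost(s: int, t: int, delta1: float, delta2: float) -> float:
    """Affine cost of duplicating x_s ... x_t (1-based, inclusive)."""
    return delta1 + (t - s + 1) * delta2


def per_character_cost(s: int, t: int, delta1: float, delta2: float) -> float:
    """The same cost, charged character by character as in the recurrence."""
    total = 0
    for position in range(s, t + 1):
        if position == t:
            total += delta1 + delta2  # last duplicated character (case 1)
        else:
            total += delta2  # any earlier duplicated character (case 2)
    return total
```

<p>Both functions return the same value for every valid pair <it>s </it>&#8804; <it>t</it>, which is why the modified recurrence charges exactly the affine cost of each operation.</p>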
</sec>
<sec>
<st>
<p>Duplication-Deletion Distance</p>
</st>
<p>In this section we generalize the model to include deletions. Consider the intermediate string <b>z </b>generated after some number of duplicate operations. A deletion operation removes a contiguous substring <it>z</it>
<sub>
<it>i</it>
</sub>, . . ., <it>z</it>
<sub>
<it>j </it>
</sub>of <b>z</b>, and subsequent duplicate and deletion operations are applied to the resulting string.</p>
<p>
<b>Definition 6</b>. <it>A <b>delete operation</b>, &#964; </it>(<it>s, t</it>), <it>deletes a substring z</it>
<sub>
<it>s </it>
</sub>. . . <it>z</it>
<sub>
<it>t </it>
</sub>
<it>of the target string <b>z</b>, thus making <b>z </b>shorter. Specifically, if <b>z </b>
</it>= <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>s </it>
</sub>. . . <it>z</it>
<sub>
<it>t </it>
</sub>. . . <it>z</it>
<sub>
<it>m</it>
</sub>, <it>then <b>z </b>
</it>&#8728; <it>&#964; </it>(<it>s, t</it>) = <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>s</it>-1</sub>
<it>z</it>
<sub>
<it>t</it>+1 </sub>. . . <it>z</it>
<sub>
<it>m</it>
</sub>. <it>See Figure </it>
<figr fid="F6">6</figr>.</p>
<fig id="F6"><title><p>Figure 6</p></title><caption><p>A delete operation</p></caption><text>
   <p><b>A delete operation</b>. A delete operation, denoted <it>&#964; </it>(<it>s, t</it>), in which the substring <b>z</b><sub><it>s</it>, <it>t </it></sub>is deleted.</p>
</text><graphic file="1748-7188-5-11-6"/></fig>
<p>The cost associated with <it>&#964; </it>(<it>s, t</it>) depends on the number <it>t - s </it>+ 1 of characters deleted and is denoted &#934;(<it>t - s </it>+ 1).</p>
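<p>A minimal Python sketch of Definition 6 follows. Strings are plain Python strings, indices are 1-based and inclusive as in the definition, and the names apply_delete and delete_cost, as well as the cost function passed as phi, are hypothetical.</p>

```python
from typing import Callable


def apply_delete(z: str, s: int, t: int) -> str:
    """Return z with the substring z_s ... z_t removed (1-based, inclusive)."""
    if not (s >= 1 and t >= s and len(z) >= t):
        raise ValueError("indices out of range")
    return z[:s - 1] + z[t:]


def delete_cost(s: int, t: int, phi: Callable[[int], float]) -> float:
    """Cost of the deletion, which depends only on the number of characters removed."""
    return phi(t - s + 1)
```

<p>For instance, apply_delete("abcdef", 2, 4) removes "bcd" and returns "aef".</p>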
<p>
<b>Definition 7</b>. <it>The <b>duplication-deletion </b>distance from a source string <b>x </b>to a target string <b>y </b>is the cost of a minimum sequence of duplicate operations from <b>x </b>and deletion operations, in any order, that generates <b>y</b>
</it>.</p>
<p>We now show that, although we allow arbitrary deletions from the intermediate string, it suffices to consider deletions from the duplicated strings before they are pasted into the intermediate string, provided that the cost function for deletion, &#934;(&#183;), is non-decreasing and obeys the triangle inequality.</p>
<p>
<b>Definition 8</b>. <it>A <b>duplicate-delete </b>operation from <b>x</b>, &#951;</it>
<sub>
<it>x</it>
</sub>(<it>i</it>
<sub>1</sub>, <it>j</it>
<sub>1</sub>, <it>i</it>
<sub>2</sub>, <it>j</it>
<sub>2</sub>,. . ., <it>i</it>
<sub>
<it>k</it>
</sub>, <it>j</it>
<sub>
<it>k</it>
</sub>, <it>p</it>), <it>for i</it>
<sub>1 </sub>&#8804; <it>j</it>
<sub>1 </sub>&lt;<it>i</it>
<sub>2 </sub>&#8804; <it>j</it>
<sub>2 </sub>&lt; &#8943; &lt;<it>i</it>
<sub>
<it>k </it>
</sub>&#8804; <it>j</it>
<sub>
<it>k </it>
</sub>
<it>copies the subsequence </it>
<inline-formula>
<graphic file="1748-7188-5-11-i15.gif"/>
</inline-formula>
<it>of the source string <b>x </b>and pastes it into a target string at position p. Specifically, if <b>x </b>
</it>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>m </it>
</sub>
<it>and <b>z </b>
</it>= <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>n</it>
</sub>, <it>then <b>z </b>&#8728; &#951;</it>
<sub>
<it>x</it>
</sub>(<it>i</it>
<sub>1</sub>, <it>j</it>
<sub>1</sub>, . . ., <it>i</it>
<sub>
<it>k</it>
</sub>, <it>j</it>
<sub>
<it>k</it>
</sub>, <it>p</it>) = <inline-formula>
<graphic file="1748-7188-5-11-i16.gif"/>
</inline-formula>.</p>
<p>The cost associated with such a duplicate-delete operation is &#916;<sub>1 </sub>+ (<it>j</it>
<sub>
<it>k </it>
</sub>- <it>i</it>
<sub>1 </sub>+ 1)&#916;<sub>2 </sub>+ <inline-formula>
<graphic file="1748-7188-5-11-i17.gif"/>
</inline-formula>. The first two terms in the cost reflect the affine cost of duplicating an entire substring of length <it>j</it>
<sub>
<it>k </it>
</sub>- <it>i</it>
<sub>1 </sub>+ 1, and the third term reflects the cost of the deletions made to that substring.</p>
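<p>The following Python sketch illustrates Definition 8 and its cost. Two points are assumptions, because the corresponding formulas are not reproduced in the text: the copied subsequence is pasted so that it begins at position <it>p </it>of the result, and the deletion term charges &#934; once for each non-empty gap between consecutive intervals. The function names are hypothetical.</p>

```python
from typing import Callable, List, Tuple


def duplicate_delete(x: str, z: str, intervals: List[Tuple[int, int]], p: int) -> str:
    """Paste x[i1..j1] x[i2..j2] ... x[ik..jk] into z, starting at position p.

    All indices are 1-based and inclusive; the intervals are assumed to be
    increasing and pairwise disjoint, as required by Definition 8.
    """
    copied = "".join(x[i - 1:j] for i, j in intervals)
    return z[:p - 1] + copied + z[p - 1:]


def duplicate_delete_cost(
    intervals: List[Tuple[int, int]],
    delta1: float,
    delta2: float,
    phi: Callable[[int], float],
) -> float:
    """Affine cost of the whole span plus the deletion cost of each gap."""
    span = intervals[-1][1] - intervals[0][0] + 1
    gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(intervals, intervals[1:])]
    return delta1 + span * delta2 + sum(phi(g) for g in gaps if g > 0)
```

<p>With <b>x </b>= "abcdef" and intervals (1, 2) and (5, 6), the operation copies "abef"; the span has length 6 and the single gap "cd" has length 2.</p>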
<p>
<b>Lemma 2</b>. <it>If the affine cost for duplications is non-decreasing and </it>&#934; (&#183;) <it>is non-decreasing and obeys the triangle inequality then the cost of a minimum sequence of duplicate and delete operations that generates a target string <b>y </b>from a source string <b>x </b>is equal to the cost of a minimum sequence of duplicate-delete operations that generates <b>y </b>from <b>x</b>
</it>.</p>
<p>
<it>Proof</it>. Since duplicate operations are a special case of duplicate-delete operations, the cost of a minimal sequence of duplicate-delete operations and delete operations that generates <b>y </b>cannot be more than that of a sequence of just duplicate operations and delete operations. We show the (stronger) claim that an arbitrary sequence of duplicate-delete and delete operations that produces a string <b>y </b>with cost <it>c </it>can be transformed into a sequence of just duplicate-delete operations that generates <b>y </b>with cost at most <it>c </it>by induction on the number of delete operations. The base case, where the number of deletions is zero, is trivial. Consider the first delete operation, <it>&#964; </it>. Let <it>k </it>denote the number of duplicate-delete operations that precede <it>&#964;</it>, and let <b>z </b>be the intermediate string produced by these <it>k </it>operations. For <it>i </it>= 1, . . ., <it>k</it>, let <it>S</it>
<sub>
<it>i </it>
</sub>be the subsequence of <b>x </b>that was used in the <it>i</it>th duplicate-delete operation. By Lemma 1, <it>S</it>
<sub>1</sub>, . . ., <it>S</it>
<sub>
<it>k </it>
</sub>form a partition of <b>z </b>into disjoint, non-overlapping subsequences of <b>z</b>. Let <it>d </it>denote the substring of <b>z </b>to be deleted. Since <it>d </it>is a contiguous substring, <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>is a (possibly empty) substring of <it>S</it>
<sub>
<it>i </it>
</sub>for each <it>i</it>. There are several cases:</p>
<p>1. <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>= &#8709;. In this case we do not change any operation.</p>
<p>2. <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>= <it>S</it>
<sub>
<it>i</it>
</sub>. In this case all characters produced by the <it>i</it>th duplicate-delete operation are deleted, so we may omit the <it>i</it>th operation altogether and decrease the number of characters deleted by <it>&#964; </it>. Since &#934; (&#183;) is non-decreasing, this does not increase the cost of generating <b>z </b>(and hence <b>y</b>).</p>
<p>3. <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>is a prefix (or suffix) of <it>S</it>
<sub>
<it>i</it>
</sub>. Assume it is a prefix. The case of suffix is similar. Instead of deleting the characters <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>we can avoid generating them in the first place. Let <it>r </it>be the smallest index in <it>S</it>
<sub>
<it>i</it>
</sub>\<it>d </it>(that is, the first character in <it>S</it>
<sub>
<it>i </it>
</sub>that is not deleted by <it>&#964;</it>). We change the <it>i</it>th duplicate-delete operation to start at <it>r </it>and decrease the number of characters deleted by <it>&#964; </it>. Since the affine cost for duplications is non-decreasing and &#934; (&#183;) is non-decreasing, the cost of generating <b>z </b>does not increase.</p>
<p>4. <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>is a non-empty substring of <it>S</it>
<sub>
<it>i </it>
</sub>that is neither a prefix nor a suffix of <it>S</it>
<sub>
<it>i</it>
</sub>. We claim that this case applies to at most one value of <it>i</it>. This implies that after taking care of all the other cases <it>&#964; </it>only deletes characters in <it>S</it>
<sub>
<it>i</it>
</sub>. We then change the <it>i</it>th duplicate-delete operation to also delete the characters deleted by <it>&#964;</it>, and omit <it>&#964; </it>. Since &#934; (&#183;) obeys the triangle inequality, this will not increase the total cost of deletion. By the inductive hypothesis, the rest of <b>y </b>can be generated by just duplicate-delete operations with at most the same cost. It remains to prove the claim. Recall that the set {<it>S</it>
<sub>
<it>i</it>
</sub>} is comprised of mutually non-overlapping subsequences of <b>z</b>. Suppose that there exist indices <it>i </it>&#8800; <it>j </it>such that <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>is a non-prefix/suffix substring of <it>S</it>
<sub>
<it>i </it>
</sub>and <it>S</it>
<sub>
<it>j </it>
</sub>&#8745; <it>d </it>is a non-prefix/suffix substring of <it>S</it>
<sub>
<it>j </it>
</sub>. There must exist indices of both <it>S</it>
<sub>
<it>i </it>
</sub>and <it>S</it>
<sub>
<it>j </it>
</sub>in <b>z </b>that precede <it>d</it>, are contained in <it>d</it>, and succeed <it>d</it>. Let <it>i</it>
<sub>
<it>p </it>
</sub>&lt;<it>i</it>
<sub>
<it>c </it>
</sub>&lt;<it>i</it>
<sub>
<it>s </it>
</sub>be three such indices of <it>S</it>
<sub>
<it>i </it>
</sub>and let <it>j</it>
<sub>
<it>p </it>
</sub>&lt;<it>j</it>
<sub>
<it>c </it>
</sub>&lt;<it>j</it>
<sub>
<it>s </it>
</sub>be similar for <it>S</it>
<sub>
<it>j </it>
</sub>. It must be the case also that <it>j</it>
<sub>
<it>p </it>
</sub>&lt;<it>i</it>
<sub>
<it>c </it>
</sub>&lt;<it>j</it>
<sub>
<it>s </it>
</sub>and <it>i</it>
<sub>
<it>p </it>
</sub>&lt;<it>j</it>
<sub>
<it>c </it>
</sub>&lt;<it>i</it>
<sub>
<it>s</it>
</sub>. Without loss of generality, suppose <it>i</it>
<sub>
<it>p </it>
</sub>&lt;<it>j</it>
<sub>
<it>p</it>
</sub>. It follows that (<it>i</it>
<sub>
<it>p</it>
</sub>, <it>i</it>
<sub>
<it>c</it>
</sub>) and (<it>j</it>
<sub>
<it>p</it>
</sub>, <it>j</it>
<sub>
<it>s</it>
</sub>) are alternating in <b>z</b>. So, <it>S</it>
<sub>
<it>i </it>
</sub>and <it>S</it>
<sub>
<it>j</it>
</sub>are overlapping which contradicts Lemma 1.</p>
<p>To extend the recurrence from the previous section to duplication-deletion distance, we must observe that because we allow deletions in the string that is duplicated from <b>x</b>, if we assume character <it>x</it>
<sub>
<it>i </it>
</sub>is copied to produce <it>y</it>
<sub>1</sub>, it may not be the case that the character <it>x</it>
<sub>
<it>i</it>+1 </sub>also appears in <b>y</b>; the character <it>x</it>
<sub>
<it>i</it>+1 </sub>may have been deleted. Therefore, we minimize over all possible locations <it>k </it>&gt;<it>i </it>for the next character in the duplicated string that is not deleted. The extension of the recurrence from the previous section to duplication-deletion distance is:</p>
<p>
<display-formula id="M4">
<graphic file="1748-7188-5-11-i18.gif"/>
</display-formula>
</p>
<p>
<display-formula id="M5">
<graphic file="1748-7188-5-11-i19.gif"/>
</display-formula>
</p>
<p>
<b>Theorem 2</b>. <inline-formula>
<graphic file="1748-7188-5-11-i20.gif"/>
</inline-formula>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-deletion distance from <b>x </b>to <b>y</b>. For </it>{<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>}, <inline-formula>
<graphic file="1748-7188-5-11-i21.gif"/>
</inline-formula>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-deletion distance from <b>x </b>to <b>y </b>under the additional restriction that y</it>
<sub>1</sub>
<it>is generated by x</it>
<sub>
<it>i</it>
</sub>.</p>
<p>The proof of Theorem 2 is almost identical to that of Theorem 1 in the previous section and is omitted. However, the running time increases: while the number of entries in the dynamic programming table does not change, the time to compute each entry is multiplied by the number of possible values of <it>k </it>in the recurrence, which is <it>O</it>(|<b>x</b>|). Therefore, the running time is <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>)), which is <it>O</it>(|<b>y</b>|<sup>3</sup>|<b>x</b>|<sup>2</sup>) in the worst case. We conclude this section by showing, in the following lemma, that if both duplicate and delete operations have unit cost (i.e., one per operation), then the duplication-deletion distance is equal to the duplication distance without deletions.</p>
<p>
<b>Lemma 3</b>. <it>Given a source string <b>x </b>and a target string <b>y</b>, if the cost of duplication is 1 per duplicate operation and the cost of deletion is 1 per delete operation, then </it>
<inline-formula>
<graphic file="1748-7188-5-11-i20.gif"/>
</inline-formula>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) = <it>d</it>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>).</p>
<p>
<it>Proof</it>. First we note that if a target string <b>y </b>can be built from <b>x </b>in <it>d</it>(<b>x</b>, <b>y</b>) duplicate operations, then the same sequence of duplicate operations is a valid sequence of duplicate and delete operations as well, so <it>d</it>(<b>x</b>, <b>y</b>) is at least <inline-formula>
<graphic file="1748-7188-5-11-i20.gif"/>
</inline-formula>(<b>x</b>, <b>y</b>).</p>
<p>We claim that every sequence of duplicate and delete operations can be transformed into a sequence of duplicate operations of at most the same length. The proof of this claim is similar to that of Lemma 2. In that proof we showed how to transform a sequence of duplicate and delete operations into a sequence of duplicate-delete operations of at most the same cost. We follow the same steps, but transform the sequence into a sequence that consists of just duplicate operations without increasing the number of operations. Recall the four cases in the proof of Lemma 2. In the first three cases we eliminate the delete operation without increasing the number of duplicate operations. Therefore we only need to consider the last case (<it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>is a non-empty substring of <it>S</it>
<sub>
<it>i </it>
</sub>that is neither a prefix nor a suffix of <it>S</it>
<sub>
<it>i</it>
</sub>). Recall that this case applies to at most one value of <it>i</it>. Deleting <it>S</it>
<sub>
<it>i </it>
</sub>&#8745; <it>d </it>from <it>S</it>
<sub>
<it>i </it>
</sub>leaves a prefix and a suffix of <it>S</it>
<sub>
<it>i</it>
</sub>. We can therefore replace the <it>i</it>
<sup>
<it>th </it>
</sup>duplicate operation and the delete operation with two duplicate operations, one generating the appropriate prefix of <it>S</it>
<sub>
<it>i </it>
</sub>and the other generating the appropriate suffix of <it>S</it>
<sub>
<it>i</it>
</sub>. This eliminates the delete operation without changing the number of operations in the sequence. Therefore, for any string <b>y </b>that results from a sequence of duplicate and delete operations, we can construct the same string using only duplicate operations (without deletes) using at most the same number of operations. So, <it>d</it>(<b>x</b>, <b>y</b>) is no greater than <inline-formula>
<graphic file="1748-7188-5-11-i20.gif"/>
</inline-formula>(<b>x</b>, <b>y</b>).</p>
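<p>The recurrence for duplication-deletion distance described in this section can be sketched in Python as follows. Because Eqs. (4) and (5) are not reproduced in the text, the sketch is a reconstruction from the prose and should be read as an assumption: moving from the kept character <it>x</it><sub><it>i </it></sub>to the next kept character <it>x</it><sub><it>k </it></sub>is charged (<it>k - i</it>)&#916;<sub>2 </sub>for the duplicated span plus &#934;(<it>k - i - </it>1) for the deleted gap (nothing if the gap is empty), and the last kept character is charged &#916;<sub>1 </sub>+ &#916;<sub>2</sub>. The function and parameter names are hypothetical.</p>

```python
from functools import lru_cache
from typing import Callable

INF = float("inf")


def duplication_deletion_distance(
    x: str, y: str, delta1: float, delta2: float, phi: Callable[[int], float]
) -> float:
    """Duplication-deletion distance via duplicate-delete operations (sketch)."""

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        # Distance to the substring y[a:b]; the empty string costs nothing.
        if a == b:
            return 0
        return min(
            (d_i(i, a, b) for i in range(len(x)) if x[i] == y[a]),
            default=INF,
        )

    @lru_cache(maxsize=None)
    def d_i(i: int, a: int, b: int) -> float:
        # Case 1: x[i] is the last character kept by this duplicate-delete.
        best = delta1 + delta2 + d(a + 1, b)
        # Case 2: the next kept character is x[k] for some k > i; the
        # k - i - 1 characters strictly between x[i] and x[k] are deleted.
        for k in range(i + 1, len(x)):
            gap = k - i - 1
            step = (k - i) * delta2 + (phi(gap) if gap > 0 else 0)
            for j in range(a + 1, b):
                if y[j] == x[k]:
                    best = min(best, step + d(a + 1, j) + d_i(k, j, b))
        return best

    return d(0, len(y))
```

<p>With &#916;<sub>1 </sub>= 1, &#916;<sub>2 </sub>= 0 and &#934; identically 1, the sketch uses the unit costs of Lemma 3; for <b>x </b>= "abcd" and <b>y </b>= "ad" it returns 2, which is also the number of duplicate operations needed without deletions.</p>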
</sec>
<sec>
<st>
<p>Duplication-Inversion Distance</p>
</st>
<p>In this section we extend the duplication distance recurrence to allow inversions. We now explicitly define characters and strings as having two orientations: forward (+) and inverse (-).</p>
<p>
<b>Definition 9</b>. <it>A <b>signed string </b>of length m over an alphabet </it>&#931; <it>is an element of </it>({+, <it>-</it>} <it>&#215; </it>&#931;)<sup>
<it>m</it>
</sup>.</p>
<p>For example, (+<it>b -c -a </it>+<it>d</it>) is a signed string of length 4. An inversion of a signed string reverses the order of the characters as well as their signs. Formally,</p>
<p>
<b>Definition 10</b>. <it>The <b>inverse </b>of a signed string <b>x </b>
</it>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>m </it>
</sub>
<it>is a signed string </it>
<inline-formula>
<graphic file="1748-7188-5-11-i22.gif"/>
</inline-formula> = <it>-x</it>
<sub>
<it>m </it>
</sub>. . . -<it>x</it>
<sub>1</sub>.</p>
<p>For example, the inverse of (+<it>b -c -a </it>+<it>d</it>) is (<it>-d </it>+<it>a </it>+<it>c -b</it>).</p>
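<p>Definitions 9 and 10 can be illustrated with a short Python sketch. Representing a signed string as a list of (sign, character) pairs is an assumption made here for readability; the function name inverse is likewise hypothetical.</p>

```python
from typing import List, Tuple

# A signed string as a list of (sign, character) pairs; sign is "+" or "-".
Signed = List[Tuple[str, str]]


def inverse(x: Signed) -> Signed:
    """Reverse the order of the characters and flip every sign."""
    flip = {"+": "-", "-": "+"}
    return [(flip[sign], char) for sign, char in reversed(x)]
```

<p>Applied to (+<it>b -c -a </it>+<it>d</it>), the function returns (<it>-d </it>+<it>a </it>+<it>c -b</it>), and applying it twice returns the original signed string.</p>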
<p>In a duplicate-invert operation a substring is copied from <b>x </b>and <it>inverted </it>before being inserted into the target string <b>y</b>. We allow the cost of inversion to be an affine function in the length &#8467; of the duplicated inverted string, which we denote &#920;<sub>1 </sub>+ &#8467;&#920;<sub>2</sub>, where &#920;<sub>1</sub>, &#920;<sub>2 </sub>&#8805; 0. We still allow for normal duplicate operations.</p>
<p>
<b>Definition 11</b>. <it>A <b>duplicate-invert operation </b>from <b>x</b>
</it>, <inline-formula>
<graphic file="1748-7188-5-11-i23.gif"/>
</inline-formula>(<it>s, t, p</it>), <it>copies an inverted substring -x</it>
<sub>
<it>t</it>
</sub>, -<it>x</it>
<sub>
<it>t</it>-1 </sub>, . . ., -<it>x</it>
<sub>
<it>s </it>
</sub>
<it>of the source string <b>x </b>and pastes it into a target string at position p. Specifically, if <b>x </b>
</it>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>m </it>
</sub>
<it>and <b>z </b>
</it>= <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>n</it>
</sub>, <it>then <b>z </b>
</it>&#8728; <inline-formula>
<graphic file="1748-7188-5-11-i23.gif"/>
</inline-formula>(<it>s, t, p</it>) = <inline-formula>
<graphic file="1748-7188-5-11-i24.gif"/>
</inline-formula>.</p>
<p>The cost associated with each duplicate-invert operation is &#920;<sub>1</sub>+ (<it>t </it>- <it>s </it>+ 1)&#920;<sub>2</sub>.</p>
<p>
<b>Definition 12</b>. <it>The <b>duplication-inversion distance </b>from a source string <b>x </b>to a target string <b>y </b>is the cost of a minimum sequence of duplicate and duplicate-invert operations from <b>x</b>, in any order, that generates <b>y</b>
</it>.</p>
<p>The recurrence for duplication distance (Eqs. 1, 3) can be extended to compute the duplication-inversion distance. This is done by introducing a term for inverted duplications whose form is very similar to that of the term for regular duplication (Eq. 3). Specifically, when considering the possible characters to generate <it>y</it>
<sub>1</sub>, we consider characters in <b>x </b>that match either <it>y</it>
<sub>1 </sub>or its inverse, -<it>y</it>
<sub>1</sub>. In the former case, we use <inline-formula>
<graphic file="1748-7188-5-11-i25.gif"/>
</inline-formula>(<b>x</b>, <b>y</b>) to denote the duplication-inversion distance with the additional restriction that <it>y</it>
<sub>1 </sub>is generated by <it>x</it>
<sub>
<it>i </it>
</sub>without an inversion. The recurrence for <inline-formula>
<graphic file="1748-7188-5-11-i25.gif"/>
</inline-formula> is the same as for <it>d</it>
<sub>
<it>i </it>
</sub>in Eq. 3. In the latter case, we consider an inverted duplicate in which <it>y</it>
<sub>1 </sub>is generated by -<it>x</it>
<sub>
<it>i</it>
</sub>. This is denoted by <inline-formula>
<graphic file="1748-7188-5-11-i26.gif"/>
</inline-formula>, which follows a similar recurrence. In this recurrence, since an inversion occurs, <it>x</it>
<sub>
<it>i </it>
</sub>is the <it>last </it>character of the duplicated string, rather than the first one. Therefore, the next character in <b>x </b>to be used in this operation is <it>-x</it>
<sub>
<it>i</it>-1 </sub>rather than <it>x</it>
<sub>
<it>i</it>+1</sub>. The recurrence for <inline-formula>
<graphic file="1748-7188-5-11-i26.gif"/>
</inline-formula> also differs in the cost term, where we use the affine cost of the duplicate-invert operation. The extension of the recurrence to duplication-inversion distance is therefore:</p>
<p>
<display-formula id="M6">
<graphic file="1748-7188-5-11-i27.gif"/>
</display-formula>
</p>
<p>
<b>Theorem 3</b>. <inline-formula>
<graphic file="1748-7188-5-11-i28.gif"/>
</inline-formula>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion distance from <b>x </b>to <b>y</b>. For </it>{<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>}, <inline-formula>
<graphic file="1748-7188-5-11-i25.gif"/>
</inline-formula> (<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion distance from <b>x </b>to <b>y </b>under the additional restriction that y</it>
<sub>1 </sub>
<it>is generated by x</it>
<sub>
<it>i</it>
</sub>. <it>For </it>{<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>-y</it>
<sub>1</sub>}, <inline-formula>
<graphic file="1748-7188-5-11-i26.gif"/>
</inline-formula> (<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion distance from <b>x </b>to <b>y </b>under the additional restriction that y</it>
<sub>1</sub>
<it>is generated by -x</it>
<sub>
<it>i</it>
</sub>.</p>
<p>The correctness proof is very similar to that of Theorem 1; it requires only one additional case, for a duplicate-invert operation, which is symmetric to the case of regular duplication. The asymptotic running time of the corresponding dynamic programming algorithm is <it>O</it>(|<b>y</b>|<sup>2</sup>
<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>)). The analysis is identical to the one in section 3. The fact that we now consider either a duplicate or a duplicate-invert operation does not change the asymptotic running time.</p>
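<p>A possible Python sketch of the duplication-inversion recurrence is given below. Since Eq. (6) is not reproduced in the text, the sketch is reconstructed from the description above and is an assumption: the forward term follows the affine recurrence with &#916;<sub>1</sub>, &#916;<sub>2</sub>, and the inverted term walks backwards through <b>x </b>(from <it>x</it><sub><it>i </it></sub>to <it>x</it><sub><it>i</it>-1</sub>) with costs &#920;<sub>1</sub>, &#920;<sub>2</sub>. Signed characters are encoded as non-zero integers whose sign gives the orientation, which is a representation chosen only for this sketch.</p>

```python
from functools import lru_cache
from typing import Sequence

INF = float("inf")


def duplication_inversion_distance(
    x: Sequence[int], y: Sequence[int],
    delta1: float, delta2: float, theta1: float, theta2: float,
) -> float:
    """Duplication-inversion distance for signed strings of non-zero ints (sketch)."""
    m = len(x)

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        if a == b:
            return 0
        # y[a] may come from a matching x[i] (forward) or from -x[i] (inverted).
        forward = [d_fwd(i, a, b) for i in range(m) if x[i] == y[a]]
        inverted = [d_inv(i, a, b) for i in range(m) if x[i] == -y[a]]
        return min(forward + inverted, default=INF)

    @lru_cache(maxsize=None)
    def d_fwd(i: int, a: int, b: int) -> float:
        # Regular duplicate: either it ends at x[i] or continues with x[i+1].
        best = delta1 + delta2 + d(a + 1, b)
        if i + 1 != m:
            for j in range(a + 1, b):
                if y[j] == x[i + 1]:
                    best = min(best, delta2 + d(a + 1, j) + d_fwd(i + 1, j, b))
        return best

    @lru_cache(maxsize=None)
    def d_inv(i: int, a: int, b: int) -> float:
        # Inverted duplicate: the next source character used is -x[i-1].
        best = theta1 + theta2 + d(a + 1, b)
        if i != 0:
            for j in range(a + 1, b):
                if y[j] == -x[i - 1]:
                    best = min(best, theta2 + d(a + 1, j) + d_inv(i - 1, j, b))
        return best

    return d(0, len(y))
```

<p>For <b>x </b>= (1, 2, 3) and <b>y </b>= (-3, -2, -1), a single duplicate-invert operation of length 3 suffices, so with &#920;<sub>1 </sub>= 4 and &#920;<sub>2 </sub>= 1 the sketch returns 4 + 3 = 7.</p>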
</sec>
<sec>
<st>
<p>Duplication-Inversion-Deletion Distance</p>
</st>
<p>In this section we extend the distance measure to include delete operations as well as duplicate and duplicate-invert operations. Note that we only handle deletions after inversions of the same substring. The order of operations might be important, at least in terms of costs: the cost of inverting (+<it>a </it>+<it>b </it>+<it>c</it>) and then deleting <it>-b </it>may differ from the cost of first deleting +<it>b </it>from (+<it>a </it>+<it>b </it>+<it>c</it>) and then inverting (+<it>a </it>+<it>c</it>).</p>
<p>
<b>Definition 13</b>. <it>The <b>duplication-inversion-deletion distance </b>from a source string <b>x </b>to a target string <b>y </b>is the cost of a minimum sequence of duplicate and duplicate-invert operations from <b>x </b>and deletion operations, in any order, that generates <b>y</b>
</it>.</p>
<p>
<b>Definition 14</b>. <it>A <b>duplicate-invert-delete </b>operation from <b>x</b>
</it>,
<inline-formula>
<graphic file="1748-7188-5-11-i29.gif"/>
</inline-formula>(<it>i</it>
<sub>1</sub>, <it>j</it>
<sub>1</sub>, <it>i</it>
<sub>2</sub>, <it>j</it>
<sub>2</sub>, . . ., <it>i</it>
<sub>
<it>k</it>
</sub>, <it>j</it>
<sub>
<it>k</it>
</sub>, <it>p</it>), <it>for i</it>
<sub>1 </sub>&#8804; <it>j</it>
<sub>1 </sub>&lt;<it>i</it>
<sub>2 </sub>&#8804; <it>j</it>
<sub>2 </sub>
<it>&lt;</it>&#8943; &lt;<it>i</it>
<sub>
<it>k</it>
</sub>&#8804; <it>j</it>
<sub>
<it>k </it>
</sub>
<it>pastes the string </it>
<inline-formula>
<graphic file="1748-7188-5-11-i30.gif"/>
</inline-formula>
<it>into a target string at position p. Specifically, if <b>x </b>
</it>= <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>m </it>
</sub>
<it>and <b>z </b>
</it>= <it>z</it>
<sub>1 </sub>. . . <it>z</it>
<sub>
<it>n</it>
</sub>, <it>then <b>z </b>
</it>&#8728; <inline-formula>
<graphic file="1748-7188-5-11-i29.gif"/>
</inline-formula>(<it>i</it>
<sub>1</sub>, <it>j</it>
<sub>1</sub>, <it>i</it>
<sub>2</sub>, <it>j</it>
<sub>2</sub>, . . ., <it>i</it>
<sub>
<it>k</it>
</sub>, <it>j</it>
<sub>
<it>k</it>
</sub>, <it>p</it>) = <inline-formula>
<graphic file="1748-7188-5-11-i31.gif"/>
</inline-formula>.</p>
<p>The cost of such an operation is &#920;<sub>1 </sub>+ (<it>j</it>
<sub>
<it>k </it>
</sub>- <it>i</it>
<sub>1 </sub>+ 1)&#920;<sub>2 </sub>+ <inline-formula>
<graphic file="1748-7188-5-11-i17.gif"/>
</inline-formula>. As in the duplication-deletion case (Lemma 2), it suffices to consider just duplicate-invert-delete and duplicate-delete operations, rather than duplicate, duplicate-invert and delete operations.</p>
<p>
<b>Lemma 4</b>. <it>If </it>&#934; (&#183;) <it>is non-decreasing and obeys the triangle inequality and if the cost of inversion is an affine non-decreasing function as defined above, then the cost of a minimum sequence of duplicate, duplicate-invert and delete operations that generates a target string <b>y </b>from a source string <b>x </b>is equal to the cost of a minimum sequence of duplicate-delete and duplicate-invert-delete operations that generates <b>y </b>from <b>x</b>
</it>.</p>
<p>The proof of the lemma is essentially the same as that of Lemma 2. Note that in that proof we did not require all duplicate operations to be from the same string <b>x</b>. Therefore, the arguments in that proof apply to our case, where we can regard some of the duplicates as copied from <b>x </b>and the others as copied from the inverse of <b>x</b>.</p>
<p>The recurrence for duplication-inversion-deletion distance is obtained by combining the recurrences for duplication-deletion (Eq. 5) and for duplication-inversion distance (Eq. 6). We use separate terms for duplicate-delete operations (<inline-formula>
<graphic file="1748-7188-5-11-i32.gif"/>
</inline-formula>) and for duplicate-invert-delete operations (<inline-formula>
<graphic file="1748-7188-5-11-i33.gif"/>
</inline-formula>). Those terms differ from the terms in Eq. 6 in the same way that Eq. 5 differs from Eq. 2: because of the possible deletion, we do not know that <it>x</it>
<sub>
<it>i</it>+1 </sub>(<it>x</it>
<sub>
<it>i</it>-1</sub>) is the next duplicated character. Instead, we minimize over all characters later (earlier) than <it>x</it>
<sub>
<it>i</it>
</sub>.</p>
<p>The recurrence for duplication-inversion-deletion distance is therefore:</p>
<p>
<display-formula>
<graphic file="1748-7188-5-11-i34.gif"/>
</display-formula>
</p>
<p>
<b>Theorem 4</b>. <inline-formula>
<graphic file="1748-7188-5-11-i35.gif"/>
</inline-formula>(<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion-deletion distance from <b>x </b>to <b>y</b>
</it>. <it>For </it>{<it>i </it>:<it>x</it>
<sub>
<it>i </it>
</sub>= <it>y</it>
<sub>1</sub>}, <inline-formula>
<graphic file="1748-7188-5-11-i32.gif"/>
</inline-formula> (<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion-deletion distance from <b>x </b>to <b>y </b>under the additional restriction that y</it>
<sub>1 </sub>
<it>is generated by x</it>
<sub>
<it>i</it>
</sub>. <it>For </it>{<it>i </it>: <it>x</it>
<sub>
<it>i </it>
</sub>= <it>-y</it>
<sub>1</sub>}, <inline-formula>
<graphic file="1748-7188-5-11-i33.gif"/>
</inline-formula> (<b>
<it>x</it>
</b>, <b>
<it>y</it>
</b>) <it>is the duplication-inversion-deletion distance from <b>x </b>to <b>y </b>under the additional restriction that y</it>
<sub>1</sub>
<it>is generated by -x</it>
<sub>
<it>i</it>
</sub>.</p>
<p>The proof, again, is very similar to the proofs in the previous sections. The running time of the corresponding dynamic programming algorithm is the same (asymptotically) as that of duplication-deletion distance. It is <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<it>&#956;</it>(<b>y</b>)<it>&#956;</it>(<b>x</b>)), where the multiplicity <it>&#956;</it>(<b>y</b>) (or <it>&#956;</it>(<b>x</b>)) is the number of times a character appears in the string <b>y </b>(or <b>x</b>), regardless of its sign.</p>
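<p>To make the bound concrete, the sketch below computes a multiplicity and the product inside the <it>O</it>(&#183;) expression. It assumes that the multiplicity is the maximum number of occurrences of any single character, ignoring sign, and it encodes signed characters as nonzero integers. Both choices, like the function names, are illustrative assumptions rather than definitions taken from the text.</p>

```python
from collections import Counter
from typing import Sequence


def multiplicity(s: Sequence[int]) -> int:
    """Largest number of occurrences of one character, ignoring its sign.

    Assumption: characters are nonzero integers and -c is the inverse of c.
    """
    counts = Counter(abs(c) for c in s)
    return max(counts.values(), default=0)


def dp_work_bound(x: Sequence[int], y: Sequence[int]) -> int:
    """The product |y|^2 |x| mu(y) mu(x), without constant factors."""
    return len(y) ** 2 * len(x) * multiplicity(y) * multiplicity(x)


if __name__ == "__main__":
    x = [1, 2, 3, -1]
    y = [1, -1, 2, 3, -1]
    print(multiplicity(x), multiplicity(y), dp_work_bound(x, y))
```

<p>When every character occurs at most once in each string, both multiplicities equal 1 and the product reduces to |<b>y</b>|<sup>2</sup>|<b>x</b>|.</p>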
<p>In comparing the models of the previous section and the current one, we note that restricting the model of rearrangement to allow only duplicate and duplicate-invert operations (Section 5) instead of duplicate-invert-delete operations may be desirable from a biological perspective. Each duplicate and duplicate-invert operation requires only three breakpoints in the genome, whereas a duplicate-invert-delete operation can be significantly more complicated, requiring more breakpoints.</p>
</sec>
<sec>
<st>
<p>Variants of Duplication-Inversion-Deletion Distance</p>
</st>
<p>It is possible to extend the model even further. We give one detailed example that demonstrates how such extensions might be achieved; other extensions are also possible. In the previous section, we handled the model in which the duplicated substring of <b>x </b>may be inverted in its entirety before being inserted into the target string. In the generalized model, a substring of the duplicated string may be inverted before the string is inserted into <b>y</b>. For example, we allow (+<it>a </it>+<it>b </it>+<it>c </it>+<it>d </it>+<it>e </it>+<it>f</it>) to become (+<it>a </it>+<it>b -e -d -c </it>+<it>f</it>) before being inserted into <b>y</b>. In this model, the cost of duplicating a string of length <it>m </it>with an inversion of a substring of length &#8467; is &#916;<sub>1 </sub>+ <it>m</it>&#916;<sub>2 </sub>+ &#920; (&#8467;), for some non-negative monotonically increasing cost function &#920;.</p>
<p>The way we extend the recurrence is by considering all possible substring inversions to the original string <b>x</b>. For 1 &#8804; <it>s &#8804; t &#8804; |x|</it>, let <inline-formula>
<graphic file="1748-7188-5-11-i36.gif"/>
</inline-formula> be the string <it>x</it>
<sub>1 </sub>. . . <it>x</it>
<sub>
<it>s</it>-1 </sub>-<it>x</it>
<sub>
<it>t</it>
</sub>. . . -<it>x</it>
<sub>
<it>s </it>
</sub>
<it>x</it>
<sub>
<it>t</it>+1 </sub>. . . <it>x</it>
<sub>|<b>x</b>|</sub>, that is, the string obtained from <b>x </b>by inverting (in place) <b>x</b>
<sub>
<it>s</it>, <it>t</it>
</sub>. For convenience, define also <inline-formula>
<graphic file="1748-7188-5-11-i37.gif"/>
</inline-formula> = <b>x</b>. We will use <inline-formula>
<graphic file="1748-7188-5-11-i38.gif"/>
</inline-formula> (<b>x</b>, <b>y</b>) to denote the distance from <b>x </b>to <b>y </b>in this model under the additional restriction that <it>y</it>
<sub>1 </sub>is generated by <it>x</it>
<sub>
<it>i </it>
</sub>and that the substring <b>x</b>
<sub>
<it>s</it>, <it>t </it>
</sub>was inverted. Note that this restriction is meaningful only if <it>s </it>&#8804; <it>i </it>&#8804; <it>t</it>, since otherwise the inverted substring is not used in the duplication. However, requiring the inversion cost &#920; (&#8467;) to be non-negative and monotonically increasing ensures that the remaining cases do not contribute to the minimization, since inverting a character that is not duplicated only increases the cost. The recurrence for duplication-deletion with arbitrary-substring-duplicate-inversions distance is given below.</p>
<p>
<display-formula>
<graphic file="1748-7188-5-11-i39.gif"/>
</display-formula>
</p>
<p>The running time is <it>O</it>(|<b>y</b>|<sup>2</sup>|<b>x</b>|<sup>3</sup>
<it>&#956;</it>(<b>x</b>)<it>&#956;</it>(<b>y</b>)). The additional multiplicative |<b>x</b>|<sup>2 </sup>factor, compared with the running time of the previous section, arises from considering all possible inverted substrings of <b>x</b>. We note that if only inversions of a prefix or a suffix of the duplicated string are of interest, then the duplication-inversion-deletion recurrence can be extended without increasing the asymptotic running time.</p>
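<p>The in-place substring inversion and the source of the extra quadratic factor can be illustrated with a short Python sketch. It reproduces the (+a +b +c +d +e +f) example given earlier in this section and enumerates every pair (s, t). The representation of signed characters as strings such as "+a", the function names and the cost parameters in the final function are assumptions made for illustration only.</p>

```python
from typing import Callable, Iterator, List, Tuple


def flip(c: str) -> str:
    """Flip the sign of a signed character such as '+a' or '-a'."""
    return ("-" if c[0] == "+" else "+") + c[1:]


def invert_substring(x: List[str], s: int, t: int) -> List[str]:
    """Return x with positions s..t (1-indexed, inclusive) inverted in place."""
    if s > t or 1 > s or t > len(x):
        raise ValueError("s and t must be ordered positions within x")
    middle = [flip(c) for c in reversed(x[s - 1:t])]
    return x[:s - 1] + middle + x[t:]


def all_inversions(x: List[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Enumerate every (s, t) pair: |x|(|x| + 1)/2 variants in total."""
    for s in range(1, len(x) + 1):
        for t in range(s, len(x) + 1):
            yield s, t, invert_substring(x, s, t)


def generalized_cost(
    m: int,
    ell: int,
    delta1: float,
    delta2: float,
    theta: Callable[[int], float],
) -> float:
    """Delta_1 + m * Delta_2 + Theta(ell) for a duplicate of length m."""
    return delta1 + m * delta2 + theta(ell)


if __name__ == "__main__":
    x = "+a +b +c +d +e +f".split()
    print(invert_substring(x, 3, 5))
    print(sum(1 for _ in all_inversions(x)))
```

<p>Because the number of (s, t) pairs grows quadratically in |<b>x</b>|, running the earlier recurrence once per variant accounts for the |<b>x</b>|<sup>2 </sup>factor.</p>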
</sec>
<sec>
<st>
<p>Duplication Distance as a Context-Free Grammar</p>
</st>
<p>The process of generating a string <b>y </b>by repeatedly copying substrings of a source string <b>x </b>and pasting them into an initially empty target string is naturally described by a context-free grammar (CFG). This alternative view might be useful in understanding our algorithms and their correctness. Thus, we provide the basic idea behind this connection for the simplest variant of duplication distance: no inversions or deletions and the cost of each duplicate operation is 1. For a fixed source string <b>x</b>, we construct a grammar <it>G</it>
<sub>
<it>x </it>
</sub>in which for every <it>i, j </it>such that 1 &#8804; <it>i </it>&#8804; <it>j </it>&#8804; |<b>x</b>|, there is a production rule <it>S &#8594; Sx</it>
<sub>
<it>i</it>
</sub>
<it>Sx</it>
<sub>
<it>i</it>+1</sub>
<it>S . . . Sx</it>
<sub>
<it>j</it>
</sub>
<it>S</it>.</p>
<p>These production rules correspond to duplicating the substring <b>x</b>
<sub>
<it>i</it>, <it>j </it>
</sub>. In addition there is a trivial production rule <it>S </it>&#8594; &#949;, where &#949; denotes the empty string. It is easy to see that the language described by this grammar is exactly the set of strings that can be duplicated from <b>x</b>. The non-overlapping property (Lemma 1) is now an immediate consequence of the structure of parse trees of CFGs. Finding the duplication distance from <b>x </b>to <b>y </b>is equivalent to finding a parse tree with a minimal number of non-trivial productions among all possible parse trees for <b>y</b>.</p>
<p>Consider now the slightly different grammar obtained by removing the leading <it>S </it>to the left of <it>x</it>
<sub>
<it>i </it>
</sub>from each of the production rules, so that the new rules are of the form <it>S &#8594; x</it>
<sub>
<it>i</it>
</sub>
<it>Sx</it>
<sub>
<it>i</it>+1</sub>
<it>S . . . Sx</it>
<sub>
<it>j </it>
</sub>
<it>S</it>. It is not difficult to see that both grammars produce the same language and have the same minimal size parse tree for every string <b>y</b>. The change only restricts the order in which rules are applied. For example, <it>y</it>
<sub>1 </sub>is always produced by the first production rule.</p>
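<p>The two grammars described above can be written out explicitly for a small source string. The following Python sketch lists the right-hand sides of all production rules, with a flag selecting whether the leading <it>S </it>is kept. The list-of-symbols representation and the names used are assumptions of this illustration.</p>

```python
from typing import List


def productions(x: str, leading_s: bool = True) -> List[List[str]]:
    """Right-hand sides of the rules of the grammar built from x.

    Assumption: the source string does not use the letter 'S', which is
    reserved here for the single non-terminal.
    """
    rules: List[List[str]] = [[]]  # the trivial rule: S -> empty string
    n = len(x)
    for i in range(n):
        for j in range(i, n):
            body = ["S"] if leading_s else []
            for ch in x[i:j + 1]:
                body += [ch, "S"]  # each terminal is followed by an S
            rules.append(body)
    return rules


if __name__ == "__main__":
    for body in productions("abc"):
        print("S ->", " ".join(body) if body else "(empty)")
```

<p>A source string of length n yields n(n + 1)/2 non-trivial rules, one per substring, in either variant of the grammar.</p>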
<p>The recurrence for <it>d</it>
<sub>
<it>i</it>
</sub>(<b>x</b>, <b>y</b>) naturally arises by observing that if <it>T </it>is an optimal parse tree for <b>y </b>in which the first production rule generates <it>y</it>
<sub>1 </sub>by <it>x</it>
<sub>
<it>i </it>
</sub>and <it>y</it>
<sub>
<it>j </it>
</sub>by <it>x</it>
<sub>
<it>i</it>+1</sub>, then the subtree <it>T</it>
<sub>1 </sub>of <it>T </it>that generates <b>y</b>
<sub>2, <it>j</it>-1 </sub>is a valid parse tree which is optimal for <b>y </b>
<sub>2, <it>j</it>-1</sub>. Similarly, the tree <it>T</it>
<sub>2 </sub>obtained by deleting <it>x</it>
<sub>
<it>i </it>
</sub>and <it>T</it>
<sub>1 </sub>from <it>T </it>is a valid parse tree which is optimal for <b>y</b>
<sub>
<it>j</it>,|<b>y</b>| </sub>under the restriction that <it>y</it>
<sub>
<it>j </it>
</sub>must be generated by <it>x</it>
<sub>
<it>i</it>+1 </sub>(see Fig. <figr fid="F7">7</figr>). Moreover, <it>T</it>
<sub>1 </sub>and <it>T</it>
<sub>2 </sub>are disjoint trees which contain all non-trivial productions in <it>T</it>. This explains the term <it>d</it>(<b>x</b>, <b>y</b>
<sub>2, <it>j</it>-1</sub>) + <it>d</it>
<sub>
<it>i</it>+1</sub>(<b>x</b>, <b>y</b>
<sub>
<it>j</it>,|<b>y</b>|</sub>) in Eq. 2, which is the heart of the recursion. The minimization over {<it>j </it>: <it>y</it>
<sub>
<it>j </it>
</sub>= <it>x</it>
<sub>
<it>i</it>+1</sub>, <it>j </it>&gt; 1} simply enumerates all of the possibilities for constructing <it>T</it>. The term 1 + <it>d</it>(<b>x</b>, <b>y</b>
<sub>2,|<b>y</b>|</sub>) handles the possibility that <it>y</it>
<sub>1 </sub>is generated by a duplicate operation that ends with <it>x</it>
<sub>
<it>i</it>
</sub>. In this case the tree <it>T</it>
<sub>2 </sub>is empty, so we only consider <it>T</it>
<sub>1</sub>. We add one to account for the production rule at the root of <it>T </it>which is not part of <it>T</it>
<sub>1</sub>. This is illustrated in Fig. <figr fid="F8">8</figr>.</p>
<fig id="F7"><title><p>Figure 7</p></title><caption><p>Example parse tree</p></caption><text>
   <p><b>Example parse tree</b>. An optimal parse tree <it>T </it>for <b>y </b>= bbccd where <b>x </b>= abcd. The root production duplicates <b>x</b><sub>2,4 </sub>= bcd. <it>x</it><sub>2 </sub>generates <it>y</it><sub>1 </sub>and <it>x</it><sub>3 </sub>generates <it>y</it><sub>4</sub>. The trees <it>T</it><sub>1 </sub>and <it>T</it><sub>2 </sub>are indicated. <it>T</it><sub>1 </sub>is an optimal parse tree for <b>y</b><sub>2,4-1 </sub>= bc. <it>T</it><sub>2 </sub>is an optimal parse tree for <b>y</b><sub>4,|<b>y</b>| </sub>= cd.</p>
</text><graphic file="1748-7188-5-11-7"/></fig>
<fig id="F8"><title><p>Figure 8</p></title><caption><p>Example parse tree</p></caption><text>
   <p><b>Example parse tree</b>. An optimal parse tree <it>T </it>for <b>y </b>= dab where <b>x </b>= abcd. The root production duplicates just <it>x</it><sub>4 </sub>= d. The tree <it>T</it><sub>1 </sub>is indicated. <it>T</it><sub>2 </sub>is empty (not indicated). The root production is not part of <it>T</it><sub>1</sub>.</p>
</text><graphic file="1748-7188-5-11-8"/></fig>
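<p>The recurrence discussed in this section can be sketched in Python for the simplest variant (unit cost per duplicate operation, no inversions or deletions). The sketch follows the two terms explained above; the function names, the 0-indexed half-open ranges and the use of memoization are assumptions of this illustration and are not taken from the authors' implementation.</p>

```python
import math
from functools import lru_cache


def duplication_distance(x: str, y: str) -> float:
    """Minimum number of duplicate operations from x that generate y."""

    @lru_cache(maxsize=None)
    def d(a: int, b: int) -> float:
        """Distance from x to the substring y[a:b]."""
        if a >= b:
            return 0  # the empty string needs no operation
        starts = [i for i, ch in enumerate(x) if ch == y[a]]
        # math.inf signals that y[a] does not occur in x at all.
        return min((d_i(i, a, b) for i in starts), default=math.inf)

    @lru_cache(maxsize=None)
    def d_i(i: int, a: int, b: int) -> float:
        """Distance to y[a:b] when y[a] is generated by x[i]."""
        # Case 1: the duplicate ends with x[i]; pay for it now.
        best = 1 + d(a + 1, b)
        # Case 2: the same duplicate continues, and x[i+1] generates y[j].
        if len(x) > i + 1:
            for j in range(a + 1, b):
                if y[j] == x[i + 1]:
                    best = min(best, d(a + 1, j) + d_i(i + 1, j, b))
        return best

    return d(0, len(y))


if __name__ == "__main__":
    print(duplication_distance("abcd", "bbccd"))
    print(duplication_distance("abcd", "dab"))
```

<p>For the strings of Figs. <figr fid="F7">7</figr> and <figr fid="F8">8</figr>, the sketch returns 2 in both cases, matching the parse trees shown: one root production plus one further non-trivial production.</p>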
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>We have shown how to generalize duplication distance to include certain types of deletions and inversions and how to compute these new distances efficiently via dynamic programming. In earlier work <abbrgrp>
<abbr bid="B17">17</abbr>
<abbr bid="B18">18</abbr>
</abbrgrp>, we used duplication distance to derive phylogenetic relationships between human segmental duplications. We plan to apply the generalized distances introduced here to the same data to determine if these richer computational models yield new biological insights.</p>
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>CLK, SM, and BJR all designed and analyzed the algorithms and drafted the manuscript. All authors read and approved the final manuscript.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>SM was supported by NSF Grant CCF-0635089. BJR is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and by funding from the ADVANCE Program at Brown University, under NSF Grant No. 0548311.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Gene Order Comparisons for Phylogenetic Inference: Evolution of the Mitochondrial Genome</p></title><aug><au><snm>Sankoff</snm><fnm>D</fnm></au><au><snm>Leduc</snm><fnm>G</fnm></au><au><snm>Antoine</snm><fnm>N</fnm></au><au><snm>Paquin</snm><fnm>B</fnm></au><au><snm>Lang</snm><fnm>B</fnm></au><au><snm>Cedergren</snm><fnm>R</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>1992</pubdate><volume>89</volume><issue>14</issue><fpage>6575</fpage><lpage>6579</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1073/pnas.89.14.6575</pubid><pubid idtype="pmcid">49544</pubid><pubid idtype="pmpid">1631158</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><aug><au><snm>Pevzner</snm><fnm>P</fnm></au></aug><source>Computational molecular biology: an algorithmic approach</source><publisher>Cambridge, Mass.: MIT Press</publisher><pubdate>2000</pubdate></bibl><bibl id="B3"><title><p>Assignment of Orthologous Genes via Genome Rearrangement</p></title><aug><au><snm>Chen</snm><fnm>X</fnm></au><au><snm>Zheng</snm><fnm>J</fnm></au><au><snm>Fu</snm><fnm>Z</fnm></au><au><snm>Nan</snm><fnm>P</fnm></au><au><snm>Zhong</snm><fnm>Y</fnm></au><au><snm>Lonardi</snm><fnm>S</fnm></au><au><snm>Jiang</snm><fnm>T</fnm></au></aug><source>IEEE/ACM Trans Comp Biol Bioinformatics</source><pubdate>2005</pubdate><volume>2</volume><issue>4</issue><fpage>302</fpage><lpage>315</lpage><xrefbib><pubid idtype="doi">10.1109/TCBB.2005.48</pubid></xrefbib></bibl><bibl id="B4"><title><p>Genomic Distances Under Deletions and Insertions</p></title><aug><au><snm>Marron</snm><fnm>M</fnm></au><au><snm>Swenson</snm><fnm>KM</fnm></au><au><snm>Moret</snm><fnm>BME</fnm></au></aug><source>TCS</source><pubdate>2004</pubdate><volume>325</volume><issue>3</issue><fpage>347</fpage><lpage>360</lpage><xrefbib><pubid idtype="doi">10.1016/j.tcs.2004.02.039</pubid></xrefbib></bibl><bibl id="B5"><title><p>Genome Rearrangement by Reversals and Insertions/Deletions of Contiguous 
Segments</p></title><aug><au><snm>El-Mabrouk</snm><fnm>N</fnm></au></aug><source>Proc 11th Ann Symp Combin Pattern Matching (CPM00)</source><publisher>Berlin: Springer-Verlag</publisher><pubdate>2000</pubdate><volume>1848</volume><fpage>222</fpage><lpage>234</lpage><xrefbib><pubid idtype="doi">full_text</pubid></xrefbib></bibl><bibl id="B6"><title><p>Reconstructing the Evolutionary History of Complex Human Gene Clusters</p></title><aug><au><snm>Zhang</snm><fnm>Y</fnm></au><au><snm>Song</snm><fnm>G</fnm></au><au><snm>Vinar</snm><fnm>T</fnm></au><au><snm>Green</snm><fnm>ED</fnm></au><au><snm>Siepel</snm><fnm>AC</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au></aug><source>Proc 12th Int'l Conf on Research in Computational Molecular Biology (RECOMB)</source><pubdate>2008</pubdate><fpage>29</fpage><lpage>49</lpage></bibl><bibl id="B7"><title><p>DUPCAR: Reconstructing Contiguous Ancestral Regions with Duplications</p></title><aug><au><snm>Ma</snm><fnm>J</fnm></au><au><snm>Ratan</snm><fnm>A</fnm></au><au><snm>Raney</snm><fnm>BJ</fnm></au><au><snm>Suh</snm><fnm>BB</fnm></au><au><snm>Zhang</snm><fnm>L</fnm></au><au><snm>Miller</snm><fnm>W</fnm></au><au><snm>Haussler</snm><fnm>D</fnm></au></aug><source>Journal of Computational Biology</source><pubdate>2008</pubdate><volume>15</volume><issue>8</issue><fpage>1007</fpage><lpage>1027</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1089/cmb.2008.0069</pubid><pubid idtype="pmpid" link="fulltext">18774902</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Inferring Ancestral Gene Orders for a Family of Tandemly Arrayed Genes</p></title><aug><au><snm>Bertrand</snm><fnm>D</fnm></au><au><snm>Lajoie</snm><fnm>M</fnm></au><au><snm>El-Mabrouk</snm><fnm>N</fnm></au></aug><source>J Comp Biol</source><pubdate>2008</pubdate><volume>15</volume><issue>8</issue><fpage>1063</fpage><lpage>1077</lpage><xrefbib><pubid idtype="doi">10.1089/cmb.2008.0025</pubid></xrefbib></bibl><bibl id="B9"><title><p>On the Tandem Duplication-Random Loss 
Model of Genome Rearrangement</p></title><aug><au><snm>Chaudhuri</snm><fnm>K</fnm></au><au><snm>Chen</snm><fnm>K</fnm></au><au><snm>Mihaescu</snm><fnm>R</fnm></au><au><snm>Rao</snm><fnm>S</fnm></au></aug><source>Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)</source><publisher>New York, NY, USA: ACM</publisher><pubdate>2006</pubdate><fpage>564</fpage><lpage>570</lpage><xrefbib><pubid idtype="doi">full_text</pubid></xrefbib></bibl><bibl id="B10"><title><p>Reconstructing the Duplication History of Tandemly Repeated Genes</p></title><aug><au><snm>Elemento</snm><fnm>O</fnm></au><au><snm>Gascuel</snm><fnm>O</fnm></au><au><snm>Lefranc</snm><fnm>MP</fnm></au></aug><source>Mol Biol Evol</source><pubdate>2002</pubdate><volume>19</volume><issue>3</issue><fpage>278</fpage><lpage>288</lpage><xrefbib><pubid idtype="pmpid" link="fulltext">11861887</pubid></xrefbib></bibl><bibl id="B11"><title><p>Duplication and Inversion History of a Tandemly Repeated Genes Family</p></title><aug><au><snm>Lajoie</snm><fnm>M</fnm></au><au><snm>Bertrand</snm><fnm>D</fnm></au><au><snm>El-Mabrouk</snm><fnm>N</fnm></au><au><snm>Gascuel</snm><fnm>O</fnm></au></aug><source>J Comp Bio</source><pubdate>2007</pubdate><volume>14</volume><issue>4</issue><fpage>462</fpage><lpage>478</lpage><xrefbib><pubid idtype="doi">10.1089/cmb.2007.A007</pubid></xrefbib></bibl><bibl id="B12"><title><p>The Reconstruction of Doubled Genomes</p></title><aug><au><snm>El-Mabrouk</snm><fnm>N</fnm></au><au><snm>Sankoff</snm><fnm>D</fnm></au></aug><source>SIAM J Comput</source><pubdate>2003</pubdate><volume>32</volume><issue>3</issue><fpage>754</fpage><lpage>792</lpage><xrefbib><pubid idtype="doi">10.1137/S0097539700377177</pubid></xrefbib></bibl><bibl id="B13"><title><p>Whole Genome Duplications and Contracted Breakpoint 
Graphs</p></title><aug><au><snm>Alekseyev</snm><fnm>MA</fnm></au><au><snm>Pevzner</snm><fnm>PA</fnm></au></aug><source>SICOMP</source><pubdate>2007</pubdate><volume>36</volume><issue>6</issue><fpage>1748</fpage><lpage>1763</lpage></bibl><bibl id="B14"><title><p>Primate Segmental Duplications: Crucibles of Evolution, Diversity and Disease</p></title><aug><au><snm>Bailey</snm><fnm>J</fnm></au><au><snm>Eichler</snm><fnm>E</fnm></au></aug><source>Nat Rev Genet</source><pubdate>2006</pubdate><volume>7</volume><fpage>552</fpage><lpage>564</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nrg1895</pubid><pubid idtype="pmpid" link="fulltext">16770338</pubid></pubidlist></xrefbib></bibl><bibl id="B15"><title><p>Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution</p></title><aug><au><snm>Jiang</snm><fnm>Z</fnm></au><au><snm>Tang</snm><fnm>H</fnm></au><au><snm>Ventura</snm><fnm>M</fnm></au><au><snm>Cardone</snm><fnm>MF</fnm></au><au><snm>Marques-Bonet</snm><fnm>T</fnm></au><au><snm>She</snm><fnm>X</fnm></au><au><snm>Pevzner</snm><fnm>PA</fnm></au><au><snm>Eichler</snm><fnm>EE</fnm></au></aug><source>Nature Genetics</source><pubdate>2007</pubdate><volume>39</volume><fpage>1361</fpage><lpage>1368</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/ng.2007.9</pubid><pubid idtype="pmpid" link="fulltext">17922013</pubid></pubidlist></xrefbib></bibl><bibl id="B16"><title><p>Recurrent duplication-driven transposition of DNA during hominoid evolution</p></title><aug><au><snm>Johnson</snm><fnm>M</fnm></au><au><snm>Cheng</snm><fnm>Z</fnm></au><au><snm>Morrison</snm><fnm>V</fnm></au><au><snm>Scherer</snm><fnm>S</fnm></au><au><snm>Ventura</snm><fnm>M</fnm></au><au><snm>Gibbs</snm><fnm>R</fnm></au><au><snm>Green</snm><fnm>E</fnm></au><au><snm>Eichler</snm><fnm>E</fnm></au></aug><source>Proc Natl Acad Sci USA</source><pubdate>2006</pubdate><volume>103</volume><fpage>17626</fpage><lpage>17631</lpage><xrefbib><pubidlist><pubid 
idtype="doi">10.1073/pnas.0605426103</pubid><pubid idtype="pmcid">1693797</pubid><pubid idtype="pmpid" link="fulltext">17101969</pubid></pubidlist></xrefbib></bibl><bibl id="B17"><title><p>Analysis of Segmental Duplications via Duplication Distance</p></title><aug><au><snm>Kahn</snm><fnm>CL</fnm></au><au><snm>Raphael</snm><fnm>BJ</fnm></au></aug><source>Bioinformatics</source><pubdate>2008</pubdate><volume>24</volume><fpage>i133</fpage><lpage>138</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btn292</pubid><pubid idtype="pmpid" link="fulltext">18689814</pubid></pubidlist></xrefbib></bibl><bibl id="B18"><title><p>A Parsimony Approach to Analysis of Human Segmental Duplications</p></title><aug><au><snm>Kahn</snm><fnm>CL</fnm></au><au><snm>Raphael</snm><fnm>BJ</fnm></au></aug><source>Pacific Symposium on Biocomputing</source><pubdate>2009</pubdate><fpage>126</fpage><lpage>137</lpage><xrefbib><pubid idtype="pmpid">19213134</pubid></xrefbib></bibl></refgrp>
</bm></art>