Skip to main content

DCJ-indel and DCJ-substitution distances with distinct operation costs

Abstract

Background

Classical approaches to compute the genomic distance are usually limited to genomes with the same content and take into consideration only rearrangements that change the organization of the genome (i.e. positions and orientation of pieces of DNA, number and type of chromosomes, etc.), such as inversions, translocations, fusions and fissions. These operations are generically represented by the double-cut and join (DCJ) operation. The distance between two genomes, in terms of number of DCJ operations, can be computed in linear time. In order to handle genomes with distinct contents, also insertions and deletions of fragments of DNA – named indels – must be allowed. More powerful than an indel is a substitution of a fragment of DNA by another fragment of DNA. Indels and substitutions are called content-modifying operations. It has been shown that both the DCJ-indel and the DCJ-substitution distances can also be computed in linear time, assuming that the same cost is assigned to any DCJ or content-modifying operation.

Results

In the present study we extend the DCJ-indel and the DCJ-substitution models, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost, and show that the distance in both models can still be computed in linear time. Although the triangular inequality can be disrupted in both models, we also show how to efficiently fix this problem a posteriori.

Background

The distance between two genomes is often computed using only the common markers, that occur in both genomes. Such distance allows rearrangements that change the organization of the genome, that is, the positions and orientations of markers, number and types of chromosomes. Inversions, translocations, fusions and fissions are some of these operations[1]. All these rearrangements can be generically represented as a double-cut-and-join (DCJ) operation[2]. The DCJ distance, which takes into consideration only DCJ operations, can be computed in linear time[3].

Nevertheless, genomes with the same content are rare, and differences in gene content may reflect important evolutionary aspects. In order to handle genomes with unequal contents, one has to take into consideration content-modifying operations, that change the contents of the genomes. These operations can be an insertion or a deletion of a piece of DNA. Insertions and deletions are also called indels. Some extensions of the classical approaches lead to models that handle genomes with unequal contents, but without duplicated markers, allowing rearrangements and indels. In 2001, El Mabrouk[4] extended the classical sorting by inversions approach[5] and developed a method to compare unichromosomal genomes with unequal contents, considering only inversions and indels. She provided an exact algorithm that deals with insertions and deletions asymmetrically, and a heuristic that handles the operations symmetrically. Then, in 2009, a model to sort multichromosomal genomes with unequal contents, using both DCJ and indel operations was introduced by Yancopoulos and Friedberg[6]. Later, Braga et al.[7] presented an exact formula for the DCJ-indel distance, that can be computed in linear time handling indels symmetrically.

Recently, in 2011, a more powerful content-modifying operation has also been considered: a substitution allows a piece of DNA to be substituted by another piece of DNA[8]. Observe that it is not suggested that a substitution occurs in a precise moment in evolution, but instead it represents a region that underwent continuous mutations (duplications, losses and gene mutations), so that a group of genes is transformed into a different group of genes (either of which may also be empty, allowing a substitution to represent an insertion or a deletion). Other studies also represent continuous mutations as a rearrangement event[9, 10]. By minimizing substitutions we are able to establish a relation between indels that could have occurred in the same position of the compared genomes, identifying genomic regions that could be subject to these continuous mutations. It has been shown that the DCJ-substitution distance can also be computed in linear time[8].

The approaches mentioned above[4, 68] assign the same cost to any rearrangement or content-modifying operation. However, during the evolution of many organisms, content-modifying operations are said to occur more often than rearrangements and, consequently, should be assigned to a lower cost. Examples are bacteria that are obligate intracellular parasites, such as Rickettsia[11]. The genomes of such intracellular parasites are observed to have a reductive evolution, that is, the process by which genomes shrink and undergo extreme levels of gene degradation and loss. In the present work, we refine the DCJ-indel[7] and the DCJ-substitution[8] models, by adopting a distinct content-modifying cost that is upper bounded by the DCJ cost. For simplicity, we assign a cost of 1 to DCJ and a positive cost of w≤1 to content-modifying operations. We are then able to give exact formulas for both the DCJ-indel and the DCJ-substitution distances, for any positive w≤1.

Content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations[4, 68, 12]. In the case of the models we study here, it is possible to do an a posteriori correction, using an approach similar to the one described in[12].

This paper is an extension of[13] and is organized as follows. In the remainder of this section we give definitions and previous results used in this work. We will then present our results, including the formulas for the distances with distinct DCJ and content-modifying costs and the correction to establish the triangular inequality.

Genomes

We deal with models in which duplicated markers are not allowed. Given two genomes A and B, possibly with unequal content, letG,A andB be three disjoint sets, such thatG is the set of markers that occur both in A and B,A is the set of markers that occur only in A, andB is the set of markers that occur only in B. The markers in setsA andB are also called unique markers. We denote byu(A,B)=|A|+|B| the number of unique markers in genomes A and B.

Each marker g in a genome is a DNA fragment and is represented by the symbol g, if it is read in direct orientation, or by the symbol g ¯ , if it is read in reverse orientation. Each one of the two extremities of a linear chromosome is called a telomere, represented by the symbol . Each chromosome in a genome can be then represented by a string that can be circular, if the chromosome is circular, or linear and flanked by the symbols if the chromosome is linear. In general, a genome is either circular (composed of circular chromosomes) or linear (composed of linear chromosomes). As an example, consider the linear genomesA= bsu c ¯ av d ¯ e andB= awb x ¯ c , ydze , represented in Figure1. Here we haveG={a,b,c,d,e},A={s,u,v} andB={w,x,y,z}.

Figure 1
figure 1

Genomes A= bsu c ¯ av d ¯ e , composed of one single linear chromosome, andB={awb x ¯ c, y d z e }, composed of two linear chromosomes. The markers inG are represented in black, while the unique markers inA and inB are represented in red.

The DCJ model

In this section we will summarize the DCJ model, that allows the sorting of the common content of two genomes, also called DCJ-sorting. We will also show how the DCJ distance can be easily computed with the help of the adjacency graph.

Given two genomes A and B, we denote the two extremities of eachgG by gt (tail) and gh (head). Then, aG-adjacency or simply adjacency[7] in genome A (respectively in genome B) is a stringv= γ 1 γ 2 γ 2 ¯ γ 1 , such that each γ i can be a telomere or an extremity of a marker fromG and is a substring composed of the markers that are between γ1 and γ2 in A (respectively in B) and contains no marker that also belongs toG. The substring is the label of v. If is empty, the adjacency is said to be clean, otherwise it is said to be labeled. If a linear chromosome is composed only of unique markers, it is represented by an adjacency . Similarly, a circular chromosome composed only of unique markers is represented by a (circular) adjacency . For the linear genomes represented in Figure1, the set of adjacencies in A is {bt, bhs u ch, ctat, ahv dh, dtet, eh } and the set of adjacencies in B is a t , a h w b t , b h x ¯ c t , c h , y d t , d h z e t , e h .

Adjacency graph

Given two genomes A and B, the adjacency graph A G(A, B)[3] is the bipartite multigraph whose vertices are the adjacencies of A and of B and that has one edge for each common extremity of a pair of vertices. Each of the connected components of A G(A, B) alternate vertices in genome A and in genome B. Each component can be either a cycle, or an A B -path (that has one endpoint in genome A and the other in B), or an A A -path (that has both endpoints in genome A), or a B B -path (that has both endpoints in B). A special case of an A A or a B B-path is a linear singleton, that is a linear chromosome represented by an adjacency of type , where contains only unique markers. Paths occur when the genomes are linear. For circular genomes, the graph A G(A, B) is composed of cycles only, and may also have a special type of component composed of a single vertex, that corresponds to a circular chromosome composed only of markers that are not inG, called circular singleton. In Figure2 we show the adjacency graph built over the linear genomes represented in Figure1.

Figure 2
figure 2

For genomes A and B (Figure 1), the graph has one BB and two AB -paths

DCJ operations

A cut performed on a genome A separates two adjacent markers of A. A cut affects a single adjacency v in A: it is done between two symbols of v, creating two open ends. In general a cut can be performed between two markers of a label, but the DCJ-indel distance can be computed considering only cuts that do not “break” labels. A double-cut and join or DCJ applied on a genome A is the operation that performs cuts in two different adjacencies in A, creating four open ends, and joins these open ends in a different way. In other words, a DCJ rearranges two adjacencies in A, transforming them into two new adjacencies. As an example consider a DCJ applied to genome A (from Figure1), that rearranges the adjacencies ahv dh and dtet into the new adjacencies ahv dt and dhet. Observe that this operation corresponds to the inversion of marker d in genome A. Indeed, a DCJ operation can correspond to several rearrangements, such as an inversion, a translocation, a fusion or a fission[2].

DCJ-sorting and DCJ distance

Given two genomes A and B, the components of A G(A, B) with 3 or more vertices need to be reduced, by applying DCJ operations, to components with only 2 vertices, that can be cycles or A B-paths[14]. This procedure is called DCJ-sorting of A into B. The number of A B-paths in A G(A, B) is always even and a DCJ can be of three types[7]: it can either decrease the number of cycles by one, or the number of A B-paths by two (counter-optimal); or it does not affect the number of cycles and A B-paths (neutral); or it can either increase the number of cycles by one, or the number of A B-paths by two (optimal). The DCJ distance of A and B, denoted by d D C J (A, B), is the minimum number of steps required to do a DCJ-sorting of A into B, given by the following theorem.

Theorem 1 (from[3]). Given two genomes A and B , we have d DCJ (A,B)=|G|c b 2 , whereGis the set of common markers and c and b are, respectively, the number of cycles and of A B -paths in A G(A, B).

Internal DCJ operations and recombinations

Observe that a DCJ operation ρ acts on two different adjacencies, that can be in the same or in two distinct connected components of the graph. The components on which the cuts are applied are called sources and the components obtained after the joinings are called resultants of ρ. With respect to the adjacency graph, ρ can be of two types: internal, when ρ is applied to two adjacencies belonging to a single component; and recombination, when ρ is applied to adjacencies belonging to two distinct components.

Any recombination applied to a vertex of an A A-path and a vertex of a B B-path is optimal[14]. A recombination applied to vertices of two distinct A B-paths can be either neutral, when the resultants are also A B-paths, or counter-optimal, when the resultants are an A A-path and a B B-path. All other types of path recombinations are neutral and all recombinations involving at least one cycle are counter-optimal.

It is possible to do a separate DCJ-sorting in any component P of A G(A, B)[14] by applying DCJs internal to P. We denote by d D C J (P) the number of optimal DCJ operations used for DCJ-sorting P separately (d D C J (P) depends only on the number of vertices or, equivalently, the number of edges of P[14]). Thus, the DCJ distance can also be re-written in terms of the sum of the distance per component:

Lemma 1 (derived from[14]). Given two genomes A and B , we have d DCJ (A,B)= P A G ( A , B ) d DCJ (P).

Only optimal DCJs, counted in the equivalent formulas given by Theorem 1 and Lemma 1, are necessary to do a DCJ-sorting. Given a DCJ ρ, the DCJ variation of ρ, denoted by Δ D C J (ρ), is defined to be respectively 0, 1 and 2 depending whether ρ is optimal, neutral or counter-optimal.

Modifying the content of a genome

In the previous section, the unique markers appeared as labels of adjacencies, but the DCJ operations are only able to change the organization of the genomes. Here we introduce the operations that are applied to the labels and change the content of the genomes.

Indel operations

The most classical content-modifying operations are insertions and deletions of blocks of contiguous markers[4, 6]. We refer to insertions and deletions as indel operations. In the model we consider, an indel only affects the label of one single adjacency, by deleting or inserting contiguous markers in this label, with the restriction that an insertion cannot produce duplicated markers[7]. Thus, while sorting A into B, the indels are the steps in which the markers inA are deleted and the markers inB are inserted. At most one chromosome can be entirely deleted or inserted at once. We illustrate an indel with the following example: the deletion of markers su from adjacency bhs u ch of genome A (Figure2), which results into the clean adjacency bhch. The opposite operation would be an insertion.

Substitutions

Substitutions are more powerful content-modifying operations, that allow blocks of contiguous markers to be substituted by other blocks of contiguous markers[8]. In other words, a deletion and a subsequent insertion that occur at the same position of the genome can be modeled as a substitution, counting together for one single sorting step.

A substitution only affects the label of one single adjacency, by substituting contiguous markers in this label, with the restriction that it cannot produce duplicated markers[8]. An example is the substitution of markers su in adjacency bhs u ch by x ¯ , which results into adjacency b h x ¯ c h . At most one chromosome can be entirely substituted at once (but we do not allow the substitution of a linear by a circular chromosome nor vice-versa). As previously mentioned, insertions and deletions are special cases of substitutions. If a block of markers is substituted by the empty string, we have a deletion. Analogously, if the empty string is substituted by a block of markers, we have an insertion.

Runs, indel- and substitution-potentials

In this section we introduce some definitions and concepts that will help us to integrate the DCJ model with content-modifying operations. These concepts will be very useful in our results, when we will show how to use DCJ operations to minimize the number of content-modifying operations to be performed.

First, let us recall the concept of run, introduced in[7]. Given two genomes A and B and a component P of A G(A, B), a run is a maximal subpath of P, in which the first and the last vertices are labeled and all labeled vertices belong to the same genome (or partition). An example is given in Figure3. A run in genome A is also called anA-run, and a run in genome B is called aB-run. We denote by Λ(P) the number of runs in a component P. While a path can have any number or runs, a cycle has either 0, 1, or an even number of runs.

Figure 3
figure 3

An AB -path with 3 runs (extracted from Figure 2).

A set of labels of one genome can be accumulated with DCJs. For example, take the adjacencies dhz et and d t y ¯ from genome B (Figure3). A DCJ applied to these two adjacencies could result into dtet and d h z y ¯ , in which the labelz y ¯ resulted from the accumulation of the labels of the two original adjacencies. In particular, when we apply optimal DCJs internal to a single component of the adjacency graph, we can accumulate an entire run into a single adjacency[7].

Runs can be merged by DCJ operations. Consequently, during the optimal DCJ-sorting of a component P, we can reduce its number of runs. The indel-potential of P, denoted by λ(P), is defined in[7] as the minimum number of runs that we can obtain by DCJ-sorting P with optimal DCJ operations. An example is given in Figure4.

Figure 4
figure 4

Two optimal sequences for DCJ-sorting an AB -path with Λ =3 (the cuts of each DCJ in each sequence are represented by “ |”). In (i) the overall number of runs in the resulting components is three, while in (ii) the resulting components have only two runs. Indeed, in this case, the best we can have is the indel-potential λ=2.

The indel-potential of a component depends only on its number of runs:

Proposition 1 (from[7]).Given two genomes A and B and a component P of A G(A, B), the indel-potential of P is given byλ(P)= Λ ( P ) + 1 2 , if Λ(P)≥1. Otherwise, if Λ(P)=0, then λ(P)=0.

Similarly, the substitution-potential of a component P is the minimum number of substitutions that we can obtain by DCJ-sorting P with optimal DCJ operations. The substitution-potential is denoted by σ(P) and can be computed as follows:

Proposition 2 (from[8]). Given genomes A and B and a component P of A G(A, B), the substitution-potential of P is given byσ(P)= Λ ( P ) + 1 4 , if Λ(P)≥1. Otherwise, if Λ(P)=0, then σ(P)=0.

Results

In this section we show how to compute the DCJ-indel and the DCJ-substitution distances, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost. We assign the cost of 1 to each DCJ and a positive cost w≤1 to each content-modifying operation.

The DCJ-indel model with distinct operation costs

First we consider the case in which only indels are allowed as content-modifying operations. Given two genomes A and B, we define the DCJ-indel distance of A and B, denoted by d DCJ id (A,B), as the minimum cost of a DCJ-indel sequence of operations that sorts A into B. If w=1, the DCJ-indel distance corresponds exactly to the minimum number of steps required to sort A into B. To compute the distance in this case, a linear algorithm was given in[7]. Here we present a more general method to compute the DCJ-indel distance for any positive w≤1.

An upper bound for the DCJ-indel distance

We can obtain a good upper bound for the DCJ-indel distance by showing how to compute the DCJ-indel distance per component. Given a DCJ operation ρ, let λ0 and λ1 be, respectively, the sum of the indel-potentials for the components of the adjacency graph before and after ρ, and let Δ λ(ρ)=λ1λ0. If ρ is an optimal DCJ internal to a single component of the graph, the definition of indelpotential implies Δ λ(ρ)≥0. We also have Δ λ(ρ)≥0, if ρ is counter-optimal, and Δ λ(ρ)≥−1, if ρ is neutral[7]. Recall that Δ D C J (ρ) is, respectively, 0, 1 and 2, depending whether the DCJ ρ is optimal, neutral or counter-optimal. We define ΔD C J-λ(ρ)=Δ D C J (ρ) + w Δ λ(ρ).

We know that each component P of A G(A,B) can be DCJ-sorted separately, and the labels can then be easily sorted with indel operations. Let d DCJ id (P) be the DCJ-indel distance of P, that is the minimum cost of a DCJ-indel sequence of operations sorting P separately. This can be computed according to the following proposition.

Proposition 3. For each PA G(A, B), d DCJ id (P)= d DCJ (P)+(P).

Proof. By the definition of λ, the best we can do with optimal DCJs is d D C J (P)+w λ(P). From[7], we have ΔD C J-λ(ρ)≥2 if ρ is counter-optimal, thus we can only get more expensive sorting scenarios if we use such operation. We also know that, if ρ is neutral ΔD C J-λ(ρ)≥1−w≥0, for any positive w≤1. □

This allows us to get a good upper bound for the DCJ-indel distance with distinct operation costs:

Lemma 2. Given two genomes A and B and a positive indel cost w≤1, we have

d DCJ id ( A , B ) d DCJ ( A , B ) + w P A G ( A , B ) λ ( P ) .

Proof. If we sort the components separately we have d DCJ id (A,B) P A G ( A , B ) d DCJ id (P), which, according to Lemma 1 and Proposition 3, corresponds exactly to d DCJ (A,B)+w P A G ( A , B ) λ(P).

Recombinations and the exact DCJ-indel distance

Until this point, we have explored the possible effects of any DCJ that is internal to a single component of the graph. Now we will analyze the effect of recombinations, that have Δ λ≥−2[7]. We saw previously that any recombination involving cycles is counter-optimal. Since any counter-optimal recombination has ΔD C J-λ≥2−2w≥0, only path recombinations can have ΔD C J-λ <0.

Although the space of recombinations is not small, some observations allow us to explore it efficiently. Proposition 1 shows that the indel-potential increases of one when the number of runs increases of two. Furthermore, when we decrease the number of runs of a path by one, it will decrease the indel-potential only if its initial number of runs is one or a multiple of two. However, the exact number of runs does not really matter. In the path recombination analysis, we only have to consider the following properties for each path:

  • whether it is an AA, or a BB, or an AB-path;

  • whether it has zero, or an odd or an even number of runs; and

  • whether its first run is in A or in B (by convention, an AB-path is always read from A to B).

An empty sequence (with no run) is represented by ε. For the benefit of the reader, for an integer i≥0, letA (respectivelyB) be a sequence with odd 2i+1 runs, starting and ending with anA-run (respectivelyB-run). Similarly, letAB (respectivelyBA), be a sequence with even 2i+2 runs, starting with anA-run (respectivelyB-run) and ending with aB-run (respectivelyA-run). Then each one of the notations A A ε ,A A A ,A A B ,A A A B A A B A , B B ε ,B B A ,B B B ,B B A B B B B A , A B ε ,A B A ,A B B ,A B A B andA B B A represents a particular type of path (AA, BB or AB) with a particular structure of runs (ε,A,B,AB orBA). An example of this notation is given in Figure5, which represents a neutral recombination possibly with ΔD C J-λ <0.

Figure 5
figure 5

Neutral recombination that has Δ D C J - λ =1−2 w (we represent only the labels of the adjacencies, the cuts of the recombination are represented by “ /” and “ ”).

Each type of recombination can lead to different resultants, depending on where the cuts are applied. However, it is always possible to choose the “best” resultants in each case: we take the recombination with the smallest Δ+D C J-λ, whose resultants can be better reused in further recombinations. The main observations to guide this task are: only recombinations of paths whose runs areAB orBA have Δ λ=−2 and only recombinations of type A A+B B are optimal and have Δ D C J =0. In Table1, we list all path recombinations that can have ΔD C J-λ<0, together with neutral recombinations that have ΔD C J-λ=1−w≥0, but produce anA A A B or aB B A B path. We denote by ∙ an AB-path that never appears as a source of a recombination in this table (these paths are A B ε ,A B A andA B B ).

Table 1 Path recombinations that are used to compute the DCJ-indel distance
The DCJ-indel distance formula

By analyzing the whole universe of operations, we could identify groups of recombinations, as listed in Table2. Since some resultants of recombinations can be used in other recombinations, the groups can have more than one recombination. GroupsP, S 1 and S 2 are composed of a single recombination, while groupsT, N 1 and N 2 are composed of two recombinations and groupsQ andM are composed of three recombinations. recombination is not an associative operation, thus, in column ‘DCJ seq.’ of Table2, we indicate how the sequence of DCJs must be applied in each group (the symbol separates preceeding and succeeding recombinations).

Table 2 All recombination groups that determine the deductions for computing the DCJ-indel distance

While in groupsQ andT the preceding recombinations have lower ΔD C J-λ, in groupsM, N 1 and N 2 we need to use operations of type n-1 in order to prepare better recombinations. Another important observation concerning groupsQ andT is that, although their ΔD C J-λ indicate thatQ could be applied for w>1/4 andT could be applied for w>1/3, the last operation of these groups is of type n-2 and actually increases ΔD C J-λ for w≤1/2. For this reason, we skip groupsQ andT for w≤1/2 (there is no loss with this approach, since their optimal operations are then counted in S 1 ).

The deductions shown in Table2 can be computed with an approach that greedily maximizes the number of occurrences inP,Q,T, S 1 , S 2 ,M, N 1 and N 2 in this order. The two groups inQ are mutually exclusive after maximizingP. The lines inT are subgroups of the lines inQ, that is, they are only computed when there are enough remaining components after maximizingQ. Similarly, each one of the remaining groups are computed when there are enough remaining components after maximizing the upper groups. With the results presented in this section we have an exact formula to compute the DCJ-indel distance:

Theorem 2. Given two genomes A and B and a positive indel cost w≤1,

d DCJ id ( A , B ) = d DCJ ( A , B ) + w P A G ( A , B ) λ ( P ) 2 w P ( 4 w 1 ) Q ( 3 w 1 ) T w S 1 ( 2 w 1 ) ( S 2 + 2 M + N 1 ) ( 3 w 2 ) N 2 ,

whereP,Q,T, S 1 , S 2 ,M, N 1 and N 2 are computed as described above.

As we mentioned before, the groupsQ andT are skipped (Q=T=0) for w≤1/2. Furthermore, we also have S 2 =M= N 1 =0 if w≤1/2 and N 2 =0 if w≤2/3. Although some groups have reusable resultants, those are actually never reused (if groups that are lower in the table use as sources resultants from higher groups, the sources of all referred groups would be previously consumed in groups that occupy even higher positions in the table). Due to this fact, the number of occurrences in each group depends only on w and the initial number of each type of component.

Observe that, for w=1, our formula is identical to the one proposed in[7]. Actually, for any 2/3<w≤1, the two formulas are equivalent, since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.

We illustrate the result of our formula with an example. Let A G(A,B) have only the following labeled paths: twoA A A B , oneB B A and oneB B B . In this case, there are no occurrences ofP, thus we haveP=0. If we takew> 1 2 , all labeled paths are consumed in one occurrence ofQ. We haveQ=1, while all other values are zero, resulting in ΔD C J-λ=1−4w. On the other hand, ifw 1 2 , we automatically setQ=T= S 2 =M= N 1 = N 2 =0. The labeled paths are consumed in two occurrences of S 1 , that is, S 1 =2, resulting in ΔD C J-λ=−2w. For sure, −2w≤1−4w only ifw 1 2 .

The DCJ-substitution model with distinct operation costs

Now we consider a different model in which substitutions are the content-modifying operations. Recall that substitutions include indels. Again we assign the cost of 1 to each DCJ and the cost of w≤1 to each substitution. The DCJ-substitution distance of genomes A and B, denoted by d DCJ sb (A,B), is then the minimum cost of a DCJ-substitution sequence that sorts A into B. If w=1, this corresponds exactly to the minimum number of steps required to sort A into B and can be computed in linear time[8]. Here we present a general method to compute the DCJ-substitution distance for any positive w≤1. Similarly to the approach used with the DCJ-indel model, we will first use internal DCJs to obtain a good upper bound and then analyze recombinations to compute the exact DCJ-substitution distance.

An upper bound for the DCJ-substitution distance

We can also obtain a good upper bound for the DCJ-substitution distance by showing how to compute the DCJ-substitution distance per component. Given a DCJ operation ρ, let σ0 and σ1 be, respectively, the sum of the substitution-potentials for the components of the adjacency graph before and after ρ, and let Δ σ(ρ)=σ1σ0. If ρ is an optimal DCJ internal to a single component of the graph, the definition of substitution-potential implies Δ σ(ρ)≥0. We also have Δ σ(ρ)≥0, if ρ is counter-optimal, and Δ σ(ρ)≥−1, if ρ is neutral[8]. We define ΔD C J-σ(ρ)=Δ D C J (ρ)+w Δ σ(ρ).

After DCJ-sorting a component P of A G(A,B), the remaining labels can be easily sorted with substitutions. Let d DCJ sb (P) be the DCJ-substitution distance of P, that is the minimum cost of a DCJ-substitution sequence of operations sorting P separately. This is given by the following proposition.

Proposition 4. For each PA G(A, B), d DCJ sb (P)= d DCJ (P)+(P).

Proof. Analogous to the proof of Proposition 3. □

If P is a singleton in A G(A,B), d DCJ sb (P)=w (the indel of the whole chromosome). A linear cannot be substituted by a circular singleton and vice-versa. However, a pair composed by a singleton in genome A and a singleton in genome B, such that both are linear or both are circular, can be sorted with one substitution (which saves one sorting step per pair). Let P L S and P C S be, respectively, the maximum number of disjoint pairs of linear and circular singletons in A G(A,B). Together with Proposition 4, these numbers give a good upper bound for the DCJ-substitution distance:

Lemma 3. Given genomes A and B and a positive substitution cost w≤1,

d DCJ sb ( A , B ) d DCJ ( A , B ) + w P A G ( A , B ) σ ( P ) w ( P L S + P C S ) ,

where P L S and P C S are the numbers of disjoint pairs of linear and circular singletons.

Proof. If we sort the components separately we have d DCJ sb (A,B) P A G ( A , B ) d DCJ sb (P), which, according to Lemma 1 and Proposition 4, corresponds exactly to d DCJ (A,B)+w P A G ( A , B ) σ(P). □

Recombinations and the exact DCJ-substitution distance

Now we also need to analyze the effect of path recombinations, that have Δ σ(ρ)≥−2[8], in the DCJ-substitution distance. Here the space of recombinations is even larger, but can still be efficiently explored. Proposition 2 shows that the substitution-potential increases of one when the number of runs increases of four. Furthermore, when we decrease the number of runs of a path by one, it will decrease the indel-potential only if its initial number of runs is one or a multiple of four. Again, the exact number of runs does not really matter. We have to consider the following properties for each path:

  • whether it is an AA, or a BB, or an AB-path;

  • whether it has zero, or a number of runs that is a multiple of four, or a multiple of four plus 1, or a multiple of four plus 2, or a multiple of four plus 3; and

  • whether its first run is in A or in B (by convention, an AB-path is always read from A to B).

Recall that an empty sequence (with no run) is represented by ε. For labeled paths we adopt a different meaning forA,B,AB,BA: for an integer i≥0, letA (respectivelyB) be a sequence with odd 4i+1 runs, starting and ending with anA-run (respectivelyB-run), and letAB (respectivelyBA), be a sequence with even 4i+2 runs, starting with anA-run (respectivelyB-run) and ending with aB-run (respectivelyA-run). Here we still have some additional cases: letABA (respectivelyBAB) be a sequence with odd 4i+3 runs, starting and ending with anA-run (respectivelyB-run), and letABAB (respectivelyBABA), be a sequence with even 4i+4 runs, starting with anA-run (respectivelyB-run) and ending with aB-run (respectivelyA-run). Then, for each type of path (AA, BB or AB) with a particular structure of runs (A,B,AB,BA,ABA,BAB,ABAB, orBABA), we have a particular notation. An example of this notation is given in Figure6, which represents a neutral recombination with ΔD C J-σ=1−w.

Figure 6
figure 6

Neutral recombination that has Δ D C J - σ =1 − w (we represent only the labels of the adjacencies, the cuts of the recombination are represented by “ /”).

Again, although each type of recombination can lead to different resultants, it is always possible to choose the “best” resultants in each case: we take the recombination with the smallest ΔD C J-σ, whose resultants can be better reused. In Table3, we list all recombinations that can have ΔD C J-σ<0, together with those that have ΔD C J-σ=1−w≥0, but produce an AA or a BB-path with runsABAB orA orB. We denote by ∙ an AB-path that never appears as a source in this table (these are all AB paths, with the exception ofA B A B A B andA B B A B A ).

Table 3 Path recombinations that are used to compute the DCJ-substitution distance
The DCJ-substitution distance formula

In Table4 we list groups of recombinations, which allow the computation of the exact DCJ-substitution distance, with an approach that greedily maximizes the number of occurrences inU,V,W, X 1 , X 2 ,Y, Z 1 and Z 2 in this order. The two groups inV are mutually exclusive after maximizingU, while those inW are subgroups ofV (they are only computed when there are enough remaining components after maximizingV). Similarly, each one of the remaining groups are computed when there are enough remaining components after maximizing the upper groups. As previously observed, the recombination is not associative, thus the column ‘DCJ seq.’ determines in which order the sequence of DCJs must be applied in each group. Here we also need to skip some recombinations depending on the value of w. In particular, although ΔD C J-σ indicates thatW could be applied for w>1/3 andV for w>1/4, the last operation of these groups is of type n -2 and increases ΔD C J-σ for w≤1/2. GroupsV andW are skipped for w≤1/2, and their optimal operations are then counted in X 1 .

Table 4 All recombination groups that determine the deductions for computing the DCJ-substitution distance

The recombinations allow us to obtain an exact formula for the DCJ-substitution distance:

Theorem 3

Given genomes A and B and a positive substitution cost w≤1,

d DCJ sb ( A , B ) = d DCJ ( A , B ) + w P A G ( A , B ) σ ( P ) 2 w U ( 4 w 1 ) V ( 3 w 1 ) W w X 1 ( 2 w 1 ) ( X 2 + 2 Y + Z 1 ) ( 3 w 2 ) Z 2 w ( P L S + P C S ) ,

whereU,V,W, X 1 , X 2 ,Y, Z 1 and Z 2 are computed as described above and P L S and P C S are the numbers of disjoint pairs of linear and circular singletons.

Observe that the number of occurrences in each group depends only on w and the initial number of each type of component and, for any 2/3<w≤1, our formula is equivalent to the one proposed in[8], since the same occurrences of groups of recombinations and an equivalent upper bound are taken into account.

Complexity

Both A G(A,B) and d D C J (A,B) can be computed in linear time[3]. The occurrences in each recombination group depends only on w and the initial components. The runs are obtained by a single walk through each path, thus the whole procedure takes linear time for both models.

Establishing the triangular inequality

We have presented two genomic distances that combine DCJ and content-modifying operations and can be computed in linear time. However, content-modifying operations are applied to pieces of DNA of any size, and a side effect of this fact is that the triangular inequality often does not hold for distances that consider these operations[4, 6]-[8, 12].

Let A, B and C be three genomes, with unequal contents, and consider, without loss of generality, that d DCJ id (A,B) d DCJ id (A,C) and d DCJ id (A,B) d DCJ id (B,C). The triangular inequality is then the property which guarantees that the inequality d DCJ id (A,B) d DCJ id (A,C)+ d DCJ id (B,C)also holds. Unfortunately this is not the case for the DCJ-indel distance, and also not the case for the DCJ-substitution distance. Take for example the genomes A={a b c d e},B={ac d ¯ be} and C={a e}[6]. While the cost of sorting A (or B) into C is w (one indel), the minimum number of DCJs (that are inversions in this case) required to sort A into B is three. We have d DCJ id (A,B)=3, d DCJ id (A,C)=w, d DCJ id (B,C)=w and the triangular inequality is disrupted.

Denote byA,B,C,D,E,F andG the disjoint sets of markers such that:A,B orC are the sets of markers that occur respectively only in genome A, B or C, the markers inD are common only to genomes A and B, the markers inE are common only to B and C, the markers inF are common only to A and C, and,Gis the set of markers that are common to all three genomes A, B and C. These sets are represented in Figure7.

Figure 7
figure 7

The set of markers of each genome is represented by a circle.

WhenD=, meaning that genomes A and B have no common marker that does not occur in C, the triangular inequality holds for both DCJ-indel and DCJ-substitution distances[12]. However, ifD, the triangular inequality can be disrupted for d DCJ id and d DCJ sb , and this may be an obstacle if one intends to use these distances to compute the median of three or more genomes and in phylogenetic reconstructions.

It is possible to establish the triangular inequality in our two models a posteriori, by adapting an approach proposed in[12]: we simply sum to each distance a surcharge that depends on the number of unique markers, as we will see in the following subsections.

We define the diameter as the maximum distance between any pair of genomes, usually as a function on the size of the genomes. We use this definition in the next results.

Correction for the DCJ-indel distance

For genomes A and B and a positive constant k, let m id (A,B)= d DCJ id (A,B)+k·u(A,B), where u(A,B) is the number of unique markers between A and B[7],[12]. We then have m id (A,B)= d DCJ id (A,B)+k(|A|+|F|+|B|+|E|), m id (A,C)= d DCJ id (A,C)+k(|A|+|D|+|C|+|E|) and m id (B,C)= d DCJ id (B,C)+k(|B|+|D|+|C|+|F|). From this definition we can derive a simpler inequality that can be used to determine the value of the constant k:

Proposition 5 (from[12]). Given three genomes A, B and C without duplicated markers, the inequality mid(A,B)≤mid(A,C)+mid(B,C) holds if, and only if, d DCJ id (A,B) d DCJ id (A,C)+ d DCJ id (B,C)+2k|D|, whereDis the set of markers common only to A and B.

The problem now is to find the minimum value of k for which the inequality of Proposition 5 holds. In order to accomplish this task, the first step is to determine the diameter of the DCJ-indel distance.

Lemma 4. Given a positive indel cost w≤1 and two genomes A and B with n common markers, then

d DCJ id ( A , B ) ( w + 1 ) n + w ( L A + S A + L B + S B ) ,

where L A , S A and L B , S B are, respectively, the number of linear chromosomes and circular singletons in genomes A and B.

Proof. Let |P| be the number of vertices in component P, that is DCJ-sorted with | P | 1 2 DCJs[14]. If |P| is even, P is sorted with | P | 2 1 DCJs andλ(P) | P | 2 +1 indels, then d DCJ id (P) | P | 2 1+w( | P | 2 +1)= ( w + 1 ) | P | 2 +w1. If |P| is odd, P is sorted with | P | 1 2 DCJs andλ(P) | P | + 1 2 indels, then d DCJ id (P) | P | 1 2 +w | P | + 1 2 = ( w + 1 ) | P | + w 1 2 . As w≤1 impliesw1 w 1 2 0, for any component P we have d DCJ id (P) ( w + 1 ) | P | + w 1 2 . Then, d DCJ id (A,B) P A G ( A , B ) d DCJ id (P) P A G ( A , B ) ( w + 1 ) | P | + w 1 2 = w + 1 2 P A G ( A , B ) |P|+ P A G ( A , B ) w 1 2 . Each linear chromosome corresponds to one path in A G(A, B), thus the number of components is at least (L A + S A + L B + S B ) and P A G ( A , B ) w 1 2 ( L A + S A + L B + S B ) ( w 1 ) 2 0. Furthermore, from[12] we know that P A G ( A , B ) |P|=2n+ L A + S A + L B + S B . □

We are ready to generalize the result of[12], and determine the minimum possible value of k.

Theorem 4. For any positive indel cost w≤1, the function midsatisfies the triangular inequality if and only ifk w + 1 2 .

Proof. Recall that, to prove the triangular inequality for mid, we only need to find a k such that d DCJ id (A,B) d DCJ id (A,C)+ d DCJ id (B,C)+2k|D| holds (Proposition 5). We know that the inequality holds whenD=[12]. It remains to examine the case in whichD. The worst case would be to have an empty genome C[12]. Let X A and X B be the number of chromosomes in A and B. Since C is empty, we know that d DCJ id (A,C)=w X A and d DCJ id (B,C)=w X B . From Lemma 4, we have d DCJ id (A,B)(w+1)|D|+w( L A + S A + L B + S B ). This gives(w+1)|D|+w( L A + S A + L B + S B )w( X A + X B )+2k|D|. Since L A + S A + L B + S B X A + X B , we have(w+1)|D|2k|D|, which holds for anyk w + 1 2 .

For the necessity, take A and B with n common markers and let each genome be composed of one circular chromosome, meaning that we have one adjacency per common marker in each genome (or n adjacencies per genome). Then let A G(A, B) have one single cycle with 2n vertices and let each vertex be labeled, so that the number of runs in the cycle is 2n and the number of unique markers in each genome is n. Thus, we have d DCJ id (A,B)=(n1)+w(n+1)=(w+1)n+(w1) and the corrected distance is mid(A, B)=(w + 1)n + (w − 1) + 2k n. Take C as an empty genome, so that d DCJ id (A,C)= d DCJ id (B,C)=w and mid(A, C)=mid(B, C)=w + 2k n. The inequality mid(A, B)≤mid(A, C) + mid(B, C) corresponds to (w + 1)n + (w − 1) + 2k n≤2w + 4k n or, equivalently, 2k n≥(w + 1)nw − 1, that isk w + 1 2 1 1 n , which holds for all n only ifk w + 1 2 . □

Correction for the DCJ-substitution distance

Similarly, in the case of the DCJ-substitution distance, for genomes A and B and a positive constant k, let m sb (A,B)= d DCJ sb (A,B)+ k ·u(A,B), where u(A, B) is the number of unique markers between A and B[7],[12]. We then have m sb (A,B)= d DCJ sb (A,B)+ k (|A|+|F|+|B|+|E|), m sb (A,C)= d DCJ sb (A,C)+ k (|A|+|D|+|C|+|E|) and m sb (B,C)= d DCJ sb (B,C)+ k (|B|+|D|+|C|+|F|). Again, from this definition we can derive a simpler inequality that can be used to determine the value of the constant k:

Proposition 6 (from[12]). Given three genomes A, B and C without duplicated markers, the inequality msb(A, B)≤msb(A, C)+msb(B, C) holds if, and only if, d DCJ sb (A,B) d DCJ sb (A,C)+ d DCJ sb (B,C)+2 k |D|, whereD is the set of markers common only to A and B.

In order to find the minimum value of k for which the inequality of Proposition 6 holds, we need to determine the diameter of the DCJ-substitution distance, that is given by the following lemma.

Lemma 5. If A and B are genomes with n common markers, then

d DCJ sb ( A , B ) ( w + 2 ) 2 n + w ( L A + S A + L B + S B ) ,

where L A , S A , L B and S B are, respectively, the number of linear chromosomes and circular singletons in genomes A and B.

Proof. Let |P| be the number of vertices in component P, that is DCJ-sorted with | P | 1 2 DCJs[14]. If |P| is even, then P can be DCJ-sorted with | P | 2 1 DCJs. We have to analyze two cases: (i) if |P|=4x+4, thenσ(P) | P | 4 +1 and d DCJ sb (P)( | P | 2 1)+w( | P | 4 +1)= ( w + 2 ) | P | 4 +w1; (ii) if |P|=4x+2, thenσ(P) | P | 2 4 +1 and d DCJ sb (P)( | P | 2 1)+w( | P | 2 4 +1)= ( w + 2 ) | P | 4 + w 2 2 . As w≤1 implies w 2 2 w10. If |P| is odd, then P is an AA- or a BB-path and can be DCJ-sorted with | P | 1 2 DCJs. Again, we have to analyze two cases: (i) if |P|=4x+3, thenσ(P) | P | + 1 4 and d DCJ sb (P) | P | 1 2 +w( | P | + 1 4 )= ( w + 2 ) | P | 4 + w 2 4 ; (ii) if |P|=4x+1, thenσ(P) | P | + 3 4 and d DCJ sb (P) | P | 1 2 +w( | P | + 3 4 )= ( w + 2 ) | P | 4 + 3 w 2 4 . In this last case we could have d DCJ sb (P)> ( w + 2 ) | P | 4 . Observe however that the numbers of AA- and BB-paths are bounded, respectively, by L A and L B . Then, d DCJ sb (A,B) P A G ( A , B ) d DCJ sb (P) P A G ( A , B ) ( w + 2 ) | P | 4 + ( 3 w 2 ) ( L A + L B ) 4 = w + 2 4 P A G ( A , B ) |P|+ ( 3 w 2 ) ( L A + L B ) 4 . From[12] we know that P A G ( A , B ) |P|=2n+ L A + S A + L B + S B . Therefore, d DCJ sb (A,B) w + 2 4 (2n+ L A + S A + L B + S B )+ ( 3 w 2 ) ( L A + L B ) 4 = ( w + 2 ) 2 n+w( L A + L B )+ ( w + 2 ) ( S A + S B ) 4 ( w + 2 ) 2 n+w( L A + L B + S A + S B ). □

We are ready to generalize the result of[12], and determine the minimum possible value ofk.

Theorem 5. For any positive substitution cost w≤1, the function msbsatisfies the triangular inequality if and only if k w + 2 4 .

Proof. Recall that, to prove the triangular inequality for msb, we only need to find a k such that d DCJ sb (A,B) d DCJ sb (A,C)+ d DCJ sb (B,C)+2 k |D| holds (Proposition 6). We know that the inequality holds whenD=[12]. It remains to examine the case in whichD. The worst case would be to have an empty genome C[12]. Let X A and X B be the number of chromosomes in A and B. Since C is empty, we know that d DCJ sb (A,C)=w X A and d DCJ sb (B,C)=w X B . From Lemma 5, we have d DCJ sb (A,B) ( w + 2 ) | D | 2 +w( L A + S A + L B + S B ). This gives ( w + 2 ) | D | 2 +w( L A + S A + L B + S B )w( X A + X B )+2 k |D|. Since L A + S A + L B + S B X A + X B , we have ( w + 2 ) | D | 2 2 k |D|, which holds for any k w + 2 4 .

For the necessity, take A and B with n common markers, for n even, and let each genome be composed of one circular chromosome, meaning that we have one adjacency per common marker in each genome (or n adjacencies per genome). Then let A G(A, B) have one single cycle with 2n vertices and let each vertex be labeled, so that the number of runs in the cycle is 2n and the number of unique markers in each genome is n. Thus, we have d DCJ sb (A,B)=(n1)+w( n 2 +1)= ( w + 2 ) n 2 +(w1) and the corrected distance is m sb (A,B)= ( w + 2 ) n 2 +(w1)+2 k n. Take C as an empty genome, so that d DCJ sb (A,C)= d DCJ sb (B,C)=w and msb(A, C)=msb(B, C)=w + 2kn. The inequality msb(A, B)≤msb(A, C) + msb(B, C) corresponds to ( w + 2 ) n 2 +(w1)+2 k n2w+4 k n or, equivalently,2 k n( w + 2 2 )nw1, that is k w + 2 4 w + 1 2 n , which holds for all n only if k w + 2 4 . □

Conclusions

In this work we have presented methods to compute in linear time the DCJ-indel and DCJ-substitution distances between two genomes without duplicated markers, when the content-modifying cost is distinct from and upper bounded by the DCJ cost. Content-modifying operations can be applied to pieces of DNA of any size, and a side effect of this property is that the triangular inequality does not hold for our distance formulas. However we have shown that an a posteriori correction can be applied to establish the triangular inequality in both DCJ-indel and DCJ-substitution distances.

References

  1. Hannenhalli S, Pevzner P: Transforming men into mice (polynomial algorithm for genomic distance problem). 36th Annual IEEE Symposium on Foundations of Computer Science. 1995, 581-592.

    Google Scholar 

  2. Yancopoulos S, Attie O, Friedberg R: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005, 21: 3340-3346.

    Article  CAS  PubMed  Google Scholar 

  3. Bergeron A, Mixtacki J, Stoye J: A unifying view of genome rearrangements. Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 4175. Springer,2006, 163-173.

    Google Scholar 

  4. El-Mabrouk N: Sorting signed permutations by reversals and insertions/deletions of contiguous segments. J Discrete Algorithms. 2001, 1: 105-122.

    Google Scholar 

  5. Hannenhalli S, Pevzner P: Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM. 1999, 46: 1-27. 10.1145/300515.300516.

    Article  Google Scholar 

  6. Yancopoulos S, Friedberg R: DCJ path formulation for genome transformations which include insertions, deletions, and duplications. J Comput Biol. 2009, 16 (10): 1311-1338.

    Article  CAS  PubMed  Google Scholar 

  7. Braga MDV, Willing E, Stoye J: Double Cut and Join with Insertions and Deletions. J Comput Biol. 2011, 18 (9): 1167-1184. (a preliminary version appeared in proceedings of WABI 2010, LNBI vol. 6293, p. 90–101)

    Article  CAS  PubMed  Google Scholar 

  8. Braga MDV, Machado R, Ribeiro LC, Stoye J: Genomic distance under gene substitutions. BMC Bioinformatics. 2011, 12 (Suppl 9): S8.

    Article  PubMed Central  PubMed  Google Scholar 

  9. Boore JL: The duplication/random loss model for gene rearrangement exemplified by mitochondrial genomes of deuterostome animals. Comparative Genomics, Volume 1.Springer,2000, 133-147.

    Chapter  Google Scholar 

  10. Moritz C, Dowling TE, Brown WM: Evolution of animal mitochondrial DNA: relevance for population biology and systematics. Annual Review of Ecology and Systematics, Volume 18. 1987, 269-292. Annual Reviews Inc

    Google Scholar 

  11. Blanc G, Ogata H, Robert C: Reductive genome evolution from the mother of Rickettsia. PLoS Genet. 2007, 3: e14.

    Article  PubMed Central  PubMed  Google Scholar 

  12. Braga MDV, Machado R, Ribeiro LC, Stoye J: On the weight of indels in genomic distances. BMC Bioinformatics. 2011, 12 (Suppl 9): S13.

    Article  PubMed Central  PubMed  Google Scholar 

  13. da Silva PH, Braga MDV, Machado R, Dantas S: DCJ-indel distance with distinct operation costs. Algorithms in Bioinformatics, Lecture Notes in Computer Science, Volume 7534. Springer,2012, 378-390.

    Google Scholar 

  14. Braga MDV, Stoye J: The solution space of sorting by DCJ. J Comput Biol. 2010, 17 (9): 1145-1165.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

This research was partially supported by the Brazilian research agencies CNPq and FAPERJ.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Simone Dantas.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

PHS, RM, SD and MDVB have elaborated the model, proved the results and written the paper. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

da Silva, P.H., Machado, R., Dantas, S. et al. DCJ-indel and DCJ-substitution distances with distinct operation costs. Algorithms Mol Biol 8, 21 (2013). https://doi.org/10.1186/1748-7188-8-21

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1748-7188-8-21

Keywords