Email updates

Keep up to date with the latest news and content from Algorithms for Molecular Biology and BioMed Central.

Open Access Research

A mixed integer linear programming model to reconstruct phylogenies from single nucleotide polymorphism haplotypes under the maximum parsimony criterion

Daniele Catanzaro1*, Ramamoorthi Ravi2 and Russell Schwartz3

Author Affiliations

1 Graphes et Optimisation Mathématique (G.O.M.), Computer Science Department, Université Libre de Bruxelles (U.L.B.), Boulevard du Triomphe, CP 210/01, B-1050, Brussels, Belgium

2 Tepper School of Business, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3890

3 Department of Biological Sciences and Lane Center for Computational Biology, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA, 15213

For all author emails, please log on.

Algorithms for Molecular Biology 2013, 8:3  doi:10.1186/1748-7188-8-3

The electronic version of this article is the complete one and can be found online at: http://www.almob.org/content


Received:13 June 2012
Accepted:2 January 2013
Published:23 January 2013

© 2013 Catanzaro et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Phylogeny estimation from aligned haplotype sequences has attracted more and more attention in the recent years due to its importance in analysis of many fine-scale genetic data. Its application fields range from medical research, to drug discovery, to epidemiology, to population dynamics. The literature on molecular phylogenetics proposes a number of criteria for selecting a phylogeny from among plausible alternatives. Usually, such criteria can be expressed by means of objective functions, and the phylogenies that optimize them are referred to as optimal. One of the most important estimation criteria is the parsimony which states that the optimal phylogeny Tfor a set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M1">View MathML</a>of n haplotype sequences over a common set of variable loci is the one that satisfies the following requirements: (i) it has the shortest length and (ii) it is such that, for each pair of distinct haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M2">View MathML</a>, the sum of the edge weights belonging to the path from hi to hj in T is not smaller than the observed number of changes between hi and hj. Finding the most parsimonious phylogeny for <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M3">View MathML</a>involves solving an optimization problem, called the Most Parsimonious Phylogeny Estimation Problem (MPPEP), which is <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M4">View MathML</a>-hard in many of its versions.

Results

In this article we investigate a recent version of the MPPEP that arises when input data consist of single nucleotide polymorphism haplotypes extracted from a population of individuals on a common genomic region. Specifically, we explore the prospects for improving on the implicit enumeration strategy of implicit enumeration strategy used in previous work using a novel problem formulation and a series of strengthening valid inequalities and preliminary symmetry breaking constraints to more precisely bound the solution space and accelerate implicit enumeration of possible optimal phylogenies. We present the basic formulation and then introduce a series of provable valid constraints to reduce the solution space. We then prove that these constraints can often lead to significant reductions in the gap between the optimal solution and its non-integral linear programming bound relative to the prior art as well as often substantially faster processing of moderately hard problem instances.

Conclusion

We provide an indication of the conditions under which such an optimal enumeration approach is likely to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of some harder instances that resist all prior approaches.

Keywords:
Combinatorial optimization; Exact algorithms; Mixed integer programming; Phylogeny estimation; Haplotype estimation; Maximum parsimony; Single nucleotide polymorphism; Computational biology

Background

Molecular phylogenetics studies the hierarchical evolutionary relationships among species, or taxa, by means of molecular data such as DNA, RNA, amino acid or codon sequences. These relationships are usually described through a weighted tree, called a phylogeny, whose leaves represent the observed taxa, internal vertices represent the intermediate ancestors, edges represent the estimated evolutionary relationships, and edge weights represent measures of the similarity between pairs of taxa.

Accurately characterizing evolutionary relationships between organisms and their genomes is the basis of comparative genomic methods for interpreting meaning in sequence data, and for this reason the use of molecular phylogenetics has become widely used (and sometimes indispensable) in a multitude of research fields such as systematics, medical research, drug discovery, epidemiology, and population dynamics [3]. For example, the use of molecular phylogenetics was of considerable assistance in predicting the evolution of human influenza A [4], understanding the relationships between the virulence and the genetic evolution of HIV [5,6], identifying emerging viruses as SARS [7], recreating and investigating ancestral proteins [8], designing neuropeptides causing smooth muscle contraction [9], and relating geographic patterns to macroevolutionary processes [10].

The literature on molecular phylogenetics proposes a number of criteria for selecting the phylogeny of a set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M5">View MathML</a>of haplotypes extracted from n taxa from among plausible alternatives. The criteria can usually be quantified and expressed in terms of objective functions, giving rise to families of optimization problems whose general paradigm can be stated as follows [11]:

Problem 1

The Phylogeny Estimation Problem (PEP)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M6">View MathML</a>

where T a phylogeny of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M7">View MathML</a><a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M8">View MathML</a>the set of all possible phylogenies of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M9">View MathML</a><a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M10">View MathML</a> a function modeling the selected criterion of phylogeny estimation, and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M11">View MathML</a> is a characteristic function equal to one if a phylogeny T is compatible (according to the selected criterion) for the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M12','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M12">View MathML</a> A specific optimization problem is completely characterized by defining the functions f and g, and the phylogeny T that optimizes f and satisfies g is referred to as optimal.

One of the classic criteria for phylogeny estimation is the parsimony criterion, which assumes that one taxon evolves from another by means of small changes and that the most plausible phylogeny will be that requiring the smallest number of changes. That evolution proceeds by small rather than smallest changes is due to the fact that the neighborhood of possible alleles that are selected at each instant of the life of a taxon is finite and, perhaps more important, that the selective forces acting on the taxon may not be constant throughout its evolution [12,13]. Over the long term (periods of environmental change, including the intracellular environment), the accumulation of small changes will not generally correspond to the smallest possible set of changes consistent with the observed final sequences. Nevertheless, it is plausible to believe, at least for well-conserved molecular regions where mutations are reasonably rare and unlikely to have occurred repeatedly at any given variant locus, that the process of approximating small changes with smallest changes could provide a good approximation to the true evolutionary process of the observed set of taxa [14]. Such an assumption is likely to be reasonable, for example, in intraspecies phylogenetics, where few generations have elapsed since the observed taxa shared a common ancestor and thus the expected number of mutations per locus is much less than one. When such assumtions hold, a phylogeny of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M13','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M13">View MathML</a>is defined to be optimal under the parsimony criterion if it satisfies the following requirements: (i) it has the shortest length, i.e., the minimum sum of the edge weights, and (ii) it is such that, for each pair of distinct haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M14">View MathML</a> the sum of the edge weights belonging to the path from hi to hj in T is not smaller than the observed number of changes between hi and hj[11]. The first condition imposes the assumption that the smallest number of mutations consistent with the observed sequences is a good approximation to the true accumulated set of mutations; the second condition correlates the edge weights to the observed data.

The parsimony assumption can be considered accurate in the limit of low mutation rates or short time scales, making it a reasonable model for situations such as analysis of intraspecies variation where little time is presumed to have elapsed since the existence of a common ancestor of all observed taxa. Maximum parsimony also remains valuable as a model for novel methodology development in phylogenetics because of its relative simplicity and amenity to theoretical analysis. Novel computational strategies, such as those developed in this paper, might therefore productively be developed and analyzed in the context of maximum parsimony before being extended to more complicated models of phylogenetics.

Finding the phylogeny that satisfies the parsimony criterion involves solving a specific version of the PEP, called the Most Parsimonious Phylogeny Estimation Problem (MPPEP). Some of the variants of the MPPEP, see e.g., [15,16], can be solved in polynomial time, however, in the most general case, the problem is <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M15">View MathML</a>-hard [11,17] and this fact has justified the development of a number of exact and approximate solution approaches, such those described in [11,17,18]. Some recent versions of the MPPEP, such as the Most Parsimonious Phylogeny Estimation Problem from SNP haplotypes (MPPEP-SNP) investigated in this article, play a fundamental role in providing predictions of practical use in several medical bioinformatics applications, such as disease association studies [19] or reconstruction of tumor phylogenies [20,21]. In these contexts, it would be highly desirable to have the most accurate inferences possible for specific applications, but this in turn would imply to have algorithms able to exactly solve instances of such versions. As regards the MPPEP-SNP, the literature describes some (rare) circumstances for which it is possible to solve the problem in polynomial time (see Section Methods). Unfortunately, in the general case the MPPEP-SNP is <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M16">View MathML</a>-hard and solving provably to optimality therefore generally requires the use of exact approaches based on implicit enumeration algorithms, similar to the mixed integer programming strategies described in [1,2,22].

In this article, we explore the prospects for improving on the implicit enumeration strategy of [1,2] using a novel problem formulation and a series of additional constraints to more precisely bound the solution space and accelerate implicit enumeration of possible optimal phylogenies. We present a formulation for the problem based on an adaptation of [23]’s mixed integer formulation for the Steiner tree problem extended with a number of preprocessing techniques and reduction rules to further decrease its size. We then show that it is possible to exploit the high symmetry inherent in most instances of the problem, through a series of strengthening valid inequalities, to eliminate redundant solutions and reduce the practical search space. We demonstrate through a series of empirical tests on real and artificial data that these novel insights into the symmetry of the problem often leads to significant reductions in the gap between the optimal solution and its non-integral linear programming bound relative to the prior art as well as often substantially faster processing of moderately hard problem instances. More generally, the work provides an indication of the conditions under which such an optimal enumeration approach is likely to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of some harder instances that resist all prior approaches. In future work, it may provide useful guidance for strategies and prospects of similar optimization methods for other variants of phylogeny inference.

Methods

Notation and problem formulation

Before proceeding, we shall introduce some notation and definitions that will prove useful throughout the article. The human genome is divided in 23 pairs of chromosomes, i.e., organized structures of DNA that contain many genes, regulatory elements and other nucleotide sequences. When a nucleotide site of a specific chromosome region shows a variability within a population of individuals then it is called a Single Nucleotide Polymorphism (SNP). Specifically, a site is considered a SNP if for a minority of the population a certain nucleotide is observed (called the minor allele) while for the rest of the population a different nucleotide is observed (the major allele). The minor allele, or mutant type[24], is generally encoded as ‘1’, as opposed to the major allele, or wild type[24], generally encoded as ‘0’. A haplotype is a set of alleles, or more formally, a string of length m over an alphabet Σ = {0,1} [25].

Given a set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M17">View MathML</a>of n haplotypes, denote <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M18">View MathML</a> as the set of alleles and hi(s), <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M19">View MathML</a>, as the s-th allele of haplotype <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M20">View MathML</a>. Given two distinct haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M21">View MathML</a> we denote <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M22">View MathML</a> as the subset of different alleles between hi and hj, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M23">View MathML</a> as the distance between hi and hj, and we say that hi and hj are adjacent if <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M24">View MathML</a>. From a biological point of view, the adjacency between a pair of distinct haplotypes means that one of the two haplotypes evolved from the other by mutation of a specific SNP over time.

Consider a graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M25">View MathML</a> having a vertex for each haplotype in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M26">View MathML</a>and an edge for each pair of adjacent haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M27">View MathML</a> Then, a phylogeny T of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M28">View MathML</a>is a spanning tree of G, i.e., an acyclic subgraph of G in which a pair of vertices <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M29">View MathML</a> is adjacent in T if <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M30">View MathML</a>. It is worth noting that, according to the definition of the edge set E, in general a phylogeny of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M31">View MathML</a>may not exist as the graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M32">View MathML</a> might not be connected. To always ensure the existence of a phylogeny for <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M33">View MathML</a> we introduce the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M34">View MathML</a> which consists of the minimum number of haplotypes that should be added to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M35">View MathML</a>in such a way that, defined <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M36">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M37">View MathML</a>, the graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M38">View MathML</a> is connected. From a biological point of view, the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M39">View MathML</a> represents the set of haplotypes that are ancestors of the observed ones but either had gone extinct or just were not observed in that sample (also called Steiner nodes).

Denote <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M40">View MathML</a> as a phylogeny of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M41">View MathML</a>, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M42">View MathML</a> as the edge set of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M43">View MathML</a>, and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M44">View MathML</a> as the length of the phylogeny <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M45">View MathML</a>, i.e., the sum of the distances <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M46">View MathML</a>, for all <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M47">View MathML</a>. Then, the problem of finding a phylogeny of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M48">View MathML</a>that satisfies the parsimony criterion consists of solving the following optimization problem:

Problem 2

The Most Parsimonious Phylogeny Estimation Problem from SNP haplotypes (MPPEP-SNP).Given a set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M49">View MathML</a>of n haplotypes having m alleles each, find the minimum cardinality haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M50">View MathML</a> to be added to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M51">View MathML</a>so that the phylogeny <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M52">View MathML</a> has minimum length.

If the haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M53">View MathML</a> is empty, i.e., if <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M54">View MathML</a> is connected, then MPPEP-SNP can be solved in polynomial time as the minimum spanning tree is a (optimal) solution to the MPPEP-SNP. Similarly, if the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M55">View MathML</a>satisfies the perfect phylogeny condition i.e., the requirement that each allele changes only once throughout the optimal phylogeny (see [19]), then the MPPEP-SNP can be still solved in polynomial time [26-28]. Unfortunately, it is possible to prove that in the general case the MPPEP-SNP is <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M56">View MathML</a>-hard (see [1,22]). In fact, the binary nature of the SNP haplotypes allows us to interpret a generic haplotype <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M57">View MathML</a> as a vertex of a m-dimensional unit hypercube, its s-th allele as the s-th coordinate of the vertex hi, and the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M58">View MathML</a> as the set of Steiner vertices of the unit hypercube. Hence the MPPEP-SNP can be seen as particular case of the Steiner tree problem in a graph, a notorious <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M59">View MathML</a>-hard combinatorial optimization problem [29].

Finding the optimal solutions to the MPPEP-SNP is fundamental to validating the parsimony criterion, i.e., to verify whether, for a given instance of MPPEP-SNP, the results predicted by the criterion match the experimental ones. Unfortunately, the <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M60">View MathML</a>-hardness of the MPPEP-SNP limits the size of the instances analyzable to the optimum, which in turn affects the ability to validate the parsimony criterion, hence the practical utility of the predictions themselves. In order to address this concern, in the following section we shall develop an integer programming model able to provide optimal solutions to real instances of the MPPEP-SNP.

A mixed integer programming model for the MPPEP-SNP

Let <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M61">View MathML</a> the set of potential vertices of a phylogeny <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M62">View MathML</a> of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M63">View MathML</a>and assume the convention to denote the n haplotypes in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M64">View MathML</a>as the first n vertices in V. The first task in designing an integer programming model for the MPPEP-SNP that uses a polynomial-size number of variables consists of characterizing V, i.e., determining an upper and a lower bound on the cardinality of the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M65">View MathML</a>. In fact, observe that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M66">View MathML</a> contains potentially an exponential number of haplotypes, namely all vertices of the unit hypercube that belong to the set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M67">View MathML</a>. However, we can easily bound the cardinality of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M68">View MathML</a> by means of the following approach. Consider the complete graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M69">View MathML</a>, where <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M70">View MathML</a>, and construct a minimum spanning tree <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M71">View MathML</a> of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M72">View MathML</a> Denote <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M73">View MathML</a> as the set of edges (hi,hj) of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M74">View MathML</a>. Then, an upper bound UB on the overall number of Steiner vertices of the optimal phylogeny <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M75">View MathML</a> can be obtained by computing the sum

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M76">View MathML</a>

Similarly, denote <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M77">View MathML</a>, a lower bound LB on the overall number of Steiner vertices of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M78">View MathML</a> can be obtained as [30,31]:

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M79">View MathML</a>

Denote ui, i ∈ V, as a decision variable equal to 1 if the i-th vertex of V is considered in the optimal solution to the MPEPP-SNP and 0 otherwise; <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M80','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M80">View MathML</a> as a decision variable equal to 1 if in the optimal solution to the MPPEP-SNP the s-th coordinate of the vertex ui, i ∈ V, is 1 and 0 otherwise; <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M81','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M81">View MathML</a> as a decision variable equal to 1 if in the optimal solution to the MPPEP-SNP the pair of distinct vertices i,j ∈ V has a change at s-th coordinate, and 0 otherwise; and yij as a decision variable equal to 1 if the pair of distinct vertices i,jVis adjacent in the optimal solution to the MPPEP-SNP and 0 otherwise. Finally, let <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M82','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M82">View MathML</a>, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M83','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M83">View MathML</a>, and Q = {1,2,…,n + LB}. Then, a valid formulation for the MPPEP-SNP is the following:

Formulation 1

Basic Model

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M84','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M84">View MathML</a>

(1a)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M85','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M85">View MathML</a>

(1b)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M86','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M86">View MathML</a>

(1c)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M87','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M87">View MathML</a>

(1d)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M88','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M88">View MathML</a>

(1e)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M89','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M89">View MathML</a>

(1f)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M90','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M90">View MathML</a>

(1g)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M91','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M91">View MathML</a>

(1h)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M92','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M92">View MathML</a>

(1i)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M93','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M93">View MathML</a>

(1j)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M94','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M94">View MathML</a>

(1k)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M95','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M95">View MathML</a>

(1l)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M96','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M96">View MathML</a>

(1m)

The objective function (1a) aims at minimizing the length of the optimal phylogeny. Constraints (1b) impose that the coordinates of the first n vertices in V are exactly the ones of the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M97','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M97">View MathML</a> Constraints (1c) impose that the s-th coordinate of vertex ui, i ∈ V, can assume value 1 only if vertex ui is considered in the optimal solution to the problem. Constraints (1d)-(1e) force variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M98','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M98">View MathML</a> to be one if in the optimal solution to the problem there exist a pair of adjacent vertices ij ∈ V having a different value at the s-th coordinate. Constraints (1f) impose that in an optimal solution to the problem two distinct vertices ij ∈ V can be adjacent only if <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M99','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M99">View MathML</a>. Constraints (1g)-(1h) impose that in the optimal solution to the problem variable yij may assume value 1 only if both vertices i and j are considered. Vice versa, constraints (1i) impose that if in the optimal solution to the problem a vertex ui, i ∈ V, is considered then at least one variable yij must assume value 1. Constraints (1j) and (1k) impose the Generalized Subtour Elimination Constraints (GSEC) [23]. Finally, constraints (1l) impose that the first n + LBvertices in V have to be considered in the optimal solution to the problem.

Note that Formulation 1 can be easily extended to the case in which the haplotypes are represented by multi-character data, i.e., sequences over an alphabet Σ = {0,1,2,…,γ}. In fact, in such a case it is sufficient to replace each character c in the haplotype by a binary γ -vector ν such that the s-th coordinate of ν is equal to 1 if the character c is equal to s, s ∈ Σ, and 0 otherwise. For example, if the generic haplotype were, for example, the string 〈AACGT〉, then it could be represented as 〈1000 1000 0100 0010 0001〉.

Reducing model size

Formulation 1 is characterized by a polynomial number of variables and an exponential number of constraints. Its implementation can be performed by means of standard branch-and-cut approaches that use GSEC separation oracles such as those described in [32].

It is worth noting that variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M100','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M100">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M101','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M101">View MathML</a> can be relaxed in Formulation 1c)-(1e) and the convexity constraint (1f) suffice to guarantee their integrality in any optimal solution to the problem. Moreover, Formulation 1 can be reduced in size by removing those variables that are redundant or whose value is known in the optimal solution to the problem. For example, variables yij can be removed from Formulation 1 as it is easy to realize that they are redundant. Similarly, all variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M102','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M102">View MathML</a> such that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M103','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M103">View MathML</a> and dij > 1 do not need to be defined as their value will be always zero for any <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M104','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M104">View MathML</a> and in any feasible solution to the problem. Variables ui, i ∈ Q, do not need to be declared as their value will be always 1 any feasible solution to the problem. Finally, variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M105','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M105">View MathML</a>, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M106','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M106">View MathML</a>, can be removed as their value is univocally assigned by the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M107','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M107">View MathML</a> The reduction process can be further combined with the preprocessing strategies described in [1] to obtain even smaller formulations. Such strategies allow one to remove alleles from the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M108','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M108">View MathML</a>without altering the optimal solution to the problem. For example, suppose that the haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M109','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M109">View MathML</a>is such that there exists an allele <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M110','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M110">View MathML</a> such that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M111','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M111">View MathML</a>, for all <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M112','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M112">View MathML</a>; then it is easy to realize that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M113','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M113">View MathML</a>can be removed from <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M114','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M114">View MathML</a>as in any feasible solution to the problem the <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M115','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M115">View MathML</a> coordinate of any vertex in the phylogeny would be characterized by having xiŝ = 1. A similar situation occurs when there exists an allele <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M116','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M116">View MathML</a> such that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M117','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M117">View MathML</a>, for all <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M118','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M118">View MathML</a>. Analogously, suppose that the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M119','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M119">View MathML</a>is characterized by equal alleles, i.e., by the existence of two alleles, say <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M120','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M120">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M121','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M121">View MathML</a>, such that <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M122','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M122">View MathML</a>, for all <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M123','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M123">View MathML</a>. Then it is easy to realize that if one removes one of the two alleles from <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M124','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M124">View MathML</a> say <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M125','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M125">View MathML</a>, and multiplies the <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M126','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M126">View MathML</a>-th coordinate by 2 does not alter neither the structure nor the value of the optimal solution to the problem. Describing all the preprocessing techniques for shrinking the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M127','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M127">View MathML</a>is beyond the scope of the present article. The interested reader will find a systematic discussion of this topic in [1].

By applying the previously cited reduction strategies to Formulation 1 and denoting <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M128','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M128">View MathML</a>as the set of alleles resulting from the application of the preprocessing strategies described in [1], ws as the number of alleles in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M129','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M129">View MathML</a>equal to the s-th, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M130','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M130">View MathML</a>, Z as the set for which variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M131','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M131">View MathML</a> are defined, R = {n + LB + 1,n + LB + 2,…,n + UB}, and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M132','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M132">View MathML</a>, for any C ⊂ V, the following reduced formulation for the MPPEP-SNP can be obtained:

Formulation 2

Reduced Model

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M133','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M133">View MathML</a>

(2a)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M134','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M134">View MathML</a>

(2b)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M135','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M135">View MathML</a>

(2c)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M136','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M136">View MathML</a>

(2d)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M137','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M137">View MathML</a>

(2e)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M138','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M138">View MathML</a>

(2f)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M139','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M139">View MathML</a>

(2g)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M140','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M140">View MathML</a>

(2h)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M141','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M141">View MathML</a>

(2i)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M142','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M142">View MathML</a>

(2j)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M143','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M143">View MathML</a>

(2k)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M144','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M144">View MathML</a>

(2l)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M145','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M145">View MathML</a>

(2m)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M146','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M146">View MathML</a>

(2n)

Note that in Formulation 2 variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M147','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M147">View MathML</a> and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M148','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M148">View MathML</a> cannot be relaxed anymore.

Strengthening valid inequalities

By exploiting the integrality of variables ui, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M149','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M149">View MathML</a>, and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M150','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M150">View MathML</a>, a number of valid inequalities can be developed to strengthen Formulation 2.

Proposition 1

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M151','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M151">View MathML</a>

(3)

are valid for Formulation 2.

Proof

In a feasible solution to the problem variable ui, i ∈ V∖(Q ∪ {n + UB}), can assume only value 0 or 1. If ui = 0, constraint (3) reduces to ui + 1 ≤ 0 which is trivially valid for Formulation 2. If ui = 1, constraint (3) reduces to ui + 1 ≤ 1 which is again valid. □

Constraints (3) impose an ordering on the variables ui, i ∈ R, so that vertex ui + 1 can be considered in the optimal solution to the problem only if vertex ui has been already considered.

Proposition 2

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M152','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M152">View MathML</a>

(4)

are valid for Formulation 2.

Proof

In a feasible solution to the problem a vertex ui, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M153','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M153">View MathML</a>, cannot be a terminal vertex. In fact, if such a condition held, a cheaper solution could be obtained by dropping ui from <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M154','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M154">View MathML</a>, contradicting the optimality of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M155','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M155">View MathML</a> itself. Hence, the degree of any vertex in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M156','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M156">View MathML</a> must be at least 2. Now, in a feasible solution to the problem variables ui ∈ {0,1}. If ui = 0, constraint (4) reduces to

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M157','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M157">View MathML</a>

which is trivially valid. Vice versa, if ui = 1, constraint (4) reduces to

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M158','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M158">View MathML</a>

which is again valid for the above arguments. □

Proposition 3

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M159','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M159">View MathML</a>

(5)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M160','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M160">View MathML</a>

(6)

are valid for Formulation 2.

Proof

As observed in the previous proposition, in a feasible solution to the problem <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M161','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M161">View MathML</a>, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M162','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M162">View MathML</a>, i,jZ, can assume only value 0 or 1. If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M163','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M163">View MathML</a>, then constraint (5) (respectively constraint (6)) reduces to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M164','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M164">View MathML</a> (respectively <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M165','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M165">View MathML</a>), which is trivially valid due to the integrality of variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M166','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M166">View MathML</a>. If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M167','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M167">View MathML</a>, then either <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M168','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M168">View MathML</a> or <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M169','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M169">View MathML</a>. If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M170','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M170">View MathML</a> then constraint (5), (respectively constraint (6)) reduces to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M171','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M171">View MathML</a> (respectively <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M172','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M172">View MathML</a>), which is trivially valid. If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M173','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M173">View MathML</a> then constraint (5) (respectively constraint (6)) reduces to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M174','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M174">View MathML</a> (respectively <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M175','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M175">View MathML</a>), which is again valid. □

Similar arguments can be used to prove the following proposition:

Proposition 4

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M176','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M176">View MathML</a>

(7)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M177','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M177">View MathML</a>

(8)

are valid for Formulation 2.

Given an input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M178','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M178">View MathML</a>and a pair of non-adjacent haplotypes hi and hj, there exit multiple equivalent paths that may connect hi and hj in the unary hypercube. This characteristic constitutes the principal class of symmetries for the MPPEP-SNP and may lead to poor relaxations for the problem. For example, suppose that the input haplotype set <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M179','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M179">View MathML</a>is constituted by haplotypes h1 = 〈00〉 and h2 = 〈11〉. Then a possible solution to the instance may consist either of a star centered in haplotype h3 = 〈10〉 or a star centered in haplotype h3 = 〈01〉(see Figure 1). Note that both solutions are feasible and optimal for the specific instance. A possible strategy to break this class of symmetries consists of imposing the following constraints:

thumbnailFigure 1. An example of two symmetric paths linking haplotypes00and11.

Proposition 5

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M180','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M180">View MathML</a>

(9)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M181','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M181">View MathML</a>

(10)

are valid for Formulation 2.

Proof

The statement trivially follows from the integrality of variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M182','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M182">View MathML</a> and from constraints (2b). □

Constraints (9)-(10) impose an ordering on the coordinates of the vertices in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M183','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M183">View MathML</a> by means of the smallest big-M possible, i.e., a power of 2. Note that the distinction between constraints (9) and (10) is necessary, as in principle vertices in R may not be needed in the optimal solution to the problem.

Proposition 6

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M184','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M184">View MathML</a>

(11)

are valid for Formulation 2.

Proof

In a feasible solution to the problem, the sum <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M185','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M185">View MathML</a>, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M186','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M186">View MathML</a>, i,jZ, can assume only value 0 or 1. If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M187','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M187">View MathML</a>, constraint (11) reduces to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M188','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M188">View MathML</a> which is trivially valid. Vice versa, If <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M189','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M189">View MathML</a>, constraint (11) reduces to <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M190','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M190">View MathML</a> which is again valid due to Propositions 1 and 2. □

Proposition 6 forces vertices in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M191','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M191">View MathML</a> to be sorted according to a decreasing degree order. In this way, it is possible to avoid the occurrence of symmetric solutions to the problem differing just for the degree of the Steiner vertices (see e.g., Figure 2).

thumbnailFigure 2. An example of two symmetric solution to the MPPEP-SNP.

Let <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M192','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M192">View MathML</a> and k ∈ V, k ∉ Q3. Then the following proposition holds:

Proposition 7

Constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M193','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M193">View MathML</a>

(12)

are valid for Formulation 2.

Proof

In a feasible solution to the problem the path between two distinct haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M194','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M194">View MathML</a> cannot be shorter than the distance <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M195','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M195">View MathML</a>. Hence, if the distance between hiand hjis greater or equal to three, vertices i and j cannot be adjacent to a same vertex k, i.e., only one of the two sums <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M196','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M196">View MathML</a> or <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M197','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M197">View MathML</a> can be equal to 1. □

Note that if k ∈ R then (12) can be strengthened by replacing the right-hand-side by uk. Moreover, Proposition 7 can be generalized as follows. Consider the sets <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M198','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M198">View MathML</a>, d ∈ {3,4,…,m}, C ⊂ V such that 2 ≤ |C| ≤ d − 1 and C ∩ Qd = , and a path p that involves only vertices in C. Denote pk the k-th vertex in p. Then the following proposition holds:

Proposition 8

The family of constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M199','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M199">View MathML</a>

(13)

called forbidden path constraints, are valid for Formulation 2.

Proof

Similarly to Proposition 7, in a feasible solution to the problem the path p between two distinct haplotypes <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M200','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M200">View MathML</a> cannot be shorter than the distance <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M201','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M201">View MathML</a>. Hence, if the distance between hi and hj is greater or equal to d, at most |C| vertices can belong to p. □

Experiments

In this section we analyze the performance of our model to solve the MPPEP-SNP. Our experiments were motivated by a twofold reason, namely: to evaluate, with respect to Formulation 1, the benefits obtained by the removal of the redundant variables and by the inclusion of the valid inequalities previously described; and to allow the analysis of larger datasets with respect to the ones analyzable by means of [1]’s algorithm, currently the best known exact approach to solution of the MPPEP-SNP.

Similar to [1], we emphasize that the experiments aim simply to evaluate the runtime performance of our model for solving MPPEP-SNP. We neither attempt to study the efficiency of MPPEP-SNP for phylogeny estimation nor compare the accuracy of our algorithm to phylogeny estimation solvers that do not use the parsimony criterion. The reader interested in a systematic discussion about such issues is referred to [19,33].

Implementation

We implemented Formulations 1 and 2 by means of Mosel 64 bit 3.2.0 of Xpress-MP, Optimizer version 22, running on a Pentium 4, 3.2 GHz, equipped with 2 GByte RAM and operating system Gentoo release 7 (kernel linux 2.6.17). In both formulations, we computed the overall solution time to solve a generic instance of the problem as the sum of the preprocessing time due to the implementation of [22]’s reduction rules plus the solution time taken by the Optimizer to find the optimal solution to the instance. In preliminary experiments, we observed that Formulation 2 has two main advantages with respect to Formulation 1, namely: it requires much less memory to load the model (at least 27% RAM less in the analyzed instances) and it is characterized by faster linear relaxations at each node of the search tree. As result, Formulation 2 allows potentially the analysis of larger instances than Formulation 1 and may be characterized by faster solution times. Hence, we preferred to use Formulation 2 in our experiments.

We considered two different implementations of Formulation 2, namely: the GESC-based Reduced Model (GSEC-RM) and the Flow-based Reduced Model (Flow- RM). The GESC-RM consists of Formulation 2 strengthened by the valid inequalities previously described. The Flow-RM consists of Formulation 2 strengthened by the valid inequalities and such that the GSEC are replaced by a multi-commodity flows. Specifically, by denoting <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M202','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M202">View MathML</a> as a decision variable equal to one if there exists a flow from vertex 1 to vertex <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M203','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M203">View MathML</a> passing through edge <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M204','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M204">View MathML</a> and 0 otherwise, the Flow-RM can be obtained by replacing constraints 2l) with:

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M205','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M205">View MathML</a>

(14)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M206','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M206">View MathML</a>

(15)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M207','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M207">View MathML</a>

(16)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M208','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M208">View MathML</a>

(17)

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M209','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M209">View MathML</a>

(18)

In preliminary experiments we observed that the Flow-RM outperforms the GESC-RM in terms of solution time. This fact is mainly due to the computational overhead introduced by the GSEC separation oracle which seems to be not compensated by the size of the analyzed instances. Hence, we did not consider the GESC-RM any further in our experiments.

During the runtime, we enabled the Xpress-MP automatic cuts and the Xpress-MP pre-solving strategy. Moreover, we also tested a number of generic primal heuristics for the Steiner tree problem to generate a first primal bound to the MPPEP-SNP (see, e.g., [34]). Unfortunately, in preliminary experiments we observed that the use of such heuristics interferes negatively with the Xpress Optimizer, by delaying the solution time of the analyzed instances. Hence, we disabled the used of the generic primal heuristics and enabled the use of the Xpress-MP primal heuristic instead. The source code of the algorithm can be downloaded at http://homepages.ulb.ac.be/~dacatanz/Site/Software_files/iMPPEP.zip webcite.

Separation oracle for the forbidden path constraints

When using the Flow-RM, the valid inequalities (3)-(12) are loaded together with the reduced model. On the contrary, the valid inequalities (13) are dynamically generated by means of a separation oracle working as follows. Before loading the reduced model, we precompute the sets Qd, for all d ∈ {3,4,…,m}. Let <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M210','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M210">View MathML</a> be the current fractional solution at a given node of the search tree and, for all d ∈ {3,4,…,m}, consider a pair of vertices i,jQd. Then, the forbidden path constraints (13) are violated if there exists a path p having internal vertices in C ⊂ V, 2 ≤ |C| ≤ d − 1, C ∩ Qd = , and such that

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M211','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M211">View MathML</a>

(19)

Note that searching for the most violated constraint (19) is in general <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M212','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M212">View MathML</a>-hard as it involves solving a longest path problem on the weighted graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M213','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M213">View MathML</a>, i.e., the graph <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M214','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M214">View MathML</a> whose edges are weighted by variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M215','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M215">View MathML</a> and whose vertex set is restricted to (VQd) ∪ {i,j}. In order to have a fast separation oracle for the forbidden path constraints we do not solve exactly (19) but we use a heuristic approach instead. Specifically, we first sort edges of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M216','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M216">View MathML</a> in decreasing order according to their weights and we select two distinct vertices v1,v2 ∈ VQd such that edge (v1,v2) has the largest weight. Subsequently, we set C = {v1,v2}, mark v1 and v2 as visited, and build a simple path from vertex i to vertex j passing by v1 and v2. If p is such that (19) is satisfied then we add the constraint

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M217','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M217">View MathML</a>

(20)

to the formulation; otherwise, we select a different pair of vertices in VQd and iterate this procedure until either 10 distinct paths have been generated or all possible pairs of vertices in VQd have been selected. If all vertices have been selected but less than 10 distinct paths have been generated, then we select a larger subset of VQd (e.g., a triplet of vertices) and we iterate again the previous steps. It is easy to realize that this procedure may potentially generate all the possible paths violating (13). However, we stop the procedure after generating 10 paths or after considering subset C containing more than 5 vertices as we observed in preliminary experiments that this strategy provides the best trade-off between speed and tightness of the cut.

Branching strategies

In preliminary experiments we observed that the standard branching strategy implemented in the Xpress-MP Optimizer is not appropriate for the problem as it is not able to exploit the dissimilarity of the weights ws in the objective function. This inconveniently leads to formulations characterized by slow solution times. To improve this aspect we implemented a different strategy consisting of branching on the following constraints:

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M218','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M218">View MathML</a>

(21)

or

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M219','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M219">View MathML</a>

(22)

where α ∈ {1,2,…,q} and <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M220','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M220">View MathML</a>. Constraints (21)-(22) limit the number of changes along a phylogeny with respect to a given coordinate <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M221','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M221">View MathML</a> and tend to be more effective when the weights ws are very dissimilar among them. This branching strategy can be implemented by introducing a decision variable

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M222','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M222">View MathML</a>

for all <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M223','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M223">View MathML</a> and α ∈ {1,2,…,q}, and by adding the following constraints

<a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M224','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M224">View MathML</a>

We observed that even better runtime performance can be obtained by sorting the coordinates of the input haplotypes in decreasing way according to the weights ws and by branching first on variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M225','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M225">View MathML</a>, then on variables ui, and subsequently on variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M226','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M226">View MathML</a> and finally on variables <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M227','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M227">View MathML</a>.

Performance analysis

In order to obtain a measure of the performance of the Flow-RM, we compared [1]’s polynomial-size formulation versus the Flow-RM on [1]’s real instances of the MPPEP-SNP, namely: Human chromosome Y constituted by 150 haplotypes having 49 SNPs each; bacterial DNA constituted by 17 haplotypes having 1510 SNPs each; Chimpanzee mitochondrial DNA constituted by 24 haplotypes having 1041 SNPs each; Chimpanzee chromosome Y constituted by 24 haplotypes having 1041 SNPs each; and a set of four Human mitochondrial DNA from HapMap [35] constituted by 40 haplotypes having 52 SNPs each, 395 haplotypes having 830 SNPs each, 13 haplotypes having 390 SNPs each, and 44 haplotypes having 405 SNPs each, respectively. Such instances consist only of non-recombining data (Y chromosome, mitochondrial, and bacterial DNA) and can be downloaded at http://homepages.ulb.ac.be/~dacatanz/Site/Software_files/iMPPEP.zip webcite.

Table 1 shows the results obtained by such comparison. Specifically, the fourth and fifth columns refer to the gaps (expressed in percentage) of the respective formulations, i.e., to the difference between the optimal value to a specific instance and the value of linear relaxation at the root node of the search tree, divided by the optimal value. The table shows that, excluding the cases in which the solution to a specific instance was trivially a minimum spanning tree (see e.g., Human chromosome Y, Chimpanzee mtDNA, and Chimpanzee chromosome Y), the Flow-RM is always characterized by (sometimes dramatically) smaller gaps. This fact derives on the one hand from the tightness of the Flow-RM with respect to [1]’s polynomial-size formulation and on the other hand from the efficiency of the strengthening valid inequalities previously described. The poor relaxations of their formulation led [1] to propose an alternative and faster exact approach to solution of the MPPEP-SNP based on the brute-force enumeration of all possible Steiner vertices necessary to solve a specific instance of the problem. To speed up the computation, the brute-force enumeration algorithm makes use of a set of reduction rules based on Buneman graph enumeration to decrease the number of Steiner vertices to be considered. Interestingly, despite the differences in terms of implementation language between the two programs (namely, Mosel for the Flow-RM and C++ for [1]’s brute-force enumeration algorithm), the Flow-RM proved to be competitive with [1]’s enumeration algorithm, being able to solve almost all the considered instances within 1 second computing time. Only in two cases, namely Human mtDNA 40×52 and Human mtDNA 395×830, the Flow-RM needed more than 5 minutes to find the corresponding optimal solutions. The deterioration of the runtime performance in those instances is mainly due to the overhead necessary to load the formulation (that in both cases is considerably bigger than in the other instances) and to an intensive use of the separation oracle for the forbidden path constraints. Possibly, a more thorough implementation of the separation oracle and the use of more performing languages (e.g., C++) could help in speeding up computations in those instances at least.

Table 1. Comparison between the gap of [[1]]’s polynomial size integer programming model for the MPPEP-SNP versus the gap of the flow-based reduced model and its strengthening valid inequalities

Interestingly, sometimes in real applications the number of haplotypes can be much bigger than the number of SNPs. Hence, it is important to test the ability of an exact algorithm to tackle instances of the MPPEP-SNP containing e.g., hundreds haplotypes. [1] observed that their brute force enumeration algorithm is able to tackle instances of the problem containing up to 270 haplotypes having up to 9 SNPs each within 12 hours computing time. Unfortunately, the authors also observed that their algorithm is unable to solve larger instances of the MPPEP-SNP, no matter the maximum runtime considered. In this context, the Flow-RM makes the difference, being able to tackle instances of the MPPEP-SNP having up to 300 haplotypes and 10 SNPs within 3 hours computing time. To show this result, we considered a number of random instances of the problem containing 100, 150, 200, 250, and 300 haplotypes, respectively. Fixing the number of haplotypes n∈{100,150,200,250,300}, we created an instance of the problem by generating at random n strings of length 10 over the alphabet Σ={0,1}. During the generation process, we randomly selected the number of SNPs equal to 1 in a given haplotype, and subsequently we randomly chose the sites of the haplotype to be set to 1. We iterated the instance generation process 10 times for a fixed value of n, obtaining eventually an overall number of 50 random instances of the MPPEP-SNP downloadable at http://homepages.ulb.ac.be/~dacatanz/Site/Software_files/iMPPEP.zip webcite.

The results obtained in our experiments are shown in Table 2. Specifically, the column “Time” refers to the solution time (expressed in seconds) necessary to solve exactly a specific instance of the MPPEP-SNP. Analogously, the column “Nodes” refers to the number of explored nodes in the search tree needed to solve exactly the instance. The table does not report on the performance of [1]’s enumeration algorithm, as their algorithm never found the optimal solution to the analyzed instances within the limit runtime of 3 hours. As a general trend, the table shows that the considered instances can be exactly solved within 1 hour computing time. The only exceptions are constituted by the 7th instance of the group 150×10, the 9th instance of the group 200×10, the 2th instance of the group 250×10, and 3th instance of the group 300×10which needed 8719.65, 4600.69, 7757.98, and 5371.05 seconds, respectively, to be solved. These instances are much more sparse than the others, are characterized by smaller reduction ratios, and tend to have more degenerate relaxations than the other instances. The presence of these factors might explain the loss of performance of the Flow-RM.

Table 2. Performances of the Flow-RM on a set of random instances of the MPPEP-SNP

The results showed that the integrality gaps are usually very low, ranging from 0% to 4.63% and assuming in average a value about 1%, confirming once again the tightness of the Flow-RM and the efficiency of the strengthening valid inequalities.

Finally, we also tested the performance of the Flow-RM on a set of 5 HapMap Human mitochondrial DNA instances of the MPPEP-SNP that were not solvable by using [1]’s brute-force enumeration algorithm, namely: f1 constituted by 63 haplotypes having 16569 SNPs each, i2 constituted by 40 haplotypes having 977 SNPs each, k3 constituted by 100 haplotypes having 757 SNPs each, m4 constituted by 26 haplotypes having 48 SNPs each, and p5 constituted by 21 haplotypes having 16548 SNPs each. Such instances can be downloaded at the same address and consist only of non-recombining data (Y chromosome, mitochondrial, and bacterial DNA).

A part from m4, all the remaining instances gave rise to too large formulations (several hundreds Mbytes RAM) to be handled by the Xpress Optimizer. Hence, instead of analyzing entirely each instance we decomposed it into contiguous SNP blocks and analyzed the most difficult block separately. In more in detail, we define <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M232','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M232">View MathML</a> to be the haplotype matrix obtained by the application of [1]’s reduction rules, we sorted the columns of <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M233','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M233">View MathML</a> according to an increasing ordering of the weights ws, <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M234','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M234">View MathML</a>; subsequently, we considered the submatrices obtained by taking k contiguous SNPs (or k-block) in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M235','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M235">View MathML</a>k ∈ {10,13,15}. We did not consider greater values for k as we observed that k = 15 was already a threshold after which the haplotype submatrix gave rise to too large formulations. For each k-block <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M236','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M236">View MathML</a>in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M237','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M237">View MathML</a> we considered the hamming distance <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M238','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M238">View MathML</a> between each pair of distinct haplotypes in <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M239','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M239">View MathML</a>, and chose the k-block maximizing the sum <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M240','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M240">View MathML</a>. Finally, we assumed three hours as maximum runtime per instance.

Table 3 shows the results obtained in our experiments. As for Table 2, the columns “Time” and “Nodes” refer to the solution time (expressed in seconds) and to the number of nodes in the search tree necessary to solve exactly a specific instance of the MPPEP-SNP, respectively. In such a case, the values in the columns “Gap” refers to the gap between the best primal bound found within the limit time and the root relaxation and “Nodes” refers to the number of nodes explored in the tree search within the limit time.

Table 3. Performances of the Flow-RM on a set of real instances of the MPPEP-SNP

Table 3 shows that, apart from the instances f1 and m4, the Flow-RM was unable to exactly solve, within the limit time, the considered instances for values of k ∈ {13,15}. Specifically, The Flow-RM exactly solved in less than a minute the instance f1 when considering values of k ∈ {10,13} ; in 20 minutes the instance i2 when considering k = 10 ; in less than 3 minutes the instance k3 when considering k = 10; and the instance m4 in 5 seconds. In contrast, the Flow-RM was unable to solve the instance p5, regardless of the value of k considered. In fact, already when considering k = 10, the Xpress Optimizer took more than 12 hours to exactly solve the instance p5 and explored over 10 million nodes in the search tree. A more detailed analysis of the instance showed that, despite the presence of the strengthening valid inequalities, p5 is characterized by highly fractional relaxations. This fact implies the existence of equivalent optimal solutions to the instance that, on the one hand, delay the finding of a primal bound and, on the other hand, force the Optimizer to explore many more nodes in the tree search. This situation in more pronounced in p5 but also occurs in the instances i2 and k3. To improve the tightness of the formulation we tried to include in the Flow-RM also classical facets and strengthening valid inequalities for the Steiner tree problem in a graph (see [23,36-38]). However, we did not observe any benefit from the inclusion. We suspect that the presence of highly fractional solutions to the problem could be caused both by poor lower bounds on the number of Steiner vertices considered in the Flow-RM and by the existence of a number of non trivial classes of symmetries still present in the problem. Investigating such issues warrants future research efforts.

In order to measure the performance of the model on multi-state character data we also considered [2] set of instances of the MPPEP-SNP. Specifically, we considered the following instances: a set of 41 sequences of O.rufipogon DNA (red rice) having 1043 sites each; 80 human mtDNA sequences having 245 sites each; 50 HIV-1 reverse transcriptase amino acid sequences having 176 sites each; a set of 500 sequences of mtDNA from the NCBI BLASTN best aligned taxa having 3000 sites each; a set of 500 sequences of mtDNA from the NCBI BLASTN best aligned taxa having 5000 sites each; and a set of 500 sequences of mtDNA from the NCBI BLASTN best aligned taxa having 10000 sites each. When running the same experiments described in [2] we registered a very poor performance for the Flow-RM, mainly due to the large dimension of the considered instances and the presence of symmetries despite the use of constraints (13)-(15). We observed that the combination of these two factors increased the runtime performance of the Flow-RM of 2-3 orders of magnitude with respect to [2] approach. However, we also observed that when performing [2]’s “window analysis” (i.e., when decomposing into blocks of 10 SNPs the input matrix) the Flow-RM performed better than [2]’s, being characterized by an average solution time of 8.27 seconds. This fact suggests that, when considering instances constituted by less than a dozen sites, an exact approach entirely based on integer programming may perform better than the implicit enumeration of the vertices of the generalized Buneman graph. Vice-versa, for larger instances the implicit enumeration of the vertices of the generalized Buneman graph appears more suitable.

Conclusion

In this article we investigated the Most Parsimonious Phylogeny Estimation Problem from Single Nucleotide Polymorphism (SNP) haplotypes (MPPEP-SNP), a recent version of the phylogeny estimation problem that arises when input data consist of SNPs extracted from a given population of individuals. The MPPEP-SNP is <a onClick="popup('http://www.almob.org/content/8/1/3/mathml/M241','MathML',630,470);return false;" target="_blank" href="http://www.almob.org/content/8/1/3/mathml/M241">View MathML</a>-hard and this fact has justified the development of exact and approximate solution approaches such as those described in [1,19,22,26-28]. We explored the prospects for improving on the strategy of [1,2] using a novel problem formulation and a series of additional constraints to more precisely bound the solution space and accelerate implicit enumeration of possible optimal phylogenies. We present a formulation for the problem based on an adaptation of [23]’s mixed integer formulation for the Steiner tree problem extended with a number of preprocessing techniques and reduction rules to further decrease its size. We then show that it is possible to exploit the high symmetry inherent in most instances of the problem, through a series of strengthening valid inequalities, to eliminate redundant solutions and reduce the practical search space. We demonstrate through a series of empirical tests on real and artificial data that these novel insights into the symmetry of the problem often leads to significant reductions in the gap between the optimal solution and its non-integral linear programming bound relative to the prior art as well as often substantially faster processing of moderately hard problem instances. More generally, the work provides an indication of the conditions under which such an optimal enumeration approach is likely to be feasible, suggesting that these strategies are usable for relatively large numbers of taxa, although with stricter limits on numbers of variable sites. The work thus provides methodology suitable for provably optimal solution of some harder instances that resist all prior approaches. Our results may provide also useful guidance for strategies and prospects of similar optimization methods for other variants of phylogeny inference. In fact, if appropriately adapted, some of the results we presented here (e.g., symmetry breaking strategies) can be generalized with respect to other phylogenetic estimation criteria (e.g., the likelihood criterion) and provide important computational benefits.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The authors equally contributed to conceive the work and write the article. DC implemented the algorithms and performed computations. All authors read and approved the final manuscript.

Acknowledgements

The first author acknowledges support from the Belgian National Fund for Scientific Research (FNRS) of which he is Chargé de Recherches, the Belgian American Educational Foundation (BAEF) of which he is honorary fellow, and the Carnegie Mellon University (CMU). Part of this work has been developed while Dr. Catanzaro was visiting the Tepper School of Business and the Department of Biological Sciences of the CMU during the academic years 2010-2011 and 2011-2012. This work was supported in part by U.S. National Institutes of Health awards 1R01CA140214 and 1R01AI076318.

References

  1. Sridhar S, Lam F, Blelloch GE, Ravi R, Schwartz R: Mixed integer linear programming for maximum parsimony phylogeny inference.

    IEEE/ACM Trans Comput Biol Bioinformatics 2008, 5(3):323-331. OpenURL

  2. Misra N, Blelloch GE, Ravi R, Schwartz R: Generalized Buneman pruning for inferring the most parsimonious multi-state phylogeny.

    J Comput Biol 2011, 18(3):445-457. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  3. Pachter L, Sturmfels B: The mathematics of phylogenomics.

    SIAM Rev 2007, 49:3-31. Publisher Full Text OpenURL

  4. Bush RM, Bender CA, Subbarao K, Cox NJ, Fitch WM: Predicting the evolution of human influenza A.

    Science 1999, 286(5446):1921-1925. PubMed Abstract | Publisher Full Text OpenURL

  5. Ross HA, Rodrigo AG: Immune-mediated positive selection drives human immunodeficency virus type 1 molecular variation and predicts disease duration.

    J Virol 2002, 76(22):11715-11720. PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL

  6. Ou CY, Ciesielski CA, Myers G, Bandea CI, Luo CC, Korber BTM, Mullins JI, Schochetman G, Berkelman RL, Economou AN, Witte JJ, Furman LJ, Satten GA, Maclnnes KA, Curran JW, Jaffe HW: Molecular epidemiology of HIV transmission in a dental practice.

    Science 1992, 256(5060):1165-1171. PubMed Abstract | Publisher Full Text OpenURL

  7. Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YS, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL: The genome sequence of the SARS-associated coronavirus.

    Science 2003, 300(5624):1399-1404. PubMed Abstract | Publisher Full Text OpenURL

  8. Chang BSW, Donoghue MJ: Recreating ancestral proteins.

    Trends Ecol Evol 2000, 15(3):109-114. PubMed Abstract | Publisher Full Text OpenURL

  9. Bader DA, Moret BME, Vawter L: Industrial applications of high-performance computing for phylogeny reconstruction. In SPIE ITCom 4528. SPIE, Denver; 2001:159-168. OpenURL

  10. Harvey PH, Brown AJL, Smith JM, Nee S: New Uses for New Phylogenies. Oxford University Press, Oxford; 1996. OpenURL

  11. Catanzaro D: Estimating phylogenies from molecular data. In Mathematical Approaches to Polymer Sequence Analysis and Related Problems. Edited by Bruni R.. Springer, New York; 2011. OpenURL

  12. Beyer WA, Stein M, Smith T, Ulam S: A molecular sequence metric and evolutionary trees.

    Math Biosci 1974, 19:9-25. Publisher Full Text OpenURL

  13. Waterman MS, Smith TF, Singh M, Beyer WA: Additive evolutionary trees.

    J Theor Biol 1977, 64:199-213. PubMed Abstract | Publisher Full Text OpenURL

  14. Albert VA: Parsimony, Phylogeny, and Genomics. Oxford University Press, Oxford; 2005. OpenURL

  15. Ding Z, Filkov V, Gusfield D: A linear time algorithm for Perfect Phylogeny Haplotyping (PPH) problem.

    J Comput Biol 2006, 13(2):522-553. PubMed Abstract | Publisher Full Text OpenURL

  16. Bonizzoni P: A linear time algorithm for the Perfect Phylogeny Haplotype problem.

    Algorithmica 2007, 48(3):267-285. Publisher Full Text OpenURL

  17. Catanzaro D: The minimum evolution problem: Overview and classification.

    Networks 2009, 53(2):112-125. Publisher Full Text OpenURL

  18. Felsenstein J: Inferring Phylogenies. Sinauer Associates, Sunderland; 2004. OpenURL

  19. Bafna V, Gusfield D, Lancia G, Yooseph S: Haplotyping as perfect phylogeny: A direct approach.

    J Comput Biol 2003, 10:323-340. PubMed Abstract | Publisher Full Text OpenURL

  20. Pennington G, Smith CA, Shackney S, Schwartz R: Reconstructing tumor phylogenies from heterogeneous single-cell data.

    J Bioinformatics Comput Biol 2006, 5(2a):407-427. OpenURL

  21. Riester M, Attolini CSO, Downey RJ, Singer S, Michor F: A differentiation-based phylogeny of cancer subtypes.

    PLoS Comput Biol 2010, 6:e100077. OpenURL

  22. Sridhar S, Dhamdhere K, Blelloch GE, Halperin E, Ravi R, Schwartz R: Algorithms for efficient near-perfect phylogenetic tree reconstruction in theory and practice.

    IEEE/ACM Trans Comput Biol Bioinformatics 2007, 4(4):561-571. OpenURL

  23. Goemans MX, Myung YS: A catalog of Steiner tree formulations.

    Networks 1993, 23:19-28. Publisher Full Text OpenURL

  24. Zhang XS, Wang RS, Wu LY, Chen L: Models and algorithms for haplotyping problem.

    Curr Bioinformatics 2006, 1:105-114. Publisher Full Text OpenURL

  25. Catanzaro D, Labbé M: The pure parsimony haplotyping problem: Overview and computational advances.

    Int Trans Oper Res 2009, 16(5):561-584. Publisher Full Text OpenURL

  26. Gusfield D: Efficient algorithms for inferring evolutionary trees.

    Networks 1991, 21:19-28. Publisher Full Text OpenURL

  27. Argawala R, Fernandez-Baca D: A polynomial time algorithm for the perfect phylogeny problem when the number of character states is fixed.

    SIAM J Comput 1994, 23:1216-1224. Publisher Full Text OpenURL

  28. Kannan S, Warnow T: A fast algorithm for the computation and enumeration of perfect phylogenies.

    SIAM J Comput 1997, 26:1749-1763. Publisher Full Text OpenURL

  29. Garey MR, Johnson DS: Computers and Intractability. Freeman, New York; 2003. OpenURL

  30. Du DZ, Smith JM, Rubinstein JH: Advances in Steiner trees. Kluwer Academic Publisher, Boston; 2000. OpenURL

  31. Cheng X, Du DZ: Steiner Trees in Industry. Kluwer Academic Publishers, Boston; 2001. OpenURL

  32. Fischetti M, Salazar-Gonzáles JJ, Toth P: A branch and cut algorithm for the symmetric generalized traveling salesman problem.

    Oper Res 1997, 45(30):378-395. OpenURL

  33. Halldórsson BV, Bafna V, Edwards N, Lippert R: Combinatorial problems arising in SNP and haplotype analysis. In Discrete Mathematics and Theoretical Computer Science, Volume 2731 of Lecture Note in Computer Science. Edited by Calude CS. Springer-Verlag, Berlin; 2003:26-47. OpenURL

  34. Robins G, Zelikovsky A: Tighter bounds for graph Steiner tree approximation.

    SIAM J Discrete Math 2005, 19:122-134. Publisher Full Text OpenURL

  35. The International HapMap Consortium: The international hapmap project.

    Nature 2003, 426(18):789-796. PubMed Abstract | Publisher Full Text OpenURL

  36. Chopra S, Rao MR: The Steiner tree problem I: Formulations, compositions and extension of facets.

    Math Programming 1994, 64(1-3):209-229. Publisher Full Text OpenURL

  37. Chopra S, Rao MR: The Steiner tree problem II: Properties and classes of facets.

    Math Programming 1994, 64(1-3):231-246. Publisher Full Text OpenURL

  38. Koch T, Martin A, Voß S:

    SteinLib: An Updated Library on Steiner Tree Problems in Graphs.

    Berlin: Tech. Rep. ZIB-Report 00-37, Konrad-Zuse-Zentrum für Informationstechnik Berlin, Takustr. 7; 2000. [http://elib.zib.de/steinlib webcite]

    OpenURL