Abstract
Background
Understanding the evolution of biological networks can provide insight into how their modular structure arises and how they are affected by environmental changes. One approach to studying the evolution of these networks is to reconstruct plausible common ancestors of presentday networks, allowing us to analyze how the topological properties change over time and to posit mechanisms that drive the networks’ evolution. Further, putative ancestral networks can be used to help solve other difficult problems in computational biology, such as network alignment.
Results
We introduce a combinatorial framework for encoding network histories, and we give a fast procedure that, given a set of gene duplication histories, in practice finds network histories with close to the minimum number of interaction gain or loss events to explain the observed presentday networks. In contrast to previous studies, our method does not require knowing the relative ordering of unrelated duplication events. Results on simulated histories and real biological networks both suggest that common ancestral networks can be accurately reconstructed using this parsimony approach. A software package implementing our method is available under the Apache 2.0 license at http://cbcb.umd.edu/kingsfordgroup/parana webcite.
Conclusions
Our parsimonybased approach to ancestral network reconstruction is both efficient and accurate. We show that considering a larger set of potential ancestral interactions by not assuming a relative ordering of unrelated duplication events can lead to improved ancestral network inference.
Keywords:
Network evolution; Arsimony; Ancestral network reconstruction; Interaction networks; Regulatory networksBackground
Highthroughput experiments have revealed thousands of regulatory and proteinprotein interactions that occur in the cells of presentday species. To understand why these interactions take place, it is necessary to view them from an evolutionary perspective. In analogy with ancestral genome reconstruction [1], we consider the problem of predicting the topology of the common ancestor of pathways, complexes, or regulatory programs present in multiple extant species.
Reconstructing plausible ancestral networks can help answer many natural questions about how presentday networks have evolved. For example, joint histories can be used to compare the conservation and the route to divergence of corresponding processes in two species. This allows us to more finely quantify how modularity has changed over time [2] and how interactions within a protein complex may have reconfigured across species starting from a single shared state [3]. Such analysis can also be integrated to develop better network alignment algorithms and better networkbased phylogenies [48], and it can be used to study robustness and evolvability [911]. Further, inferred changes in metabolic networks can be linked to changes in the biochemical environment in which each species has evolved, and this can reveal novel mechanisms of ecological adaptation [12,13]. Finally, comparing network histories inferred using different model parameters can be used to estimate the likelihoods of various evolutionary events [14,15].
There has been some recent work on reconstructing ancestral interactions. Gibson and Goldberg [16] presented a framework for estimating ancestral protein interaction networks that handles gene duplication and interaction loss using gene trees reconciled against a species phylogeny. However, their approach assumes that interaction losses occur immediately after duplication and does not support interaction gain outside of gene duplication. These assumptions are limiting because interaction loses may occur well after duplication, and independent gains are believed to occur at nontrivial rates [17]. Dutkowski and Tiuryn [6] provided a probabilistic method for inferring ancestral interactions with the goal of improved network alignment. Their approach is based on constructing a Bayesian network with a tree topology where binary random variables represent existence or nonexistence of potential interactions. A similar graphical model was proposed by Pinney et al. [18], who applied it to inferring ancestral interactions between bZIP proteins. In the former method, interaction addition and deletion is assumed to occur only immediately following a duplication or speciation event. Further, both methods assume the relative ordering of duplication events is known even between events in unrelated homology groups. Pinney et al. [18] also explore a parsimonybased approach [19] and find it to work well; however, it too assumes a known ordering of unrelated duplication events. The main drawback of these approaches is that the assumed ordering comes from sequencederived branch lengths, which do not necessarily agree with rates that would be estimated based on network evolution [20]. This motivates an approach such as we describe below that does not use branch lengths as input.
Zhang and Moret [20,21] use a maximumlikelihood method to reconstruct ancestral regulatory networks as a means to improve estimation of regulatory networks in extant species. Mithani et al. [22] study the evolution of metabolic networks, but they only model the gain and loss of interactions amongst a fixed set of metabolites, whereas we also consider node duplication and loss encoded by a tree. Navlakha and Kingsford [15] present greedy algorithms for finding highlikelihood ancestral networks under several assumed models of network growth. They applied these methods to a yeast protein interaction network and a social network to estimate relative arrival times of nodes and interactions and found that the inferred histories matched many independently studied properties of network growth. This attests to the feasibility of using networks to study evolution. The authors, however, only consider a single network at a time, and there is no guarantee that independent reconstruction of two networks will converge to a common ancestor.
Here, we introduce a combinatorial framework for representing histories of network evolution that can encode gene duplication, gene loss, interaction gain and interaction loss at arbitrary times and does not assume a known total ordering of duplication events. We show that almost parsimonious histories of interaction gain and loss can be computed in practice quickly given a duplication history. In simulated settings, we show that these parsimonious histories can be used to accurately reconstruct a common ancestral regulatory network of two extant regulatory networks. We also show that our approach can infer, with high accuracy, the interactions among the bZIP family of proteins in several ancestral organisms.
Methods
A framework for representing network histories
Any natural model of network evolution will include events for gene duplication, gene loss, interaction gain, and interaction loss. Many such growth models have been studied (e.g. [9,21,2326]). We describe below how these events can be encoded in a history graph. We note that there are other evolutionary events that affect the growth and structure of biological networks. For example, TollRiera et al. [27] provide evidence for de novo gene birth originating from noncoding genomic regions. While such events play a role in shaping the evolutionary history and current structure of biological networks; they are less common than the gene duplication and loss and interaction gain and loss, and are not explicitly modeled in the current framework.
Consider a set V of proteins or genes (henceforth “nodes”) descended from a common ancestor by duplication events. Those duplication events can be encoded in a binary duplication treeT with the items of V as the leaves. An internal node u in T represents a duplication event of u into its left and right children, u_{L} and u_{R}. In this representation, after a duplication event, the node represented by u conceptually does not exist anymore and has been replaced by its two children. The leaves of a duplication tree are labeled Present or Absent. Absent leaves represent products of duplication events that were subsequently lost. A collection of such trees is a duplication forest F.
The gain and loss of interactions can be represented with additional nontree edges placed on a duplication forest. A nontree edge {u,v} represents an edge flip event, where the interaction between u and v is created if the interaction is currently absent or removed if the interaction is currently present. Let P_{u} and P_{v} be the paths from nodes u and v to the root. An interaction exists between u and v if there are an odd number of such flip nontree edges between nodes in P_{u} and P_{v}. Every nontree edge between P_{u} and P_{v}, therefore, represents alternatively interaction creation or deletion between nodes u and v in the evolution of the biological network.
A graph H consisting of the union of a duplication forest and flip nontree edges is a network history. A history Hconstructs a graph G when the Present leaves of the duplication forest in H correspond to the nodes of G and the flip edges of H imply an interaction between u and v if and only if {u,v} is an interaction in G. See Figure 1 for an example history.
Figure 1. Framework Overview. A duplication forest (solid edges at top) with the nontree edges (dashed) necessary to construct G_{1} and G_{2} (shown at bottom). Nodes 1, 2, and 3 represent the 3 homology groups present in the ancestral graph. Node 14 was lost. As an example of the connectivity induced by the nontree edges, consider edge (27,18) in G_{2} which is implied by the directed nontree edge from (3,2). However, the reverse edge (18,27), which is implied by (2,3), does not exist because its state is flipped by (8,20).
Not all placements of nontree edges lead to a valid network history. The interaction histories have to be consistent with some temporal embedding of the tree. Let and be respectively the time of creation and duplication of node u. Naturally, , if u is a Present leaf, and if v is the child of u, then by definition we have
If {u,w} is a flip edge, then the time t_{{u,w}} of appearance of this edge must satisfy
because an event between u and w can only occur when both u and w exist. A history graph H is said to be valid if there exist for every node u such that conditions (1) and (2) are satisfied for every nontree edge.
Whether a particular history is valid can be checked combinatorially using the following alternative characterization of validity. A kblocking loop is a set of flip edges {{u_{i},v_{i}}}_{0≤i<k} such that u_{i + 1} is an ancestor of v_{i} in the tree for 0 ≤ i < k (where the index i + 1 is taken modulo k). See Figure 2 for examples. Blocking loops are not permitted in valid histories and, conversely, the nonexistence of blocking loops implies that a history is valid, as shown in Prop. 11.
Figure 2. Blocking Loops. Blocking loops of size 1, 2 and 3. The solid lines represent a subset of the tree T. The dashed lines are nontree edges representing interaction flip events.
Proposition 1
A history graph H is valid if and only if it does not have any blocking loop of any length.
Proof
Suppose there is a kblocking loop. Using the same notation as above, we have the inequalities
which is a contradiction. Hence, to not have any blocking loops is necessary.
Conversely, suppose that H does not have any blocking loops. We assign times to the nodes and nontree edges using a modified depthfirst search (DFS) algorithm following the tree edges only. First, the root of the tree is given a creation time of 0. During DFS, just before calling DFS recursively on the left and right children of a node u, we set the duplication time , where the second max is taken over all nontree edges adjacent to u. Also, we set the creation time of the children .
When DFS visits a node u with some nontree edge {u,v} where v has not been assigned a creation time, u is added to a set Q and DFS is not called recursively on the children of u. The main loop consists of calling DFS again on all the nodes in Q until this set is empty. By construction, the algorithm assigns times which satisfy conditions (1) and (2). Therefore, if the algorithm terminates, H is a valid history.
At each main iteration, the nodes in the set Q are all the nodes u for which is set but is not set. It suffices to show that at each such iteration, at least one of the nodes in the set Q will not be added again to Q by a call to DFS. In other words, for at least one node u ∈ Q, every nontree edge {u,v} has set. For a contradiction, suppose not. Take u_{1} ∈ Qand {u_{1},v_{1}} with not set. There is necessarily an ancestor of v_{1}, call it u_{2}, which is in Q. Similarly, take {u_{2},v_{2}} with not set and its ancestor u_{3} ∈ Q, and so on. Because Q is finite, u_{j }= u_{i }for some j > i, and we constructed a blocking loop. Hence, the algorithm must terminate. □
Parsimonious reconstruction of a network history
Traditional phylogenetic inference algorithms and reconciliation between gene and species trees can be used to obtain duplication and speciation histories [2830]. What remains is the reconstruction of interaction gain and loss events. This leads to the following problem:
Problem 1
(Minimum Flips) Given a duplication forest F and an extant network G, find H, a valid history constructing G, with a minimum number of flip edges.
We will show that nearly optimal solutions to this problem for a large range of instances can be solved in polynomial time in practice. Whether Problem 1 is NPhard or admits a polynomialtime algorithm for all instances remains open.
A fast heuristic algorithm
The challenge of Problem 1 comes from avoiding the creation of blocking loops. A polynomialtime algorithm can find a minimum set of flip edges that reconstructs a graph G and does not contain 1 and 2blocking loops but allows longer blocking loops. We define an interaction encoding of G = (V,E) as a function f_{G}: V × V → {0,1} such that f_{G }(u,v) = 1 if {u,v} is an interaction in G and f_{G }(u,v) = 0 otherwise. We omit the subscript on f_{G }if G is clear from the context.
The following intertwined dynamic programming recurrences find the minimum number of flip edges required for H to construct a given graph G if blocking loops of length ≥ 3 are allowed. First, S (u,f) finds the minimum number of flip edges for the subtree rooted at u and interaction encoding f :
The expression A (u,v,f) gives the minimum number of flip edges that should be placed between the subtree rooted at u and the subtree rooted at v. This can be computed using the recurrence:
In the above, if one of u or v is a leaf but the other is not, the options that look at nonexistent children are disallowed.
The function in Eqn. (4) is defined as 1−f and thus represents a function such that has opposite parity from f(x) for all x. The A recurrence considers two possible options: (1) We connect u and v with a nontree edge, this costs us 1 and flips the parity of all interactions going between the subtree rooted at u and the subtree rooted at v; or (2) We do not connect u and v with a flip edge. This costs 0 and keeps the parity requirement the same. Regardless of the choice to create an edge, because we are not allowed to have a 2blocking loop, either (a) we possibly connect u to some descendant of v (and do not connect v to a descendant of u) or (b) we possibly connect v to some descendant of u (and do not connect u to a descendant of v).
The base case for the S recurrence when u is a leaf and the base case for the A recurrence when u and v are leaves are:
The minimum number of flip edges needed to turn a duplication forest F into a history constructing G (allowing blocking loops of ≥ 3) is then given by ∑_{r}S(r,d_{G}) + ∑_{r,q}A(r,q,d_{G}), where d_{G} is the interaction encoding of G, and the sums are over roots r ,q of the trees in F. Standard backtracking can be used to recover the actual minimum edge set. If n is the number of nodes in the forest, the dynamic program runs in O(n^{2}) time and space because only two functions f are ever considered: d_{G}, and . This yields ≈ n × n × 2 subproblems, each of which can be solved in constant time.
The heuristic also can be extended to handle different costs for interaction addition and deletion by changing the constants in the recurrences to be a function of the parity of each flip. Only two values of f (_{dG} and ) are ever considered, and every flip switches f between these two states. Thus, by examining f , and determining if its current states corresponds to d_{G}or , one can determine if an odd or even number of flips have occurred, and thus, whether the current flip corresponds to the addition or deletion of an interaction. If the current flip represents the addition of an interaction, then it incurs the cost c_{add}. Otherwise, the flip encodes the loss of an interaction, and incurs the loss cost c_{loss}.
Identifying and removing blocking loops
To identify blocking loops, we use a modified depthfirst search procedure in which tree edges are traversed according to their direction (i.e away from the root) while nontree edges can be traversed in either direction. Whenever a node is encountered twice during the depth first search, a cycle has been discovered and is checked for the blocking loop condition given above. If the cycle is not blocking loop, we can safely ignore it. Otherwise, one of the nontree edges of this loop is chosen at random, and we forbid that edge from appearing in the solution and rerun the dynamic program. Because there are O(n^{2}) possible nontree edges, iterating this procedure will terminate in polynomial time. We repeat the process of identifying blocking loops and forbidding nontree edges until a valid solution is obtained. In the worst case, one may obtain a solution where all nontree edges are placed at leaves, but in practice long blocking loops do not often arise, and the obtained solutions are close to optimal (see section below).
Reconstruction of a common ancestor of two graphs
Given extant networks of several species, in addition to the reconstructed history, we seek a parsimonious estimate for their common ancestor network. Specifically, given extant networks G_{1} and G_{2}, with interaction encodings d_{1} and d_{2}, and their duplication forests F_{1} and F_{2}, we want to find an ancestral network X = (V_{X}, E_{X}) such that the cost of X evolving into G_{1} and G_{2} after speciation is minimized. V_{X }is the set of roots of the homology forests. We assume that the networks of the two species evolved independently after speciation. Therefore, we can use the recurrence above applied to F_{1} and F_{2} to compute and independently for r,q ∈ V_{X}, and then select interactions in X as follows. E_{X} of X is given by the pairs r, q ∈V_{X }× V_{X } for which creating an interaction leads to a lower total cost than not creating an interaction. Formally, we place an interaction {r,q} in E_{X} if
Rule (5) creates an interaction in X if doing so causes the cost of parsimonious histories inferred for G_{1} and G_{2} between the homology groups associated with r and q to be smaller than if no interaction was created.
Modifications for selfloops
Selfloops (homodimers) can be accommodated by modifying recurrence (3):
The intuition here is that paying cost 1 to create a selfloop on node u creates (or removes) interactions, including selfloops, among all the descendants of u.
Modifications for directed graphs
The algorithm can be modified to handle evolutionary histories of directed graphs. For this, only the recurrence A need be modified. When computing A^{′ }(u,v,f), a nontree edge can be included from u to v, from v to u, both, or neither. Each of these cases modifies the function f in a different way. Specifically:
where the vertical ellipsis indicates the symmetric cases involving v_{L} and v_{R}, and where are defined, depending on u and v, as follows:
with defined analogously to . Here, ST(u) indicates the set of nodes in the subtree rooted at u.
Accounting for phylogenetic branch lengths
One of the strengths of our proposed method is that it does not require the user to specify the lengths of the edges in a duplication history. The estimation of such phylogenetic branch lengths relies on the molecular clock assumption, and these lengths can easily be misestimated, especially those for distant ancestors.
However, previous approaches [18,19] relied crucially upon the phylogenetic branch lengths to impose a specific ordering on the set of potential ancestral interactions. Small errors in the estimates of phylogenetic branch lengths can lead these approaches to disallow potentially high probability or high parsimony ancestral interactions.
Yet, the branch lengths in the duplication history do encode potentially useful information. For example, two ancestral proteins for which the intervals of existence are separated by a significant amount of time are unlikely to have interacted, even if branch length estimates are imprecise. The algorithm we defined above can be further modified to account for branch lengths, using them to penalize unlikely ancestral states without explicitly disallowing potentially important interactions. This can be achieved by modifying the recurrence as follows:
where
The analogous modification applies to the directed recurrence as well. Here, α δ(·,·) is a function that assigns a cost to a pair of nodes {u,v} that is proportional to the distance between the existence intervals of these nodes (and is 0 if they overlap). The constant, α, is provided as input to the algorithm and can be interpreted as the factor by which interactions are penalized between nodes which do not overlap in time according to the inferred phylogenetic branch lengths. At α = ∞, branch lengths become hard constraints, and proteins between which the existence intervals do not overlap are not allowed to interact; this αalso prohibits the formation of blocking loops. However, results tend to be better (higher F1score) when one allows some constraints from branch lengths to be violated. This approach allows our algorithm to take phylogenetic branch lengths into account in a way that incorporates the information they encode without suffering from the potential issues that occur when considering these lengths as hard constraints.
Results and discussion
We analyze the performance of our parsimonybased approach to ancestral network reconstruction on both simulated and real biological data. To generate simulated data, we consider a number of plausible models of network evolution and show that the parsimony approach is able to reconstruct ancestral networks reasonably well over a wide range of model parameters. Further, following the experiment of Pinney et al. [18], we evaluate the performance of our approach on reconstructing the state of several ancestral network states of the bZIP family of proteins. We observe that our parsimonybased approach obtains high precision and recall, even on fairly distant ancestral networks.
Generating plausible simulated histories
We use a degreedependent model (DDM) to simulate the evolutionary path from a putative ancestral network to its extant state. The model simulates node duplication, node deletion, independent interaction gain, and independent interaction loss with given probabilities P_{ndup}, P_{nloss}, P_{egain} and P_{eloss}, respectively. The nodes or edges involved in a modification are chosen probabilistically based on their degrees (as in [31]) according to the following expressions:
where is the outdegree of a node u, and k_{u }is the total degree. At each time step, the distribution of possible modifications to the graph is calculated as P(modification) = P_{operation} P(object ∣ operation). Nodes with outdegree of 0 are removed. Varying parameters P_{ndup}, P_{nloss}, P_{egain} and P_{eloss} can produce a wide variety of densities and sizes. We also consider a degreeindependent model (DIM) in which the four conditional probabilities in Eqns. (11) and (12) are all equal.
The DDM model is theoretically capable of producing evolutionary trajectories between any two networks while incorporating preferential attachment to the source node and random uniform choice of the target node. Furthermore, choosing a node for duplication or loss in inverse proportion to its degree favors an event in inverse relation to its expected disruption of the network.
We also consider a model of regulatory network evolution by Foster et al. [32], which is based on gene duplication, with incoming and outgoing interactions kept after duplication as in other models (P_{inkeep} and P_{outkeep} probabilities respectively). New edges are added with probability P_{innovation}.
In all of the network evolution models, we started with a random connected seed graph that has 10 nodes and 25 interactions. We evolved it to X by 200 operations after which we introduce a speciation event, and then both G_{1} and G_{2} evolve from X by an additional 200 operations each. To generate more biologically plausible ancestral graphs, instances were kept only if the ancestral graph X had an indegree that fit an exponential distribution with parameter between 1.0 and 1.2 or an outdegree that was scalefree with parameter between 1.8 and 2.2.
Reconstructing simulated networks
Optimality of loop breaking
The greedy procedure to break blocking loops produces histories that are very close to optimal. We generated 1400 networks using the DDM model with the range of parameters shown on the xaxis of Figure 3a. In the vast majority of cases (1325 out of 1400), either no loop breaking is required, or the solution discovered after greedily breaking all loops has the same cost as the original solution. In these cases, therefore, the method returned a provably maximally parsimonious set of interaction modification events. In the remaining 75 cases (5.4%), greedily removing blocking loops increased the number of interaction modifications by no more than 10 (< 2% of the initial number of interaction modification events). Since the initial solution provides a lower bound on the optimal, we can verify that the greedy procedure always found a solution within 2% of the optimal (and perhaps even better). Thus, it seems that in practice, while blocking loops occur, the greedy procedure does a good job of eliminating them without increasing the number of events significantly.
Figure 3. Synthetic Performance. (ac) Effect of model parameters on reconstruction accuracy under three different models. “Prob” in (c) is P_{innovation}. (d) Effect of evolutionary distance (number of network modification operations) on the quality of the ancestral network reconstruction. In both plots, boxes show 1st and 3rd quartile over 100 networks with median indicated by a line. Pentagons show the median if interactions incident to nodes lost in both lineages are not considered.
Effect of growth model and its parameters
Modeling the evolutionary dynamics of a regulatory network is still an active topic of research. We therefore experimented with three different network models. Despite their differences, high precision and recall (implied from the F1 score) can be obtained for all of them for many choices of their parameters (Figure 3ac). We measure the precision (defined as true positive/(true positive + false positive)), the recall (defined as true positive/(true positive + false negative)) and compute the F1score (the harmonic mean of precision and recall: 2 · precision ·recall/(precision + recall)). Very good performance can be achieved under the general model presented above whether degree distributions are taken into account (Figure 3a) or not (Figure 3b) when selecting nodes and interactions to modify. In these cases, for most parameter choices, precision is close to 1.0, meaning every interaction predicted to be in the ancestor, in fact, was. Recall is often lower. The Foster et al. [32] model, with its heavy reliance on duplication events and lack of node loss events, tends to be the simplest under which to reconstruct the ancestral graph (Figure 3c).
The largest factor leading to poorer performance is lower recall caused by gene losses. If all descendants of a gene are lost in both extant networks, it is not possible to reconstruct interactions incident to it. If these interactions are excluded from the computation of recall, the F1 score often improves dramatically. Median F1 scores excluding these interactions are shown as pentagons in Figure 3.
Robustness to evolutionary divergence
Naturally, the ability to recover the ancestral network degrades as time passes and the extant networks diverge. However, the degradation is slow (Figure 3d, using the degreedependent model with parameters fixed at P_{ndup} = 0.35, P_{nloss} = 0.05, P_{egain} = 0.3, and P_{eloss} = 0.3). When the distance is small (measured as the number of events separating them), we are almost always able to recover the ancestral network well, as illustrated by the high F1scores and small interquartile ranges in Figure 3d. Even when the distance between the ancestral and extant networks is large (300) compared to the average ancestral network size (55), we obtain an F1score of 0.72 (0.77 when homology groups lost in both lineages are not considered).
Reconstructing ancestral bZIP networks
We also repeated the test performed by Pinney et al. [18] by using our method to reconstruct ancestral interactions among the bZIP family of transcription factors. The interactions between dimerizing bZIP transcription factors are strongly mediated by their coiledcoil leucine zipper domains, and the strength of these interactions can be computationally predicted with high sensitivity and specificity using sequence alone [33]. This sequencebased method was used to predict both the interaction strength between extant bZIP proteins and inferred ancestral protein sequences. These interactions were used as the ground truth [18]. The duplication history relating the bZIP proteins is built atop the extant networks of 4 relatively distant species, D. rerio, T. rubripes, H. sapiens, and C. intestinalis. From the interactions in these extant networks and the structure of the duplication history of the constituent proteins, we reconstruct 3 ancestral networks: the Teleost (ancestor of D. rerio and T. rubripes), Vertebrata (ancestor of D. reo, T. rubripes and H. sapiens) and Chordate (ancestor of D. rerio, T. rubripes, H. sapiens, and C. intestinalis) networks.
Table 1 compares the relative performance of our parsimonybased approach and the probabilistic method described by Pinney et al. [18] Our results were generated using a ratio of 11.4 : 1 for the cost of interaction creation to interaction deletion (the same ratio as was used in the probabilistic method). Furthermore, we choose not to penalize interactions based on phylogenetic branch length (i.e. α = 0 in δ_{α}), thus allowing our algorithm to explore the entire solution space. We note that our approach outperforms the probabilistic method, particularly on the Teleost and Vertebrata networks. One explanation for the improved performance of our method is that it considers a larger set of ancestral interactions by not explicitly disallowing parsimonious interactions based solely on potentially misleading phylogenetic branch lengths.
Table 1. bZIP Reconstruction Performance
We corroborated this hypothesis by measuring the reconstruction performance of our approach for increasing values of α, and noticed a very slow but steady decrease in performance as α increases. Nonetheless, at α = ∞(using branch lengths as hard constraints as Pinney et al. do), our method still outperforms the probabilistic method on the Teleost network (F1 score of 0.84 vs 0.77). This experiment suggests that, at least on this family of protein interactions, relying on the phylogenetic branch lengths to aid inference does not improve — and potentially harms — performance.
A visual inspection (see Figure 4) of the inferred ancestral networks revealed no strong patterns among the interactions predicted based on sequence versus those predicted using our parsimony approach. However, if a protein is involved in a disagreement, it is often involved in more than one.
Figure 4. bZIP Reconstructions. The inferred networks of the Teleost, Vertebrata and Chordata ancestors. Edges drawn in gray were inferred by both our parsimonybased approach and by the sequencebased approach. Red edges were inferred based on sequence but not by the parsimony method, and the blue edges were inferred by the parsimony method but not based on sequence.
Conclusion
We have presented a novel framework for representing network histories involving gene duplications, gene loss, and interaction gain and loss for both directed and undirected graphs. We also provide a combinatorial characterization for valid histories. Our experiments demonstrate that a fast heuristic can recover optimal histories in a large majority of instances. We further provide evidence that, even with a probabilistic, weighted, generative model of network growth, a parsimony approach can recover accurate ancestral networks (F1 scores ≥ 0.8 for a wide range of parameters under several different models). Finally, we show that our method accurately reconstructs a number of ancestral networks for the bZIP family of proteins. Interestingly, we observe that we obtain the highest accuracy in ancestral network reconstruction when we do not impose a particular ordering on unrelated duplication events (as implied by phylogenetic branch lengths). This suggests that the ability of our approach to explore a larger space of potential solutions than previous work can provide practical benefits. In future work, it will be interesting to explore topological properties of the ancestral networks, such as modularity and degree distribution, and to analyze how these properties may have changed over time. We would also like to extend the evolutionary history framework and inference algorithm to handle de novo gene birth events, which are known to contribute to network growth [27].
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
The history encoding framework and algorithms were devised by RP, ES, JM, GM, SN and CK. The generalized regulatory models were devised and implemented by JM and RP. The algorithms were implemented by RP and ES. The synthetic tests were performed by RP and ES and the bZIP tests were performed by RP. The manuscript was written by RP, ES, JM, GM, SN and CK. All authors participated in the discussions. All authors read and approved the final manuscript.
Acknowledgements
This work was partially supported by National Science Foundation grants EF0849899, IIS0812111, and CCF1053918 and National Institutes of Health grant 1R21AI085376 to C.K. G.M. was partially supported by Agriculture and Food Research Initiative Competitive grants 200804049 and 20101573901 from the USDA National Institute of Food and Agriculture and NIH grant R01HG002945. The authors thank John Pinney for providing us with the bZIP data used in their manuscript and with the predictions of their probabilistic approach. We also wish to thank Geet Duggal and Darya Filippova for helpful discussions.
References

Pachter L: An introduction to reconstructing ancestral genomes.

Kreimer A, Borenstein E, Gophna U, Ruppin E: The evolution of modularity in bacterial metabolic networks.
Proc Natl Acad Sci USA 2008, 105(19):69766981. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

PereiraLeal JB, Levy ED, Kamp C, Teichmann SA: Evolution of protein complexes by duplication of homomeric interactions.
Genome Biol 2007, 8(4):R51. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Flannick J, Novak A, Srinivasan BS, McAdams HH, Batzoglou S: Graemlin: general and robust alignment of multiple large interaction networks.
Genome Res 2006, 16(9):11691181. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Singh R, Xu J, Berger B: Pairwise global alignment of protein interaction networks by matching neighborhood topology.
Proc. Intl. Conf. on Research in Computational Molecular Biology (RECOMB) 2007, 1631.

Dutkowski J, Tiuryn J: Identification of functional modules from conserved ancestral protein–protein interactions.
Bioinformatics 2007, 23(13):i149i158. PubMed Abstract  Publisher Full Text

Erten S, Li X, Bebek G, Li J, Koyuturk M: Phylogenetic analysis of modularity in protein interaction networks.
BMC Bioinformatics 2009, 10:333. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Kuchaiev O, Milenkovic T, Memisevic V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny.
J R Soc Interface 2010, 7(50):13411354. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Aldana M, Balleza E, Kauffman S, Resendiz O: Robustness and evolvability in genetic regulatory networks.
J Theor Biol 2007, 245(3):433448. PubMed Abstract  Publisher Full Text

EspinosaSoto C, Martin OC, Wagner A: Phenotypic robustness can increase phenotypic variability after nongenetic perturbations in gene regulatory circuits.
J Evol Biol 2011, 24(6):12841297. PubMed Abstract  Publisher Full Text

Raman K, Wagner A: Evolvability and robustness in a complex signalling circuit.
Mol Biosyst 2011, 7(4):10811092. PubMed Abstract  Publisher Full Text

Borenstein E, Kupiec M, Feldman MW, Ruppin E: Largescale reconstruction and phylogenetic analysis of metabolic environments.
Proc Natl Acad Sci USA 2008, 105(38):1448214487. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Borenstein E, Feldman MW: Topological signatures of species interactions in metabolic networks.
J Comput Biol 2009, 16(2):191200. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Middendorf M, Ziv E, Wiggins CH: Inferring network mechanisms: the Drosophila melanogaster protein interaction network.
Proc Natl Acad Sci USA 2005, 102(9):31923197. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Navlakha S, Kingsford C: Network archaeology: uncovering ancient networks from presentday interactions.
PLoS Comput Biol 2011, 7(4):e1001119. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Gibson TA, Goldberg DS: Reverse engineering the evolution of protein interaction networks.

Levy ED, PereiraLeal JB: Evolution and dynamics of protein interactions and networks.
Curr Opin Struct Biol 2008, 18(3):349357. PubMed Abstract  Publisher Full Text

Pinney JW, Amoutzias GD, Rattray M, Robertson DL: Reconstruction of ancestral protein interaction networks for the bZIP transcription factors.
Proc Natl Acad Sci USA 2007, 104(51):2044920453. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mirkin BG, Fenner TI, Galperin MY, Koonin EV: Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes.
BMC Evol Biol 2003, 3:2. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Zhang X, Moret BM: Boosting the performance of inference algorithms for transcriptional regulatory networks using a phylogenetic approach.
Proc. Intl. Workshop on Algorithms in Bioinformatics (WABI) 2008, 245258.

Zhang X, Moret B: Refining transcriptional regulatory networks using network evolutionary models and gene histories.
Alg Mol Biol 2010, 5:1. BioMed Central Full Text

Mithani A, Preston G, Hein J: A stochastic model for the evolution of metabolic networks with neighbor dependence.
Bioinformatics 2009, 25(12):15281535. PubMed Abstract  Publisher Full Text

Chung F, Lu L, Dewey TG, Galas DJ: Duplication models for biological networks.
J Comp Biol 2003, 10(5):677687. Publisher Full Text

Teichmann SA, Babu MM: Gene regulatory network growth by duplication.
Nat Genetics 2004, 36(5):492496. Publisher Full Text

PastorSatorras R, Smith E, Sole R: Evolving protein interaction networks from gene duplication.
J Theor Biol 2003, 222:199210. PubMed Abstract  Publisher Full Text

Ispolatov I, Krapivsky PL, Yuryev A: Duplicationdivergence model of protein interaction network.

TollRiera M, Bosch N, Bellora N, Castelo R, Armengol L, Estivill X, Mar Alba M: Origin of primate orphan genes: a comparative genomics approach.
Mol Biol Evol 2009, 26(3):603612. PubMed Abstract  Publisher Full Text

Chen K, Durand D, FarachColton M: NOTUNG: a program for dating gene duplications and optimizing gene family trees.
J Comput Biol 2000, 7(34):429447. PubMed Abstract  Publisher Full Text

Durand D, BV Halldórsson, Vernot B: A hybrid micromacroevolutionary approach to gene tree reconstruction.
J Comp Biol 2006, 13(2):320335. Publisher Full Text

Arvestad L, Berglund AC, Sennblad B: Bayesian gene/species tree reconciliation and orthology analysis using MCMC.
Bioinformatics 2003, 19(Suppl 1):i7i15. PubMed Abstract  Publisher Full Text

Stewart AJ, Seymour RM, Pomiankowski A: Degree dependence in rates of transcription factor evolution explains the unusual structure of transcription networks.
Proc Biol Sci 2009, 276(1666):24932501. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Foster DV, Kauffman SA, Socolar JES: Network growth models and genetic regulatory networks.

Fong JH, Keating AE, Singh M: Predicting specificity in bZIP coiledcoil protein interactions.
Genome Biol 2004, 5(2):R11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text