
1 Introduction

Evolution is a lens that allows us to study and understand a lot of phenomena in molecular biology [8]. The prototypical representation of any evolutionary history is a phylogeny, that is, a labeled tree whose leaves are extant species, or individuals, or simply data that we are currently able to analyze [11]. Phylogenetics is the research area of computational biology devoted to computing phylogenies. In this field, the focus has shifted over the years. The initial developments date back to the pioneering work of Cavalli-Sforza and Edwards [6, 9] in the 1960s, where some fundamental ideas in the study of phylogenies were introduced, namely the fact that evolution is a branching process where characters change, that an intuitive approach is to find the minimum total number of evolutionary events compatible with the available data, and the idea of maximizing the likelihood of the proposed interpretation.

The limited computational resources at the time, together with the kind of data available (phenotypical data were much more frequent than genomic data), initially put the emphasis on maximum parsimony character-based approaches. Subsequent advances, including some in the statistical modeling of evolution [11], made approaches based on inferring maximum likelihood phylogenies more attractive.

More recently, the pendulum has swung again, as parsimony methods have found new relevance, mostly as a result of new applications. The perfect phylogeny model, which is conceptually the simplest, is based on the infinite-sites assumption; that is, no character can mutate more than once in the whole tree. Although this assumption is quite restrictive, bordering on plainly wrong in some cases, the perfect phylogeny model has turned out to be splendidly coherent in the context of the haplotyping problem [3, 18], where we want to distinguish between the two haplotypes present in each individual when given only genotype data. More precisely, the interest here is in computing a set of haplotypes and a perfect phylogeny such that the haplotypes (i) label the vertices of the perfect phylogeny and (ii) explain the input set of genotypes. This context has been studied deeply in the last decade, giving rise to a number of beautiful algorithms [2, 7]. Those algorithms (and others on the same topics) exploit a number of nice combinatorial properties of perfect phylogenies and graphs. In [2], a graph-theoretical characterization of genotype matrices admitting a tree representation was given by using properties of partial orders and Dilworth’s Theorem [12]. In its original formulation, the haplotyping problem, under the perfect phylogeny model [7, 18], has revealed an interesting connection with the graph realization problem [29], a well-known graph problem used to decide whether a matroid is graphic.

Still, the perfect phylogeny model and the assumptions that have been central in previous decades cannot be employed without adaptations or improvements. One of the main open problems regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data. Following this research direction, we explore here some extensions of the perfect phylogeny model that are capable of modeling some processes whose study has been motivated by some recent applications.

In particular, we present two recent applications that can find only a partial solution in perfect phylogenies. The first application is to carcinogenesis, i.e., the factors and mechanisms that cause the onset of cancer in cells. Carcinogenesis can result from many combinations of mutations, but only a few sequences of mutations, called progression pathways, seem to account for most human tumors [28]. The main issue here is characterizing the common progression pathways as a first step towards identifying therapeutic targets and reliable diagnostic tests. The natural observation that tumors are evolving cell populations leads to phylogeny-based studies. At the same time, the intrinsic nature of cancer cells, that is, cells that proliferate quickly and in a degenerate way, results in a relatively high number of sites with multiple mutations (in violation of the infinite-sites assumption).

The second application concerns the study of protein domains. A protein domain is a part of the sequence and structure of a protein that can evolve, function, and exist independently of the rest of the protein chain. Many proteins consist of several structural domains, and a domain may appear in a variety of different proteins. In this case, it is quite frequent for a protein to acquire a domain and then to lose it (this is much more frequent than acquiring and then losing a whole gene). Again, the infinite-sites assumption can be violated.

In this survey, we pay special attention to an approach proposed in [23], based on the notion of persistent characters in the perfect phylogeny model and on its use to exclude some characters from the construction of the phylogeny. The general focus will be on computational issues, such as efficient algorithms.

2 Maximum Parsimony and the Perfect Phylogeny

Parsimony models, just like all models, are characterized by specific constraints that are based on biological assumptions. The first basic assumption states that each species or taxon is described by a set of attributes, called characters, where each character is inherited independently, and each character can assume one of a finite set of values, called states. Alternatively, the input is a matrix whose rows are the taxa and the columns are the characters. Another basic assumption about the evolution of characters, called homology, assumes that characters that are present in more than one species must be inherited from a common ancestor.

The natural computational problem has, as input, a matrix M with n rows and m columns, where each row can be viewed as an m-vector over the set of states of characters. The matrix describes a set of n taxa (species or individuals), corresponding to the rows of M, and a set of m characters, corresponding to the columns of M, and we seek a minimum-cost tree that explains the input matrix M. In a tree T explaining a matrix M, (i) the nodes are labeled by vectors of states of length m, (ii) each row of M labels exactly one node of T, (iii) the leaves are labeled by some rows of M, and (iv) each edge \((r_{1},r_{2})\) of T is labeled by the character c of M whose state in \(r_{1}\) differs from that in \(r_{2}\) (see Fig. 1 for an example).

Fig. 1 Example of a perfect phylogeny over a binary matrix M of five characters and four species

The cost of a tree is the number of mutations in the tree or, more formally, the sum over all edges of the tree of the cost of each edge, given by the number of characters with different states in the two nodes that make up the edge. In binary parsimony models – the most widely used – characters can take only the values (or states) zero and one, usually interpreted as the presence or absence of an attribute in the taxon.
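The cost definition above can be illustrated with a minimal sketch. The function names and the toy tree are hypothetical and serve only to make the definition concrete: node labels are tuples of states, and the tree is given as a list of edges.

```python
def hamming(u, v):
    """Number of positions in which two state vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def tree_cost(labels, edges):
    """Sum, over all edges, of the number of characters whose state differs."""
    return sum(hamming(labels[x], labels[y]) for x, y in edges)


# Hypothetical toy tree over three binary characters.
labels = {"r": (0, 0, 0), "a": (1, 0, 0), "b": (1, 1, 0), "c": (0, 0, 1)}
edges = [("r", "a"), ("a", "b"), ("r", "c")]

print(tree_cost(labels, edges))  # 3: one mutation on each edge
```

In this toy tree each of the three characters mutates exactly once, so the cost equals the number of characters.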

We will now discuss how computing the maximum parsimony phylogeny can be framed as a Steiner tree problem, which is one of the most widely studied problems in operations research. Recall that each input taxon is viewed as a binary vector of length m. The set of all possible binary vectors of length m forms a hypercube H, whose edges are exactly the pairs of vertices (u, v) where u and v differ in exactly one position. Let S be the set of input species in the phylogeny problem, and notice that S is also a subset of the vertices of the hypercube H. Then the Steiner tree problem asks for a minimum-cost subtree T of H such that all vertices in S are also in T. The cost of the solution T is the number of edges of T. The Steiner tree problem is NP-hard [21], even in the case of a binary alphabet with the metric induced by the Hamming distance [14], which is a restriction derived from the reduction from the maximum parsimony phylogeny to the Steiner tree on a hypercube. Extensive recent work, both experimental and theoretical, has focused on the binary character set with the Hamming metric [25, 27].

We can now introduce some specific parsimony models, starting from the simplest: the perfect phylogeny. A tree is called a perfect phylogeny if each character i mutates exactly once (i.e., there is exactly one edge such that its vertices are labeled by vectors differing in position i). Note that a perfect phylogeny (if it exists) minimizes the overall cost, as any perfect phylogeny has cost m. We call a perfect phylogeny directed or rooted if there is a distinguished node corresponding to the vector \([0,\ldots,0]\). It can be noticed immediately that we can transform a perfect phylogeny into a rooted perfect phylogeny by choosing an arbitrary node x and flipping (for each species) the state of each character that initially has value 1 in x (those characters are called active in x). There is a well-known linear-time algorithm for computing a binary perfect phylogeny [17], if it exists, and some more complicated fixed-parameter algorithms for the general perfect phylogeny problem, where the parameter is the maximum number of states for each character [1, 20]. In the following, unless specified differently, we mean by “perfect phylogeny” a rooted perfect phylogeny, that is, characters mutate only from state zero to one.
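The rooting transformation described above amounts to a component-wise XOR with the label of the chosen node x. The following is a minimal sketch under the assumption that x is given by its state vector; the name `root_at` and the toy matrix are hypothetical.

```python
def root_at(M, x):
    """Flip, in every row, each character that has state 1 in x (XOR with x)."""
    return [tuple(a ^ b for a, b in zip(row, x)) for row in M]


# Hypothetical toy matrix; the first row is chosen as the new root.
M = [(1, 0, 1), (1, 1, 1), (0, 0, 1)]
rooted = root_at(M, M[0])

print(rooted)  # [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
```

The chosen node becomes the all-zero vector, and since the same positions are flipped in every row, the number of differing characters between any two rows is unchanged.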

3 The Dollo Parsimony Model and Its Variants

Unfortunately, there are some evolutionary phenomena, such as homoplasy, that violate the fundamental assumptions of perfect phylogeny [11]. Two kinds of homoplasy are recurrent mutations and back mutations. A recurrent mutation occurs when a character changes state along divergent branches of the tree, and a back mutation implies that a character may go back to the ancestral state in descendant species after changing state. These two types of events justify the introduction of different models, differing mainly in the allowed homoplasies. Although the perfect phylogeny model does not allow any homoplasy, some extended models have been introduced to allow recurrent or back mutations.

One extended model is the Camin–Sokal parsimony model [5] (Fig. 2), where characters are directed; that is, only changes from zero to one are possible on any path from the root to a leaf. This fact means that the root is assumed to be labeled by the ancestral state with all zeros, and no back mutation is allowed, but any character can be acquired more than once, i.e., recurrent mutations are possible.

Fig. 2 Example of a Camin–Sokal parsimony model over the same set of characters as in Fig. 3. Observe that character \(c_{1}\) is gained twice in the tree

Another possible way of extending the perfect phylogeny model is the Dollo parsimony model, which allows any character to change state from zero to one only once, but puts no restriction on the number of times that it mutates from one to zero [11] (Fig. 3); that is, back mutations are allowed, but recurrent mutations are not. The definition of the Dollo parsimony model implies that characters are acquired at most once in the tree, but may be lost multiple times.
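The three models introduced so far differ only in which events they allow, which can be made explicit by counting gains (0 to 1) and losses (1 to 0) per character on a rooted tree. The sketch below assumes that the root is labeled by the all-zero vector and that the tree is given as a child-to-parent map; all names and both toy trees are hypothetical.

```python
from collections import Counter


def gains_and_losses(labels, parent):
    """Count 0->1 (gain) and 1->0 (loss) events per character index."""
    gains, losses = Counter(), Counter()
    for child, par in parent.items():
        for j, (p, c) in enumerate(zip(labels[par], labels[child])):
            if (p, c) == (0, 1):
                gains[j] += 1
            elif (p, c) == (1, 0):
                losses[j] += 1
    return gains, losses


def models(labels, parent):
    """Report which of the three models the labeled tree is compatible with."""
    gains, losses = gains_and_losses(labels, parent)
    m = len(next(iter(labels.values())))
    gained_once = all(gains[j] <= 1 for j in range(m))
    no_loss = sum(losses.values()) == 0
    return {
        "perfect": gained_once and no_loss,
        "camin_sokal": no_loss,   # recurrent gains allowed, no back mutation
        "dollo": gained_once,     # losses allowed, no recurrent gain
    }


# Hypothetical toy trees over two characters.
dollo_labels = {"r": (0, 0), "u": (1, 0), "v": (1, 1), "w": (0, 1)}
dollo_parent = {"u": "r", "v": "u", "w": "v"}   # character 0 is lost on (v, w)
cs_labels = {"r": (0, 0), "u": (1, 0), "v": (0, 1), "w": (1, 1)}
cs_parent = {"u": "r", "v": "r", "w": "v"}      # character 0 is gained twice

print(models(dollo_labels, dollo_parent))
print(models(cs_labels, cs_parent))
```

The first tree is compatible only with the Dollo model, the second only with the Camin–Sokal model, and neither is a perfect phylogeny.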

Fig. 3 Example of a Dollo parsimony model over a matrix of five characters. Observe that character \(c_{1}\) is the only one that is lost in the tree

An interesting application of the Dollo parsimony model is to the analysis of dynamic protein interactions [31], which has also shown an interesting connection with graph theory. Protein networks are graphs that model protein interactions. More precisely, the nodes of the graph are the proteins studied and the edges represent the interactions. A functional module is a subset of the proteins that have a common biological function. Usually, a functional module is not a generic graph, as it is made of overlapping cliques or quasi-cliques (which are called functional groups or complexes). It is possible to represent the interactions of those functional groups and complexes by a tree, called a tree of complexes, whose nodes are the functional groups (to be identified as cliques or quasi-cliques of an original protein network) and are such that the set of nodes consisting of the functional groups containing any given common protein is connected.

Let us denote each complex or protein with a distinct symbol from an alphabet Σ. Then the tree-of-complexes (TC) problem, over an instance consisting of a set \(A =\{ a_{1},a_{2},\cdots \,,a_{m}\}\) of subsets of Σ, asks for a tree T, if it exists, whose nodes are the input sets and are such that for each σ ∈ Σ, the set of nodes to which σ belongs is a subtree of T. Clearly, the TC problem may admit several solutions that explain a set A. The following property has never been, to the best of our knowledge, explicitly pointed out previously.

Lemma 1.

Let \(A =\{ a_{1},a_{2},\cdots \,,a_{m}\}\) be an instance of the TC problem admitting a tree of complexes T. Then T is compatible with the Dollo parsimony model (i.e., no character is acquired more than once).

Proof.

Let σ be a generic symbol, and let N(σ) be the set of nodes of T with σ. By the definition of the tree of complexes, N(σ) induces a connected subtree of T. It is not a limitation to assume that | N(σ) | > 1. Let x be the least common ancestor of N(σ). We claim that the incoming arc in x is the only one where σ is acquired. By the definition of the least common ancestor in a tree, (i) only nodes that are descendants of x can have the symbol σ, and (ii) there are two nodes \(v_{1},v_{2} \in N(\sigma )\) such that all paths in T connecting \(v_{1}\) and \(v_{2}\) pass through x; therefore σ is active in x, for otherwise N(σ) would be disconnected. Consequently, σ is acquired in x. Assume now, on the contrary, that σ is also acquired in node \(x_{1}\), which is a descendant of x, and let \(x_{2}\) be the parent of \(x_{1}\). Since σ is acquired in node \(x_{1}\), σ is active in \(x_{1}\) but not in \(x_{2}\). Since all paths in T connecting x and \(x_{1}\) must pass through \(x_{2}\), N(σ) is disconnected, contradicting the hypothesis that T is a tree of complexes.
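The connectivity condition on which the proof relies can be checked directly on a candidate tree. The sketch below is only an illustration: it assumes that the given edges already form a tree (this is not verified), and the names `is_tree_of_complexes`, `sets` and `edges` are hypothetical.

```python
def is_tree_of_complexes(sets, edges):
    """Check that, for every symbol, the nodes containing it are connected in T.

    `sets` maps each node to its set of symbols; `edges` lists the tree edges.
    """
    adj = {v: set() for v in sets}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for sigma in set().union(*sets.values()):
        nodes = {v for v in sets if sigma in sets[v]}   # N(sigma)
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:                                    # DFS restricted to N(sigma)
            v = stack.pop()
            for w in adj[v] & nodes:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if seen != nodes:
            return False
    return True


path = [(1, 2), (2, 3)]
good = {1: {"p", "q"}, 2: {"q", "r"}, 3: {"r"}}
bad = {1: {"p", "q"}, 2: {"q", "r"}, 3: {"r", "p"}}    # "p" skips node 2

print(is_tree_of_complexes(good, path), is_tree_of_complexes(bad, path))  # True False
```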

The connection between trees of complexes and graph theory is deeper. For instance, when A is the set of cliques of a chordal graph, the tree of complexes can be obtained from the clique tree associated with the chordal graph [31]. In fact, chordal graphs are exactly those that admit a clique tree representation. Recall that a graph is chordal if the only vertex-induced subgraphs that are also cycles have exactly three vertices [16].

One of the main open questions in [31] is how to provide a characterization of the protein networks that admit a tree-of-complexes representation. Lemma 1 shows the equivalence of this open problem to the question of finding the protein networks that admit an evolutionary representation of functional groups compatible with a Dollo parsimony model.

As pointed out in the introduction, the perfect phylogeny model is too restrictive for some applications, since it cannot explain the evolution of characters in the presence of homoplasy events. On the other hand, the optimization problems associated with the Dollo and Camin–Sokal parsimony models are NP-hard [11]. Moreover, these models are too general to be useful in practical applications where interesting characters are usually affected by only a few back mutations or recurrent mutations. Therefore, research activity has focused on finding models that couple computational tractability with the capability to adequately model actual phenomena, for example in the context of proteomics when one is analyzing the properties of multidomain proteins [23, 24].

Notice that, unlike a perfect phylogeny, a rooted Dollo phylogeny always exists for any input matrix. This can be seen by assuming a special internal node labeled \([1,\ldots,1]\) that is also the least common ancestor of all leaves: since there is no restriction on mutations from 1 to 0, any binary vector can be generated from it. Although there is no guarantee that such a tree is optimal, it suffices to prove the existence of a Dollo phylogeny. However, such a tree makes no sense from a biological point of view, because it implies the existence of an ancestral taxon that has all of the characters in the extant taxa.
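The existence argument above is constructive, and the construction is short enough to sketch. The node names `root`, `all` and `s0, s1, …` are hypothetical, and the tree is represented as a child-to-parent map.

```python
def trivial_dollo_tree(M):
    """Build root (all zeros) -> internal node (all ones) -> one leaf per row."""
    m = len(M[0])
    labels = {"root": (0,) * m, "all": (1,) * m}
    parent = {"all": "root"}            # every character is gained on this edge
    for i, row in enumerate(M):
        labels[f"s{i}"] = tuple(row)
        parent[f"s{i}"] = "all"         # each leaf only loses characters
    losses = sum(tuple(row).count(0) for row in M)
    return labels, parent, losses


M = [(1, 0, 1), (0, 1, 1)]
labels, parent, losses = trivial_dollo_tree(M)

print(losses)  # 2: one loss on each leaf edge
```

Every character is gained exactly once, so the tree is a Dollo phylogeny, but the number of losses equals the number of zeros in M, which is in general far from optimal.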

We have already pointed out that the problem of constructing a maximum parsimony tree is a special case of the well-studied problem of a Steiner tree on a hypercube, but the set of allowed homoplasies can influence in a fundamental way the computational complexity of the resulting problem. An initial effort towards describing new, relevant variants of the Dollo parsimony model has been the introduction of the conservative Dollo and static Dollo parsimony models [24].

The static Dollo parsimony model is a Dollo parsimony model where, for each node x and for each active character c in x, there exists a leaf l that is a descendant of x and where c is active. The conservative Dollo parsimony model is a Dollo parsimony model where, for each node x and for each pair \(c_{1}\) and \(c_{2}\) of active characters in x, there exists a leaf l that is a descendant of x and where both \(c_{1}\) and \(c_{2}\) are active. Notice that both of these models forbid the presence of an ancestral active character that is not shared with some extant species. The main motivations for those models arise in the study of multidomain protein evolution in terms of domain insertions and losses. A protein domain is a part of the sequence and structure of a protein that can evolve, function, and exist independently of the rest of the protein chain; the approach followed represents the domain architectures as taxa and the domains as characters. A character that is active for a certain taxon represents the fact that a domain is part of a given architecture. Hence, a state change from 0 to 1 corresponds to the addition of a domain, and a change from 1 to 0 corresponds to a domain loss. A conservative Dollo parsimony model for a protein family is a history where each domain pair that is observed in an extant taxon has been generated from a single merge event. Since the simultaneous presence of two domains in one protein often enhances the functionality of that protein, the model suggests it is highly unlikely that such a pair has been separated (and its enhanced functionality has not survived) in all extant species.
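The two additional conditions can be checked on a labeled rooted tree as in the following sketch. It tests only the static and conservative conditions, not the Dollo property itself. Two choices are assumptions of this sketch: a leaf counts as its own descendant, and a pair may consist of the same character twice, so that the conservative condition implies the static one. All names and the toy tree are hypothetical.

```python
from itertools import combinations_with_replacement


def leaves_below(children, x):
    """Leaves in the subtree rooted at x (x itself if it has no children)."""
    kids = children.get(x, [])
    if not kids:
        return [x]
    return [leaf for k in kids for leaf in leaves_below(children, k)]


def is_static(labels, children):
    for x in labels:
        leaves = leaves_below(children, x)
        for j, state in enumerate(labels[x]):
            if state == 1 and not any(labels[l][j] == 1 for l in leaves):
                return False
    return True


def is_conservative(labels, children):
    for x in labels:
        leaves = leaves_below(children, x)
        active = [j for j, s in enumerate(labels[x]) if s == 1]
        for i, j in combinations_with_replacement(active, 2):
            if not any(labels[l][i] == 1 and labels[l][j] == 1 for l in leaves):
                return False
    return True


# Hypothetical toy tree: both characters are active in x, but no leaf keeps both.
labels = {"r": (0, 0), "x": (1, 1), "l1": (1, 0), "l2": (0, 1)}
children = {"r": ["x"], "x": ["l1", "l2"]}

print(is_static(labels, children), is_conservative(labels, children))  # True False
```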

Although the optimization problems associated with the static and conservative Dollo parsimony models, where the number of back mutations is minimized, are both NP-hard, there are two fast algorithms for testing whether such a phylogeny exists [24]. However, an experimental analysis [24] shows that a sizable minority of multidomain protein superfamilies do not admit a static Dollo parsimony model (and, a fortiori, a conservative Dollo parsimony model). Hence an even less restrictive model is necessary to successfully model those cases.

4 Persistent Phylogeny

An important ingredient that may affect the applicability and success of parsimony methods is the set of characters used to infer the phylogeny. The issue of selecting characters was addressed in [23], where the notion of a persistent or stable character was proposed. Such characters are allowed to violate the properties of a perfect phylogeny, as a persistent character is gained exactly once but can be lost at most once in the tree.

Based on this notion, a different model, intermediate between the perfect phylogeny and the Dollo parsimony models and called the persistent phylogeny, has been proposed [4]. Notice that a persistent perfect phylogeny is also a Dollo phylogeny, and even a static Dollo parsimony model. In fact, a persistent phylogeny is a static Dollo parsimony model where all but at most one of the descendants of a species with any given character must retain that character. Moreover, differently from the Dollo parsimony model, some matrices may not admit a persistent perfect phylogeny. Therefore, the main computational problem that we will discuss in this section is to compute (if it exists) a persistent perfect phylogeny compatible with a given matrix M. The computational complexity of this problem is still unsettled; there exists an algorithm that is exponential in the number of characters but polynomial in the number of species [4]. This time complexity makes the algorithm of practical interest for the biological applications discussed above, as usually the number of species is large, whereas the number of characters is bounded.

The notion of an overlap graph, which is a graph whose nodes are the characters and where two characters are adjacent if and only if there exists a species with both characters, is useful in this context. In fact, if a matrix M admits a persistent phylogeny, then the corresponding overlap graph is chordal [23].
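As a small illustration of the definition, the overlap graph can be built directly from the matrix. The sketch below represents characters by their column indices and returns the edge set; the function name is hypothetical, and no chordality test is included.

```python
from itertools import combinations


def overlap_graph(M):
    """Edges {i, j} for every pair of characters shared by at least one species."""
    m = len(M[0])
    return {
        (i, j)
        for i, j in combinations(range(m), 2)
        if any(row[i] == 1 and row[j] == 1 for row in M)
    }


M = [(1, 1, 0), (0, 1, 1)]
print(sorted(overlap_graph(M)))  # [(0, 1), (1, 2)]
```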

One of the first applications of the persistent phylogeny model was to the study of introns, which are sequences of noncoding DNA in eukaryotic genes. In fact, the Dollo parsimony model has led to an incorrect evolutionary tree for such data, whereas assuming the persistent phylogeny model has resulted in an evolutionary tree consistent with the Coelomata hypothesis, that is, that there is a clade comprising arthropods and chordates. In contrast, an analysis of more variable introns favored the Ecdysozoa topology, that is, a clade of arthropods and nematodes [30]. The controversy about the Coelomata and Ecdysozoa topologies is one of the most discussed and persistent problems in animal phylogeny.

For the sake of completeness, we recall here the definition of a persistent phylogeny given in [4]. Let M be a binary matrix of size n × m. The persistent phylogeny for M is a rooted tree T that satisfies the following properties:

  1. Each node x of T is labeled by a vector \(l_{x}\) of length m.

  2. The root of T is labeled by a vector of all zeros, and for each node x of T the value \(l_{x}[j] \in \{ 0,1\}\) is the state of character \(c_{j}\) at this node.

  3. For each character \(c_{j}\), there are at most two edges e = (x, y) and e′ = (u, v) such that \(l_{x}[j]\neq l_{y}[j]\) and \(l_{u}[j]\neq l_{v}[j]\) (representing a change in the state of \(c_{j}\)) and such that e and e′ occur along the same path from the root of T to a leaf of T; if e is closer to the root than e′, then the edge e, where \(c_{j}\) changes from 0 to 1, is labeled \(c_{j}^{+}\), while the edge e′ is labeled \(c_{j}^{-}\).

  4. Each row of M labels exactly one leaf of T.

Thus the main problem investigated in this section, called the persistent phylogeny problem, is this: given a binary matrix M as input, find a persistent phylogeny for M if such a tree exists.
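The four properties translate into a simple validity check for a candidate tree. The following is a minimal sketch, not the algorithm of [4]: it only verifies a given tree, which is assumed to be supplied as a child-to-parent map. With an all-zero root and at most one gain per character, any loss necessarily lies below the gain on the same root-to-leaf path, so counting events suffices. All names and the toy tree are hypothetical.

```python
def is_persistent_phylogeny(M, labels, parent, root):
    """Check the defining properties on a labeled rooted tree (child -> parent)."""
    m = len(M[0])
    if any(labels[root]):
        return False                       # property 2: the root is all zeros
    gains, losses = [0] * m, [0] * m
    for child, par in parent.items():
        for j in range(m):
            change = (labels[par][j], labels[child][j])
            gains[j] += change == (0, 1)
            losses[j] += change == (1, 0)
    if any(g > 1 for g in gains) or any(l > 1 for l in losses):
        return False                       # property 3: one gain, at most one loss
    leaves = set(labels) - set(parent.values())
    leaf_labels = [tuple(labels[v]) for v in leaves]
    # property 4: each row of M labels exactly one leaf
    return all(leaf_labels.count(tuple(row)) == 1 for row in M)


# Hypothetical toy instance: character 0 is gained on (root, n1) and lost on (n2, s3).
M = [(1, 0), (1, 1), (0, 1)]
labels = {"root": (0, 0), "n1": (1, 0), "n2": (1, 1),
          "s1": (1, 0), "s2": (1, 1), "s3": (0, 1)}
parent = {"n1": "root", "s1": "n1", "n2": "n1", "s2": "n2", "s3": "n2"}

print(is_persistent_phylogeny(M, labels, parent, "root"))  # True
```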

We will devote the remainder of the section to the discussion of the algorithm presented in [4] for determining whether an input matrix M admits a persistent phylogeny and, if that is the case, for computing such a phylogeny (although the solution computed might not be the most parsimonious).

First of all, we recall that there exists a very simple test to determine if M admits an unrooted perfect phylogeny. Two characters \(c_{1}\) and \(c_{2}\) are in conflict in the matrix M if and only if the two corresponding columns of M contain the four possible rows (0, 0), (0, 1), (1, 1), (1, 0), called the four gametes. A matrix M has an unrooted perfect phylogeny if and only if no two of its characters are in conflict. The test for a matrix M in the rooted case consists of verifying that no pair of columns of M contains all of the three configurations (0, 1), (1, 1), (1, 0).

Conflicting characters in a matrix can be represented by an undirected conflict graph \(G_{\mathrm{c}} = (C,E \subseteq C \times C)\), where the nodes are the characters and two characters are adjacent if they are in conflict in M. Clearly, having an edgeless conflict graph is a necessary but not a sufficient condition for having a rooted perfect phylogeny; it does imply, however, that a persistent phylogeny exists [4]. Moreover, the conflict graph is also a measure of the complexity of an instance of the reconstruction of the persistent perfect phylogeny.
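Both tests and the conflict graph are easy to state in code. The sketch below identifies characters with column indices and assumes a binary matrix; the function names are hypothetical.

```python
from itertools import combinations


def in_conflict(M, i, j):
    """Four-gamete test: columns i and j exhibit (0,0), (0,1), (1,0) and (1,1)."""
    return len({(row[i], row[j]) for row in M}) == 4


def rooted_conflict(M, i, j):
    """Rooted test: columns i and j exhibit (0,1), (1,0) and (1,1)."""
    return {(0, 1), (1, 0), (1, 1)} <= {(row[i], row[j]) for row in M}


def conflict_graph(M):
    """Edge set of the conflict graph over character indices."""
    m = len(M[0])
    return {(i, j) for i, j in combinations(range(m), 2) if in_conflict(M, i, j)}


M = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
print(sorted(conflict_graph(M)))  # [(0, 1), (1, 2)]
```

Columns 0 and 2 of the toy matrix show only the pairs (0, 1) and (1, 0), so they are not in conflict, whereas the other two pairs of columns show all four gametes.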

4.1 A Graph Theoretical Solution of the Persistent Phylogeny Problem

We can associate an extended matrix \(M_{\mathrm{e}}\) to the input matrix M by replacing each column c of M by a pair of columns \(({c}^{+},{c}^{-})\), where \({c}^{+}\) is called the positive character and \({c}^{-}\) is called the negated character. Moreover, for each row s of M, \(M_{\mathrm{e}}[s,{c}^{+}] = 1\) and \(M_{\mathrm{e}}[s,{c}^{-}] = 0\) whenever M[s, c] = 1, and \(M_{\mathrm{e}}[s,{c}^{+}] = M_{\mathrm{e}}[s,{c}^{-}] =\ ?\) otherwise. We want to complete the extended matrix \(M_{\mathrm{e}}\), obtaining a new matrix \(M_{\mathrm{f}}\) which is equal to \(M_{\mathrm{e}}\) for all species s and characters c such that M[s, c] = 1, while \(M_{\mathrm{f}}[s,{c}^{+}] = M_{\mathrm{f}}[s,{c}^{-}]\) whenever M[s, c] = 0 (in this case we can interpret \(M_{\mathrm{f}}[s,{c}^{-}] = 1\) as the fact that the species s does not have the character c, but some of its ancestors used to have it). The idea of completing a matrix with missing data in order to obtain a perfect phylogeny was introduced in [22], but in our case the completion has some constraints, making the algorithm of [22] inapplicable. Finding such a matrix \(M_{\mathrm{f}}\) that admits a perfect phylogeny is equivalent to computing a persistent phylogeny on the original matrix M. The following theorem was proved in [4].

Theorem 1.

Let M be a binary matrix and let \(M_{\mathrm{e}}\) be the extended matrix associated with M. Then M admits a persistent phylogeny if and only if there exists a completion \(M_{\mathrm{f}}\) of \(M_{\mathrm{e}}\) admitting a perfect phylogeny.
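The construction of the extended matrix and the constraints on a completion can be sketched as follows. The column layout (the pair for character j occupies positions 2j and 2j+1), the use of the string `"?"` for a missing entry, and the function names are assumptions of this sketch. Testing whether a completion admits a perfect phylogeny is not included.

```python
def extended_matrix(M):
    """Replace each column c by (c+, c-): (1, 0) where M is 1, ('?', '?') where M is 0."""
    return [
        [v for x in row for v in ((1, 0) if x == 1 else ("?", "?"))]
        for row in M
    ]


def is_completion(M, Mf):
    """Check that Mf keeps (1, 0) where M is 1 and uses (0, 0) or (1, 1) where M is 0."""
    for row, frow in zip(M, Mf):
        for j, x in enumerate(row):
            pair = (frow[2 * j], frow[2 * j + 1])
            if x == 1 and pair != (1, 0):
                return False
            if x == 0 and pair not in {(0, 0), (1, 1)}:
                return False
    return True


M = [[1, 0], [0, 1]]
print(extended_matrix(M))  # [[1, 0, '?', '?'], ['?', '?', 1, 0]]
print(is_completion(M, [[1, 0, 0, 0], [1, 1, 1, 0]]))  # True
```

In the completion shown, the pair (1, 1) in the second row records that the second species lacks the first character although one of its ancestors had it.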

Figure 4b provides an example of an extended matrix \(M_{\mathrm{e}}\), with respect to Fig. 4a, whose conflict graph is given in Fig. 5.

Fig. 4 An example of a binary matrix M which is the input of the persistent phylogeny problem, and its associated extended matrix. (a) Binary matrix M. (b) Extended matrix \(M_{\mathrm{e}}\)

Fig. 5 The conflict graph \(G_{\mathrm{c}}\) associated with the binary matrix M of Fig. 4a

4.1.1 The Red–Black Graph and the Realization of a Character

In order to find a completion of the input matrix \(M_{\mathrm{e}}\), another graph representation of the input matrix, called the red–black graph, denoted by \(G_{\mathrm{RB}}\), can be used. The latter consists of the edge-colored graph (V, E), where \(V = C \cup S\), with \(C =\{ c_{1},\cdots \,,c_{m}\}\) and \(S =\{ s_{1},\cdots \,,s_{n}\}\) being the sets of positive characters and species of the matrix \(M_{\mathrm{e}}\), and E is defined as follows: (s, c) ∈ E is a black edge if and only if \(M_{\mathrm{e}}[s,{c}^{+}] = 1\) and \(M_{\mathrm{e}}[s,{c}^{-}] = 0\). The algorithm for finding a persistent phylogeny basically determines a sequence of character realizations, which are represented as very specific operations on the red–black graph. The graph operation is called a realization of a character and consists of removing black edges and adding or removing red edges.

Let c be a character, and let \(\mathcal{C}(c)\) be the connected component of the graph \(G_{\mathrm{RB}}\) containing the node c. Then realizing the character c on \(G_{\mathrm{RB}}\) consists of the following steps:

  (i) Adding the red edges (c, s) for all species \(s \in \mathcal{C}(c)\) such that (c, s) is not an edge of \(G_{\mathrm{RB}}\);

  (ii) Removing all black edges (c, s) (in this case c is called active);

  (iii) If an active character \(c_{1}\) is connected by red edges to all species in \(\mathcal{C}(c_{1})\), then all edges incident to \(c_{1}\) are deleted and \(c_{1}\) is called free.

Realizing a character c is associated with a canonical completion of c in the matrix \(M_{\mathrm{e}}\) by completing incomplete pairs of characters \({c}^{+},{c}^{-}\) as \(M_{\mathrm{f}}[s,{c}^{+}] = M_{\mathrm{f}}[s,{c}^{-}] = 1\) for each species \(s \in \mathcal{C}(c)\), while \(M_{\mathrm{f}}[s,{c}^{+}] = M_{\mathrm{f}}[s,{c}^{-}] = 0\) for the other species; we recall that in a completion, \(M_{\mathrm{f}}[s,{c}^{+}] = M_{\mathrm{e}}[s,{c}^{+}]\) and \(M_{\mathrm{f}}[s,{c}^{-}] = M_{\mathrm{e}}[s,{c}^{-}]\) if \(M_{\mathrm{e}}[s,{c}^{+}]\neq M_{\mathrm{e}}[s,{c}^{-}]\).

Consequently, any ordering \(\langle c_{i_{1}},\ldots,c_{i_{m}}\rangle\) of the character set represents a possible solution, obtained by realizing the characters according to the ordering. Not all orderings lead to a feasible solution, however: only those whose resulting red–black graph is edgeless do [4]. Nevertheless, the fundamental result of [4] is the following.

Theorem 2.

Let M be a binary matrix and \(G_{\mathrm{RB}}\) be the red–black graph for the matrix M. Then M admits a persistent phylogeny if and only if there exists an ordering of the characters of M such that the realization of characters in that ordering in the graph \(G_{\mathrm{RB}}\) results in an edgeless red–black graph.

The main consequence of Theorem 2 is that one algorithm for finding a persistent phylogeny, if it exists, is to enumerate all possible orderings of the character set and to compute the red–black graph resulting from realizing the characters in each such order. In fact, the algorithm of [4] builds a decision tree that explores all orderings of the set C of characters. An experimental analysis of the computational performance of this algorithm in building a persistent phylogeny has been presented in [4].
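The brute-force strategy suggested by Theorem 2 can be sketched as follows. This is an illustration of steps (i)–(iii) as stated above, not the decision-tree algorithm of [4]. Two choices are assumptions of the sketch: step (iii) is applied to every active character after each realization and repeated until nothing changes, and edges are stored as (species, character) pairs in two sets. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def component_species(black, red, c):
    """Species in the connected component of character c."""
    adj = defaultdict(set)
    for s, ch in black | red:
        adj[("s", s)].add(("c", ch))
        adj[("c", ch)].add(("s", s))
    seen, stack = {("c", c)}, [("c", c)]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return {x for kind, x in seen if kind == "s"}


def realize(black, red, active, c):
    """Realize character c, modifying the edge sets in place."""
    comp = component_species(black, red, c)
    neighbours = {s for s, ch in black | red if ch == c}
    red.update((s, c) for s in comp - neighbours)               # step (i)
    black.difference_update({e for e in black if e[1] == c})    # step (ii)
    active.add(c)
    changed = True
    while changed:                                              # step (iii)
        changed = False
        for a in list(active):
            red_nb = {s for s, ch in red if ch == a}
            if red_nb and component_species(black, red, a) == red_nb:
                red.difference_update({(s, a) for s in red_nb})
                changed = True


def realize_ordering(M, chars, order):
    """Build the red-black graph of M and realize the characters in the given order."""
    black = {(s, c) for s, row in enumerate(M)
             for j, c in enumerate(chars) if row[j] == 1}
    red, active = set(), set()
    for c in order:
        realize(black, red, active, c)
    return black, red


def find_ordering(M, chars):
    """Try all orderings; return one that leaves an edgeless graph, or None."""
    for order in permutations(chars):
        black, red = realize_ordering(M, chars, order)
        if not black and not red:
            return list(order)
    return None


# Hypothetical toy matrix with species 0, 1, 2 and characters a, b.
M = [[1, 0], [1, 1], [0, 1]]
print(find_ordering(M, "ab"))  # ['a', 'b']
```

On the toy matrix, realizing a adds the red edge between a and species 2; realizing b then removes the remaining black edges, after which a is connected by red edges to every species of its component and becomes free, leaving an edgeless graph. The enumeration is exponential in the number of characters, in line with the complexity discussed above.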

Example 1.

Consider the matrix M given in Fig. 4a. Figure 6 shows an example of a realization of characters in the red–black graph according to the ordering ⟨d, c, e, b, a⟩. The binary matrix M is associated with the conflict graph \(G_{\mathrm{c}}\) represented in Fig. 5. The pairs of characters in conflict are (a, c), (b, c), (c, d), (a, e), (b, e), and (d, e). The ordering \(\langle d,c,e,b,a\rangle\) leads to the canonical completion M′ shown in Fig. 7. The perfect phylogeny compatible with M′ is also a persistent phylogeny for the input matrix M and is represented in Fig. 8.

Fig. 6 The realization of ⟨d, c, e, b, a⟩ on the red–black graph \(G_{\mathrm{RB}}\)

Fig. 7 A completion M′ of the extended matrix \(M_{\mathrm{e}}\) of Fig. 4b

Fig. 8 Realizing the characters in the ordering ⟨d, c, e, b, a⟩ results in a persistent phylogeny for M

5 The Near-Perfect Phylogeny

We recall that there are instances M of the perfect phylogeny problem that admit no solution, motivating the need for different models. Another approach to the construction of a most likely phylogeny in accordance with the input data is to move towards an optimization problem, such as identifying a largest subset of characters that admits a perfect phylogeny or, equivalently, removing the minimum number of columns from the input matrix M so that the resulting matrix has a perfect phylogeny. This problem is also called the character compatibility problem. Unfortunately, these optimization problems are intractable, as identifying the largest subset of characters admitting a perfect phylogeny is equivalent to Max Clique [15], and also shares its inapproximability [19]. Consequently, different versions must be sought.

An interesting problem stems from the observation that the perfect phylogeny is the minimum-cost Steiner tree where the set of species sharing a common state for any given character forms a connected subtree, and that the minimum cost is exactly equal to m (i.e., the number of characters). The near-perfect phylogeny problem (NPPP) has a matrix M as input and asks for a minimum-cost Steiner tree whose leaves are taken from the species (i.e., the rows of M) and in which all species label some vertices of the tree. By the previous argument about the optimum, the cost of any solution can be expressed as m + q, where q (which is always nonnegative) is called the penalty. Note that the penalty can be related to a back mutation or a recurring mutation, since we have no way to distinguish or prioritize these.
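The penalty of a given tree follows directly from the cost definition. The sketch below assumes that every character mutates at least once in the tree (i.e., no column of M is constant), so that the cost is at least m; the function name and the toy tree are hypothetical.

```python
def penalty(labels, edges):
    """Return q = cost - m for a labeled tree given as a list of edges."""
    m = len(next(iter(labels.values())))
    cost = sum(
        sum(a != b for a, b in zip(labels[x], labels[y])) for x, y in edges
    )
    return cost - m


# Hypothetical toy tree in which character 0 mutates twice.
labels = {"r": (0, 0), "u": (1, 0), "v": (0, 1), "w": (1, 1)}
edges = [("r", "u"), ("r", "v"), ("v", "w")]

print(penalty(labels, edges))  # 1: cost 3 over m = 2 characters
```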

The first result in this setting was an \(O(n{m}^{q}{2}^{{q}^{2}{r}^{2} })\) time algorithm [13] which draws upon some of the ideas of the first fixed-parameter algorithm for the perfect phylogeny problem [1] to find a solution with a penalty of at most q, if such a tree exists. Unfortunately, such a time complexity makes the algorithm impracticable; in particular, the \(m^{q}\) factor limits its usefulness to very small values of q. From a theoretical point of view, the main question left open in [13] was whether the NPPP admits a fixed-parameter (FPT) algorithm [10] when the parameters are q and r, r being the maximum number of states of any character.

The question was answered positively in [27] for the binary perfect phylogeny, that is, in the case r = 2. This case is especially important in both theory and practice. In fact, a study of the perfect phylogeny problem shows that ideas originating from the two-state case have, in time, percolated up to the three-state and four-state cases and then up to the r-state case, for any fixed r. Therefore, the binary-state algorithm is a strong hint that an FPT algorithm exists for any fixed r. From a practical point of view, most of the available data are binary or can be transformed into binary characters via suitable clustering, and therefore the algorithm of [27], which has \(O(72^{q} + 8^{q}nm^{2})\) time complexity, can be applied.

We will briefly sketch the main ideas of the algorithm of [27], which follows a randomized divide-and-conquer approach where, at each stage, a conflicting character c is picked at random, and then c is allowed to mutate only once in the tree. Since c mutates only once, the Steiner tree instance is partitioned into two subtrees \(T_{0}\) and \(T_{1}\), according to the state that each species assumes in c. Then two vertices \(r_{0}\) and \(r_{1}\) are chosen at random from \(T_{0}\) and \(T_{1}\), respectively (note that \(r_{0}\) and \(r_{1}\) might be Steiner vertices, so we cannot sample directly from the leaves or the species). A new edge \((r_{0}, r_{1})\) is created and labeled by the character c. Then the algorithm operates recursively on \(T_{0}\) and \(T_{1}\), guessing no more than q edges overall and checking that, at the end, the conflict graph is sufficiently small to be solved via exhaustive enumeration.
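The partitioning step described above can be illustrated with a minimal sketch for binary matrices. This is not the algorithm of [27]: it covers only the choice of a random conflicting character and the resulting split of the species, and it omits the selection of the two (possibly Steiner) endpoints, the recursion, and the bound of q guessed edges. The conflict graph is assumed here to join two characters whenever they fail the four-gamete test, and the function names are hypothetical.

```python
import random
from itertools import combinations


def conflict_graph(M):
    """Edges (a, b), a < b, between binary columns of M that exhibit
    all four pairs 00, 01, 10 and 11 (assumed definition of conflict)."""
    m = len(M[0])
    return {
        (a, b)
        for a, b in combinations(range(m), 2)
        if len({(row[a], row[b]) for row in M}) == 4
    }


def split_on_random_conflicting_character(M, rng):
    """Pick a random character that has at least one conflict and
    partition the rows of M by their state in that character.

    Returns None when the conflict graph has no edges.
    """
    conflicts = conflict_graph(M)
    candidates = sorted({c for edge in conflicts for c in edge})
    if not candidates:
        return None
    c = rng.choice(candidates)
    part0 = [row for row in M if row[c] == 0]
    part1 = [row for row in M if row[c] == 1]
    return c, part0, part1


M = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
c, part0, part1 = split_on_random_conflicting_character(M, random.Random(0))
print("split on character", c)
print("state 0:", part0)
print("state 1:", part1)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a full implementation would recurse on the two parts.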

The correctness of the algorithm derives mainly from the observation that at most q characters can mutate more than once. Therefore, when the conflict graph is large, the random choice of c has a high probability of being correct (i.e., there exists a solution where c mutates once), whereas if the conflict graph is small, then the optimal solution can be computed via brute force. The analysis of the time complexity is quite involved, as computing the vertices \(r_{0}\) and \(r_{1}\) efficiently requires exploiting some combinatorial properties of Buneman graphs [26] (which are related to Steiner trees). The aforementioned \(O(72^{q} + 8^{q}nm^{2})\) time complexity is for a derandomized version of the algorithm. If we settle for finding the optimal solution with probability at least \(8^{-q}\), then the time complexity can be lowered to \(O(18^{q} + 8nm^{2})\).
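As a standard aside that is not taken from [27], the trade-off between success probability and running time can be made explicit. Suppose, as an assumption for this sketch, that a single run of a randomized algorithm returns an optimal solution with probability at least \(p\), and that runs are independent. Then the probability that all of \(k\) runs fail satisfies

\[
(1 - p)^{k} \le e^{-pk},
\]

so that

\[
k = \left\lceil \frac{\ln(1/\delta)}{p} \right\rceil
\]

runs suffice to bring the failure probability below any chosen \(\delta \in (0, 1)\). The number of repetitions therefore grows like \(1/p\). This describes plain repetition only and says nothing about how the derandomized version is obtained.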

A related problem studied in [25] is H(p, q)-NPP, where the input is a set of genotypes and we want to compute a phylogeny where the vertices are labeled with haplotypes so that (i) at most p sites can mutate, each at most q times (i.e., have at most q homoplasy events), and (ii) the set of haplotypes labeling the vertices is able to explain the input genotypes. An algorithm for H(1, q)-NPP, that is, when only one character is allowed to have at most q recurrent mutations, was presented in [25]. That algorithm nicely complements that of [27], where no restriction on the number of characters affected exists, and is based on an analysis of the conflict graph, the main point being the property that the character with recurrent mutations must be the only one with two adjacent characters in the conflict graph.

6 Open Problems

This chapter has presented some generalizations of the perfect phylogeny model motivated by recent biological applications in which evolution was investigated as a character-based process. The availability of a large amount of genomic and proteomic data makes the use of genetic attributes or biological markers quite appealing in evolution analysis, which makes computationally efficient parsimony models even more important. However, there is a huge gap between tractable and NP-hard parsimony models that needs to be filled. At one extreme is the perfect phylogeny model, which has a linear-time solution but only a few specific biological applications. At the other extreme are models such as the Dollo and Camin–Sokal parsimony models, which are often too generic from a biological viewpoint and computationally impractical. A middle ground is occupied by the persistent perfect phylogeny model, for which some efficient, practical algorithms have recently been presented [4], and for which some specific applications such as the analysis of protein networks and domains have been found [23, 31]. However, this research direction still needs to be explored. In particular, finding a polynomial-time algorithm for the persistent phylogeny model is still an open problem, and the novelty of the algorithm of [4] hints that even more practical approaches are possible, even for some optimization versions of the problem that deserve to be investigated. It must be pointed out that the persistent phylogeny model is useful for detecting persistent characters that can be excluded from the evolutionary reconstruction process. In fact, having computational tools to detect characters that should or should not be included in a parsimony model analysis can improve the correctness of the tree that is built from such characters [23].

From a theoretical point of view, the investigation of variants of the perfect phylogeny model and restrictions of the Dollo parsimony model other than those presented here is still an important research direction. In particular, the tree-of-complexes problem discussed in this chapter reveals that there may be interesting, strong connections between graph theory and parsimony models representing the evolutionary relationships between functional modules in a protein network. To conclude, the characterization of the structural properties of protein networks and of the overlap graphs of characters seems to be a promising novel direction for building parsimony models in a more efficient and biologically meaningful way.