Abstract
Orthology detection from sequence similarity remains a difficult and computationally expensive problem for gene families with large numbers of gene duplications and losses. REvolutionH-tl implements a new graph-based approach that first identifies orthogroups, orthology, and paralogy relationships, and then uses this information in a second step to infer event-labeled gene trees and their reconciliation with an inferred species tree. It requires neither gene trees nor species trees as input and settles for a maximal subtree reconciliation in cases where noise or horizontal gene transfer precludes a global reconciliation. The accuracy of the tool is comparable to that of competing tools at substantially reduced computational cost. REvolutionH-tl is freely available at https://pypi.org/project/revolutionhtl/.
References
Aho, A.V., Sagiv, Y., Szymanski, T.G., Ullman, J.D.: Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10(3), 405–421 (1981). https://doi.org/10.1137/0210030
Altschul, S.F., et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997). https://doi.org/10.1093/nar/25.17.3389
Bininda-Emonds, O.: Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life. Computational Biology, Springer, Dordrecht (2004). https://doi.org/10.1007/978-1-4020-2330-9
Buchfink, B., Reuter, K., Drost, H.G.: Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021). https://doi.org/10.1038/s41592-021-01101-x
Dress, A., Huber, K.T., Koolen, J., Moulton, V., Spillner, A.: Basic Phylogenetic Combinatorics. Cambridge University Press, Cambridge (2011). https://doi.org/10.1017/CBO9781139019767
Emms, D.M., Kelly, S.: OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 1–14 (2019)
Fitch, W.: Homology: a personal view on some of the problems. Trends Genet. 16, 227–231 (2000). https://doi.org/10.1016/S0168-9525(00)02005-9
Fuentes, D., Molina, M., Chorostecki, U., Capella-Gutiérrez, S., Marcet-Houben, M., Gabaldón, T.: PhylomeDB V5: an expanding repository for genome-wide catalogues of annotated gene phylogenies. Nucleic Acids Res. 50(D1), D1062–D1068 (2021). https://doi.org/10.1093/nar/gkab966
Gabaldón, T., Koonin, E.V.: Functional and evolutionary implications of gene orthology. Nat. Rev. Genet. 14(5), 360–366 (2013)
Geiß, M., et al.: Best match graphs. J. Math. Biol. 78(7), 2015–2057 (2019). https://doi.org/10.1007/s00285-019-01332-9
Geiß, M.: Best match graphs and reconciliation of gene trees with species trees. J. Math. Biol. 80(5), 1459–1495 (2020)
Hellmuth, M.: Biologically feasible gene trees, reconciliation maps and informative triples. Algorithms Mol. Biol. 12(1), 23 (2017). https://doi.org/10.1186/s13015-017-0114-z
Hellmuth, M., Stadler, P.F.: The theory of gene family histories. arXiv preprint arXiv:2304.11826 (2023)
Hellmuth, M., Wieseke, N., Lechner, M., Lenhof, H.P., Middendorf, M., Stadler, P.F.: Phylogenomics with paralogs. Proc. Natl. Acad. Sci. U.S.A. 112, 2058–2063 (2015). https://doi.org/10.2307/2412448
Hernandez-Rosales, M., Hellmuth, M., Wieseke, N., Huber, K.T., Moulton, V., Stadler, P.F.: From event-labeled gene trees to species trees. BMC Bioinform. 13(19), S6 (2012). https://doi.org/10.1186/1471-2105-13-S19-S6
Huerta-Cepas, J., Dopazo, H., Dopazo, J., Gabaldón, T.: The human phylome. Genome Biol. 8, R109 (2007)
Kerfeld, C.A., Scott, K.M.: Using BLAST to teach “E-value-tionary” concepts. PLoS Biol. 9(2), e1001014 (2011). https://doi.org/10.1371/journal.pbio.1001014
Klemm, P., Stadler, P.F., Lechner, M.: Proteinortho6: pseudo-reciprocal best alignment heuristic for graph-based detection of (co-) orthologs. Front. Bioinform. 3, 1322477 (2023)
Kristensen, D., Wolf, Y., Mushegian, A., Koonin, E.: Computational methods for gene orthology inference. Brief. Bioinform. 5(12), 399–420 (2019)
Kundu, S., Bansal, M.S.: SaGePhy: an improved phylogenetic simulation framework for gene and subgene evolution. Bioinformatics 35(18), 3496–3498 (2019). https://doi.org/10.1093/bioinformatics/btz081
Le, S.Q., Gascuel, O.: An improved general amino acid replacement matrix. Mol. Biol. Evol. 25, 1307–1320 (2008). https://doi.org/10.1093/molbev/msn067
Lechner, M., Findeiß, S., Steiner, L., Marz, M., Stadler, P.F., Prohaska, S.J.: Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinform. 12(1), 124 (2011). https://doi.org/10.1186/1471-2105-12-124
Python Software Foundation: Python language reference (2023). http://www.python.org
Schaller, D., et al.: Corrigendum to “Best match graphs”. J. Math. Biol. 82(6), 47 (2021). https://doi.org/10.1007/s00285-021-01601-6
Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Best match graphs with binary trees. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds.) AlCoB 2021. LNCS, vol. 12715, pp. 82–93. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-74432-8_6
Schaller, D., Geiß, M., Hellmuth, M., Stadler, P.F.: Heuristic algorithms for best match graph editing. Algorithms Mol. Biol. 16(1), 19 (2021). https://doi.org/10.1186/s13015-021-00196-3
Schaller, D., Geiß, M., Stadler, P.F., Hellmuth, M.: Complete characterization of incorrect orthology assignments in best match graphs. J. Math. Biol. 82(3), 20 (2021). https://doi.org/10.1007/s00285-021-01564-8
Semple, C., Steel, M.: Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications, Oxford University Press, Oxford (2003)
Stadler, P.F., et al.: From pairs of most similar sequences to phylogenetic best matches. Algorithms Mol. Biol. 15(1), 1–20 (2020). https://doi.org/10.1186/s13015-020-00165-2
Wu, B.Y.: Constructing the maximum consensus tree from rooted triples. J. Comb. Optim. 8(1), 29–39 (2004). https://doi.org/10.1023/B:JOCO.0000021936.04215.68
Zhang, C., Mirarab, S.: ASTRAL-Pro 2: ultrafast species tree reconstruction from multi-copy gene family trees. Bioinformatics 38(21), 4949–4950 (2022)
Zmasek, C.M., Eddy, S.R.: A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17(9), 821–828 (2001)
Acknowledgments
This work was supported in part by CINVESTAV-University of California (UC Alianza MX) joint project and by the German Research Foundation (DFG, STA 850/49-1). KAP (CVU:227919) and JARR (CVU:1147711) received financial support from CONAHCyT. We express our gratitude to Marisol Navarro Miranda, Erika Viridiana Cruz Bonilla, and Luis Fernando Flores Lopez for their valuable contributions to the design of the methodology figure for REvolutionH-tl.
Ethics declarations
Disclosure of Interests
Authors have no competing interests to declare that are relevant to the content of this article.
A Appendix
A.1 Notation
A graph \(G=(V(G),E(G))\) consists of two sets: a non-empty set V(G) of objects called nodes, and a set E(G) of edges. Each edge, written \(e=uv\), connects a pair of nodes \(u,v\in V(G)\). An edge is called an arrow when this connection has a direction; in that case, v is an out-neighbor of u. The degree \(\deg _G(v)\) of a node v is the number of connections incident to v, and the out-degree \(\text {deg}^+_G(v)\) is the number of its out-neighbors. Graphs are divided into two main families according to the nature of their connections: those with edges, known as undirected graphs, and those with arrows, known as directed graphs. A subgraph H of G is a graph with \(V(H)\subseteq V(G)\) and \(E(H)\subseteq E(G)\). The subgraph of G induced by \(V' \subseteq V(G)\), denoted \(G[V']\), has node set \(V'\) and contains all edges in E(G) that connect nodes in \(V'\).
In a graph G, a path from node u to node v is a sequence of nodes starting at u and ending at v, with consecutive nodes connected by edges. A graph is termed connected if there is a path linking every pair of its nodes.
A tree T is a connected undirected graph that becomes disconnected by removing any edge. In this context, every tree is rooted, meaning it has a designated root node \(\rho _T\), with the structure visualized such that all other nodes fall hierarchically beneath the root (refer to Fig. 3(A)). The leaves of the tree, L(T), are nodes with zero out-degree. The inner nodes, \(V^0(T)\), are those nodes that are neither leaves nor the root of T.
Although rooted trees are considered undirected, the convention \(uv \in E(T)\) indicates that u is the unique parent of v and v is a child of u, with \(\text {ch}_T(u)\) denoting the set of children of u. Also, u is an ancestor of v, and v a descendant of u, if u lies on the unique path from v to \(\rho _T\). We express this as \(v \preceq _T u\), or more strictly as \(v \prec _T u\) if \(v \ne u\). Nodes \(u, v \in V(T)\) are non-comparable, noted as \(u \parallel _T v\), if neither is an ancestor of the other; they are comparable otherwise. The last common ancestor of a set \(X \subseteq V(T)\), \(\text {lca}_T(X)\), is the node farthest from \(\rho _T\) that is an ancestor of all nodes in X. For individual nodes \(x,y \in V(T)\), we denote \(\text {lca}_T(x,y)\) as their last common ancestor. Furthermore, in [15] the \(\preceq _T\) relation has been extended to the edges of T: for edges \(e_0=uv, e_1=xy\) and a node z, we have \(e_0 \preceq _T e_1\) if \(v \preceq _T y\), \(z \prec _T e_0\) if \(z \preceq _T v\), and \(e_0 \prec _T z\) if \(u \preceq _T z\).
For any node v in V(T), the expression T(v) denotes the subtree rooted at v, encompassing all descendants of v. The restriction \(T_{|L'}\) of T to a leaf subset \(L' \subseteq L(T)\) is its minimal subtree connecting all leaves in \(L'\), excluding degree-two inner nodes. A tree T displays another tree \(T'\) with leaves \(L'\), denoted \(T' \le T\), if \(T'\) arises from contracting inner edges of \(T_{|L'}\). If \(L(T) = L(T')\), T is a refinement of \(T'\). The cluster \(C_T(v)\) includes all leaves in the subtree T(v).
All trees in this paper are phylogenetic, meaning each inner node \(v \in V^0(T)\) has an out-degree \(\text {deg}^+_T(v) > 1\), except the root. In some cases, like in Fig. 3(C), we examine planted trees formed by adding a new node \(0_T\) and edge \(0_T \rho _{T'}\) to a phylogenetic, rooted tree \(T'\).
A triple xy|z is a tree on three leaves x, y and z in which x and y share a closer common ancestor than either does with z. Triples are pivotal for supertree construction [3, 5, 28]. Each tree T has an associated set R(T) of rooted triples that it displays. A triple set R is consistent if \(R \subseteq R(T)\) for some tree T, that is, if some tree displays every triple in R. The BUILD algorithm [1, 28] checks this, returning a supertree for consistent R or reporting inconsistency. It uses the Aho graph \([R, L']\), with node set \(L' = \bigcup _{r\in R} L(r)\) and an edge xy for each triple \(xy|z\in R\). If this graph is disconnected, BUILD recurses on each connected component with the triples restricted to that component; R is consistent exactly when no recursion step on two or more leaves yields a connected graph.
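The recursion behind BUILD can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in REvolutionH-tl: the function name `build`, the encoding of a triple xy|z as the tuple `(x, y, z)`, and the nested-tuple output are assumptions made for this example, and leaf labels are assumed to be sortable strings.

```python
def build(triples, leaves):
    """Return a nested-tuple tree displaying all triples, or None if they are inconsistent.

    triples: iterable of (x, y, z), each encoding the rooted triple xy|z.
    leaves:  iterable of leaf labels (hypothetical encoding: strings).
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only the triples whose three leaves all lie in the current leaf set.
    local = [t for t in triples if set(t) <= leaves]

    # Connected components of the Aho graph, computed with a small union-find.
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y, _ in local:
        parent[find(x)] = find(y)  # edge xy of the Aho graph
    components = {}
    for v in leaves:
        components.setdefault(find(v), set()).add(v)

    if len(components) == 1:
        return None  # connected Aho graph on two or more leaves: inconsistent
    subtrees = []
    for comp in sorted(components.values(), key=sorted):
        sub = build(local, comp)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


consistent = build([("a", "b", "c"), ("c", "d", "a")], {"a", "b", "c", "d"})
conflicting = build([("a", "b", "c"), ("a", "c", "b")], {"a", "b", "c"})
```

In the first call the Aho graph has the components {a, b} and {c, d}, so a tree is returned; in the second call the edges ab and ac connect all three leaves, so the function reports inconsistency by returning `None`.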
A.2 Evolutionary Scenarios
In a species tree S, leaves symbolize extant species, and inner nodes indicate speciations. Conversely, a gene tree \((T,t,\sigma )\) depicts genes as leaves L(T). The function \(\sigma : L(T) \rightarrow L(S)\) maps each gene to its residing species. The function \(t: V(T) \rightarrow \{ \bullet , \square , \odot , \times \}\) classifies nodes in the gene tree based on evolutionary processes: \(t(x)= \bullet \) for speciation, \(t(x)= \square \) for duplication, \(t(x)= \odot \) for extant genes, and \(t(x)= \times \) for gene loss, as detailed in [15] and illustrated in Fig. 3B.
An evolutionary scenario \((S,T,t,\sigma )\) merges a gene tree \((T,t,\sigma )\) with a species tree S via the reconciliation map \(\mu \), as introduced in [11, 15] and exemplified in Fig. 3. Detailed mathematical constraints of such scenarios are elaborated in Appendix A.7.
Constructing an evolutionary scenario requires consistency between gene and species trees, assessed using color triples. Let \(\mathfrak {R}(T)=\{ r\in R(T) \mid t(\text {lca}_T(L(r)))=\bullet \text { and } |\sigma (L(r))|=3 \}\) be the set of speciation triples of the gene tree. Given a triple \(ab|c\in R(T)\), the corresponding color triple is \(\sigma (ab|c)= \sigma (a)\sigma (b)|\sigma (c)\). Finally, let \(\mathfrak {R}_\sigma (T)=\{ \sigma (r) \text { for all } r\in \mathfrak {R}(T) \}\) be the set of color triples of the gene tree. Here, the gene tree \((T,t,\sigma )\) and a species tree S are consistent whenever \(\mathfrak {R}_\sigma (T)\subseteq R(S)\) [12, 15]. Consistency is required to ensure that a reconciliation between \((T,t,\sigma )\) and S exists.
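As an illustration of this consistency test, the following Python sketch enumerates the rooted triples of a tree, keeps the speciation triples whose leaves lie in three distinct species, and checks that their color triples are displayed by the species tree. The tree encoding (a leaf is a string, an inner node is a pair `(event, children)` with event `"S"` for speciation and `"D"` for duplication) and all function names are assumptions made for this example.

```python
from itertools import combinations


def leaves(node):
    """Leaf labels below a node; a node is a string or a pair (event, children)."""
    if isinstance(node, str):
        return [node]
    return [x for child in node[1] for x in leaves(child)]


def rooted_triples(node):
    """Yield (frozenset({x, y}), z, event) for every triple xy|z of the tree,
    where event is the label of lca(x, y, z)."""
    if isinstance(node, str):
        return
    event, children = node
    child_leaves = [leaves(c) for c in children]
    for i, inside in enumerate(child_leaves):
        outside = [z for j, other in enumerate(child_leaves) if j != i for z in other]
        for x, y in combinations(inside, 2):
            for z in outside:
                yield frozenset((x, y)), z, event
    for child in children:
        yield from rooted_triples(child)


def color_triples(gene_tree, sigma):
    """Color triples of speciation triples spanning three distinct species."""
    result = set()
    for xy, z, event in rooted_triples(gene_tree):
        x, y = xy
        if event == "S" and len({sigma[x], sigma[y], sigma[z]}) == 3:
            result.add((frozenset((sigma[x], sigma[y])), sigma[z]))
    return result


def is_consistent(gene_tree, sigma, species_tree):
    """True if every color triple of the gene tree is a triple of the species tree."""
    species = {(xy, z) for xy, z, _ in rooted_triples(species_tree)}
    return color_triples(gene_tree, sigma) <= species


sigma = {"a1": "A", "b1": "B", "c1": "C"}
species_tree = ("S", [("S", ["A", "B"]), "C"])
agreeing = ("S", [("S", ["a1", "b1"]), "c1"])
conflicting = ("S", [("S", ["a1", "c1"]), "b1"])
```

Here `is_consistent(agreeing, sigma, species_tree)` holds, whereas `conflicting` yields the color triple AC|B, which the species tree does not display.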
A.3 Best Match Graphs
The concept of best match graphs (BMGs) [10, 11, 24, 25, 27] rests on the following definition: a gene y is a best match for x if x and y reside in distinct species and \(\text {lca}_T(x,y)\preceq _T \text {lca}_T(x,y')\) for all genes \(y'\) in the species \(\sigma (y)\), i.e., y is one of the genes in \(\sigma (y)\) that is evolutionarily most closely related to x.
The best match graph \(G(T,\sigma )\) is a directed graph that represents these relationships, with an arrow xy indicating that y is a best match of x. The tree \((T,t,\sigma )\) is said to explain \(G(T,\sigma )\).
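The definition can be made concrete with a short Python sketch that derives the arrows of the best match graph from a leaf-colored tree. It relies on the observation that, for genes x and y in different child subtrees of a node v, y is a best match of x exactly when the child subtree containing x holds no gene of species σ(y). The nested-tuple tree encoding and the function names are assumptions made for this example.

```python
def leaves(node):
    """Leaf labels below a node; a node is a string (leaf) or a tuple of children."""
    if isinstance(node, str):
        return [node]
    return [x for child in node for x in leaves(child)]


def best_match_arrows(tree, sigma):
    """Return the set of arrows (x, y) such that y is a best match of x."""
    arrows = set()
    if isinstance(tree, str):
        return arrows
    child_leaves = [leaves(child) for child in tree]
    for i, own in enumerate(child_leaves):
        own_species = {sigma[x] for x in own}
        for j, other in enumerate(child_leaves):
            if i == j:
                continue
            for x in own:
                for y in other:
                    # lca(x, y) is the current node; a gene of species sigma(y)
                    # inside x's own child subtree would be strictly closer.
                    if sigma[y] not in own_species:
                        arrows.add((x, y))
    for child in tree:
        arrows |= best_match_arrows(child, sigma)
    return arrows


sigma = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
tree = (("a1", "b1"), ("b2", "c1"))
arrows = best_match_arrows(tree, sigma)
```

In this example a1 has the best match b1 in species B, but not b2, because lca(a1, b1) lies strictly below lca(a1, b2).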
For any directed graph G and a node-coloring map \(\sigma :V(G)\rightarrow M\), the set of informative triples \(\mathcal {R}(G,\sigma )\) serves to decide whether \((G,\sigma )\) is a BMG. For pairwise distinct nodes \(x,y,y'\in V(G)\) with \(\sigma (x)\ne \sigma (y)=\sigma (y')\), the triple \(xy|y'\) belongs to \(\mathcal {R}(G,\sigma )\) if \(xy\in E(G)\) and \(xy'\notin E(G)\). If the explaining tree T is required to be binary, the triple \(yy'|x\) is added whenever both \(xy, xy'\in E(G)\). \((G,\sigma )\) is a BMG if and only if \((G,\sigma )=G(\text {aho}(\mathcal {R}(G,\sigma )), \sigma )\) [10, 25]. Figure 3AD-G depicts the interplay between gene trees, best match graphs, and informative triples.
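A direct, unoptimized way to enumerate informative triples from a colored digraph is sketched below. The function name, the representation of arrows as a set of pairs, and the encoding of a triple xy|z as `(frozenset({x, y}), z)` are assumptions made for this example; the `binary` flag adds the extra triples yy'|x used when the explaining tree is required to be binary.

```python
def informative_triples(nodes, arrows, sigma, binary=False):
    """Return the informative triples of the colored digraph (nodes, arrows, sigma)."""
    triples = set()
    for x in nodes:
        for y in nodes:
            for y2 in nodes:
                # y and y2 must be distinct genes of one species other than sigma(x).
                if y == y2 or sigma[y] != sigma[y2] or sigma[x] == sigma[y]:
                    continue
                if (x, y) in arrows and (x, y2) not in arrows:
                    triples.add((frozenset((x, y)), y2))   # triple xy|y2
                elif binary and (x, y) in arrows and (x, y2) in arrows:
                    triples.add((frozenset((y, y2)), x))   # triple y y2|x
    return triples


sigma = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
arrows = {("a1", "c1"), ("b1", "c1"), ("b2", "a1"), ("c1", "a1"),
          ("a1", "b1"), ("b1", "a1"), ("b2", "c1"), ("c1", "b2")}
triples = informative_triples(sigma.keys(), arrows, sigma)
```

For this graph the result consists of the two triples a1 b1|b2 and c1 b2|b1. The cubic loop is chosen for readability; it is not meant to reflect how a production tool would enumerate triples.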
A.4 Selection of Best Hits
Each alignment hit \(\overrightarrow{xy}\) is associated with a bit score \(\omega (\overrightarrow{xy})\). We estimate the evolutionary relatedness between two genes x and y as the normalized bit score \(\omega _{xy}= ( \omega (\overrightarrow{xy})/\text {length}(y) + \omega (\overrightarrow{yx})/\text {length}(x) )/2\).
For each gene x, we identify the most closely related genes in each other species \(Y \ne \sigma (x)\). A gene y from species \(\sigma (y) = Y\) is considered a best hit of x if its normalized score \(\omega _{xy}\) meets or exceeds an adaptive threshold \(f \cdot \omega _{x|Y}\), where \(\omega _{x|Y} = \max \{ \omega _{xy'} \mid \sigma (y') = Y \}\) and f is a factor between zero and one. This threshold, aimed at identifying paralogous best hits, was introduced in [22], and we set \(f=0.95\).
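The two steps of this section, score normalization and adaptive thresholding, can be sketched as follows. The dictionary-based inputs and the function names are assumptions made for this example. The text does not state how a pair with a hit in only one direction is treated; the sketch simply skips such pairs, which is likewise an assumption.

```python
def normalized_scores(bit_score, length):
    """Symmetric normalized bit score for every ordered pair with reciprocal hits.

    bit_score: dict mapping (x, y) to the bit score of the hit x -> y.
    length:    dict mapping each gene to its sequence length.
    """
    omega = {}
    for (x, y), forward in bit_score.items():
        backward = bit_score.get((y, x))
        if backward is None:
            continue  # assumption: pairs without a reciprocal hit are ignored
        omega[(x, y)] = (forward / length[y] + backward / length[x]) / 2
    return omega


def best_hits(omega, sigma, f=0.95):
    """Return arrows (x, y) whose score reaches f times the best score of x in sigma(y)."""
    best = {}
    for (x, y), w in omega.items():
        if sigma[x] != sigma[y]:
            key = (x, sigma[y])
            best[key] = max(best.get(key, 0.0), w)
    return {(x, y) for (x, y), w in omega.items()
            if sigma[x] != sigma[y] and w >= f * best[(x, sigma[y])]}


sigma = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}
length = {gene: 100 for gene in sigma}
bit_score = {("a1", "b1"): 200, ("b1", "a1"): 200,
             ("a1", "b2"): 195, ("b2", "a1"): 195,
             ("a1", "b3"): 100, ("b3", "a1"): 100}
hits = best_hits(normalized_scores(bit_score, length), sigma)
```

With these made-up scores, a1 keeps b1 and b2 as best hits in species B because both are within 5% of the top score, while b3 is discarded. The reverse arrow from b3 to a1 is retained, since a1 is the only candidate for b3 in species A.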
A.5 From Best Hits to Gene Trees
We start by constructing a best hit graph \((G,\sigma )\), which is a directed graph whose nodes are the genes of the orthogroup and which contains an arrow xy if y is a best hit of x. Then we proceed to find a least resolved gene tree \((T^*,\sigma )\) that maximizes the similarity of the best hit graph \((G,\sigma )\) and the best match graph \(G(T^*,\sigma )\). To do so, we use the heuristic introduced in [26], which consists of finding the maximum set of consistent informative triples in \(\mathcal {R}(G,\sigma )\).
The tree \((T^*,\sigma )\) is further refined into an augmented tree \((T,\sigma )\), which allows us to assign evolutionary events in such a way that duplication events are minimized while maintaining the same best match graph, that is, \(G(T^*,\sigma )=G(T,\sigma )\) [27].
Now, we create the evolutionary events map \(t:V(T)\rightarrow \{\bullet ,\square ,\odot \}\) as follows. For a node \(v\in V(T)\), we set \(t(v)=\odot \) if v is a leaf. Otherwise, we set \(t(v)=\bullet \) if \(\sigma (C_T(v')) \cap \sigma (C_T(v'')) = \emptyset \) for all pairs of distinct children \(v', v'' \in \text {ch}_T(v)\), and \(t(v)=\square \) otherwise.
Finally, having the event-labeled gene tree \((T,t,\sigma )\), we compute the orthology relation underlying this tree as the relation that comprises all pairs (x, y) and (y, x) of genes x and y for which \(t(\text {lca}_T(x,y))=\bullet \).
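The labeling rule and the resulting orthology relation can be sketched together in Python. The nested-tuple tree encoding, the event names, and the function name are assumptions made for this example. For nodes with more than two children, the sketch labels a node as a speciation only if the species sets of all its children are pairwise disjoint, which is one reading of the rule above.

```python
from itertools import combinations


def label_and_orthologs(node, sigma):
    """Return (leaves, labeled_tree, ortholog_pairs) for the subtree rooted at node.

    node is a gene name (leaf) or a tuple of child nodes. In the labeled tree,
    leaves stay strings and inner nodes become pairs (event, children).
    """
    if isinstance(node, str):
        return [node], node, set()
    results = [label_and_orthologs(child, sigma) for child in node]
    child_leaves = [r[0] for r in results]
    species = [{sigma[x] for x in ls} for ls in child_leaves]
    is_speciation = all(a.isdisjoint(b) for a, b in combinations(species, 2))
    pairs = set().union(*(r[2] for r in results))
    if is_speciation:
        # Genes from different children have this speciation node as their lca.
        for a, b in combinations(child_leaves, 2):
            for x in a:
                for y in b:
                    pairs.add((x, y))
                    pairs.add((y, x))
    event = "speciation" if is_speciation else "duplication"
    labeled = (event, [r[1] for r in results])
    return [x for ls in child_leaves for x in ls], labeled, pairs


sigma = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
tree = (("a1", "b1"), ("a2", "c1"))
_, labeled, orthologs = label_and_orthologs(tree, sigma)
```

In this example both children of the root contain a gene of species A, so the root is labeled as a duplication; a1 and b1 are orthologs, as are a2 and c1, while a1 and a2 are paralogs.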
A.6 Consistency of Triple Sets
To reconcile a gene tree \((T,t,\sigma )\) that is inconsistent with the species tree S, we modify \((T,t,\sigma )\) into a consistent tree \((T',t,\sigma )\). We differentiate between consistent triples \(R_C = \{r \in \mathfrak {R}(T) : \sigma (r) \in R(S)\}\) and inconsistent triples \(R_I = \mathfrak {R}(T) {\setminus } R_C\). The aim is to eliminate the triples in \(R_I\) while retaining those in \(R_C\). Removing a leaf \(a \in L(T)\) also removes all triples \(r \in R(T)\) with \(a \in L(r)\). Utilizing this, we can select a subset of inconsistent leaves \(L_I \subseteq L(R_I)\), set \(L' = L(T) {\setminus } L_I\), and construct a consistent tree \(T' = T_{|L'}\). The steps for this tree editing are outlined in Algorithm 1.
A.7 Tree Reconciliation
Once we ensure consistency between the gene tree \((T,t,\sigma )\) and the species tree S, we compute a reconciliation map as follows.
Let \(x,y\in V(T)\). The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) from the gene tree \((T,t,\sigma )\) to the species tree S satisfies:
The reconciliation map \(\mu : V(T)\rightarrow V(S)\cup E(S)\) is computed in linear time [15]. Given a node \(v\in V(T)\) such that \(t(v)\ne \square \), it is straightforward to determine which element of \(V(S)\cup E(S)\) corresponds to \(\mu (v)\) by looking at constraints U0–U2. When \(t(v)=\square \), we set \(\mu (v)=xy\in E(S)\) such that \(y=\text {lca}_S(\sigma (C_T(v)))\); this assignment minimizes the number of gene-loss events.
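A minimal Python sketch of such a map is given below. It places each duplication on the species-tree edge that enters the last common ancestor of the species found below the node, as described above. For leaves and speciation nodes it assumes the rules customary in the cited literature, namely that a leaf maps to its species and a speciation node maps to the last common ancestor of the species below it; these two rules, the naive (not linear-time) `lca` helper, and the data encodings are assumptions of this example. The species tree is given as a child-to-parent dictionary of a planted tree, and nodes of the gene tree are identified by their leaf clusters.

```python
def lca(parent, nodes):
    """Last common ancestor in a tree given as a child -> parent dict (root maps to None)."""
    def path_to_root(v):
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    nodes = list(nodes)
    common = path_to_root(nodes[0])  # ordered from the node up to the root
    for v in nodes[1:]:
        ancestors = set(path_to_root(v))
        common = [u for u in common if u in ancestors]
    return common[0]


def reconcile(node, sigma, parent, mu=None):
    """Return (cluster, mu); mu maps each leaf cluster to a species node or edge.

    node is a gene name (leaf) or a pair (event, children) with event
    "speciation" or "duplication".
    """
    mu = {} if mu is None else mu
    if isinstance(node, str):
        mu[frozenset([node])] = sigma[node]  # assumed leaf rule
        return [node], mu
    event, children = node
    cluster = []
    for child in children:
        child_cluster, _ = reconcile(child, sigma, parent, mu)
        cluster += child_cluster
    y = lca(parent, {sigma[x] for x in cluster})
    # Speciation: map to the node y (assumed rule). Duplication: map to the edge above y.
    mu[frozenset(cluster)] = y if event == "speciation" else (parent[y], y)
    return cluster, mu


# Planted species tree: 0 -> root -> (AB -> (A, B), C)
parent = {"0": None, "root": "0", "AB": "root", "C": "root", "A": "AB", "B": "AB"}
sigma = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "c1": "C"}
gene_tree = ("duplication", [("speciation", ["a1", "b1"]),
                             ("speciation", [("speciation", ["a2", "b2"]), "c1"])])
_, mu = reconcile(gene_tree, sigma, parent)
```

In this example the duplication at the root of the gene tree is placed on the edge from the planted node `0` to the species-tree root, because the genes below it cover all three species.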
A.8 Resolving Speciation Nodes
To refine a node \(x \in V(T)\) with more than two children via a map \(f: \text {ch}_T(x) \rightarrow \{y_0, y_1\}\), perform: (i) add nodes \(y_0, y_1\) to the tree, (ii) remove edges xy for each \(y \in \text {ch}_T(x)\), and (iii) add edges \(xy_0\), \(xy_1\), and f(y)y for each \(y \in \text {ch}_T(x)\). When u is a speciation node, use the reconciliation map \(\mu \) and map \(f: \text {ch}_T(u) \rightarrow \text {ch}_S(\mu (u))\) to resolve u. For \(v \in \text {ch}_T(u)\) and \(v' \in \text {ch}_S(\mu (u))\), set \(f(v) = v'\) iff \(\mu (v) \preceq v'\).
A.9 Inferring Gene Loss
For a speciation node \(x \in V(T)\) in the gene tree, the reconciliation map \(\mu \) helps detect gene losses by mapping x to a node \(y = \mu (x)\) in the species tree. If we find a node \(y' \in \text {ch}_S(y)\) for which all nodes \(x' \in V(T)\) satisfying \(x' \prec x\) also fulfill \(\mu (x') \parallel _S y'\), a gene loss at \(y'\) is inferred.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ramírez-Rafael, J.A. et al. (2024). REvolutionH-tl: Reconstruction of Evolutionary Histories tool. In: Scornavacca, C., Hernández-Rosales, M. (eds) Comparative Genomics. RECOMB-CG 2024. Lecture Notes in Computer Science, vol 14616. Springer, Cham. https://doi.org/10.1007/978-3-031-58072-7_5
DOI: https://doi.org/10.1007/978-3-031-58072-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-58071-0
Online ISBN: 978-3-031-58072-7
eBook Packages: Mathematics and Statistics (R0)