NJMerge: A Generic Technique for Scaling Phylogeny Estimation Methods and Its Application to Species Trees

  • Conference paper
Comparative Genomics (RECOMB-CG 2018)

Abstract

Divide-and-conquer methods, which divide the species set into overlapping subsets, construct trees on the subsets, and then combine the trees using a supertree method, provide a key algorithmic framework for boosting the scalability of phylogeny estimation methods to large datasets. Yet the use of supertree methods, which typically attempt to solve NP-hard optimization problems, limits the scalability of these approaches. In this paper, we present a new divide-and-conquer approach that does not require supertree estimation: we divide the species set into disjoint subsets, construct trees on the subsets, and then combine the trees using a distance matrix computed on the full species set. For this merger step, we present a new method, called NJMerge, which is a polynomial-time extension of the Neighbor Joining algorithm. We report on the results of an extensive simulation study evaluating NJMerge’s utility in scaling three popular species tree estimation methods: ASTRAL, SVDquartets, and concatenation analysis using RAxML. We find that NJMerge provides substantial improvements in running time without sacrificing accuracy and sometimes even improves accuracy. Furthermore, although NJMerge can sometimes fail to return a tree, the failure rate in our experiments is less than 1%. Together, these results suggest that NJMerge is a valuable technique for scaling computationally intensive methods to larger datasets, especially when computational resources are limited. NJMerge is freely available on GitHub: https://github.com/ekmolloy/njmerge. All datasets, scripts, and supplementary materials are freely available through the Illinois Data Bank: https://doi.org/10.13012/B2IDB-1424746_V1.
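
Since NJMerge is described as an extension of Neighbor Joining, a minimal sketch of the classic Neighbor Joining agglomeration may help readers unfamiliar with the base algorithm. This is not NJMerge: the handling of the subset trees that NJMerge adds is not shown, and the function name `neighbor_joining_splits` and the example matrix are illustrative assumptions, not taken from the paper or its repository.

```python
from itertools import combinations


def neighbor_joining_splits(labels, dist):
    """Classic Neighbor Joining (not NJMerge); returns the non-trivial splits.

    labels: list of taxon names; dist: symmetric distance matrix (list of lists).
    Each split is reported as the frozenset of taxa on the side that does not
    contain labels[0], so the result does not depend on how ties are broken.
    """
    all_taxa = frozenset(labels)
    # Each active node is identified by the set of leaves it contains.
    nodes = [frozenset([name]) for name in labels]
    d = {}
    for i, j in combinations(range(len(labels)), 2):
        d[frozenset((nodes[i], nodes[j]))] = float(dist[i][j])

    def get(a, b):
        return d[frozenset((a, b))]

    splits = set()
    while len(nodes) > 3:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Join the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(p[0], p[1]) - r[p[0]] - r[p[1]],
        )
        u = a | b
        rest = [c for c in nodes if c not in (a, b)]
        # Distance from the new internal node to every remaining node.
        for c in rest:
            d[frozenset((u, c))] = 0.5 * (get(a, c) + get(b, c) - get(a, b))
        nodes = rest + [u]
        splits.add(u if labels[0] not in u else all_taxa - u)
    return splits


# Illustrative five-taxon distance matrix (an assumption, not data from the paper).
labels = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
splits = neighbor_joining_splits(labels, dist)
```

On this matrix the two recovered splits separate {a, b} from {c, d, e} and {d, e} from {a, b, c}, which corresponds to the unrooted tree ((a,b),c,(d,e)). Branch lengths are omitted to keep the sketch short.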

Acknowledgments

The authors wish to thank the anonymous reviewers, whose feedback led to improvements in the paper.

Funding

This work was supported by the National Science Foundation (award CCF-1535977) to TW. EKM was supported by the NSF Graduate Research Fellowship (award DGE-1144245) and the Ira and Debra Cohen Graduate Fellowship in Computer Science. Computational experiments were performed on Blue Waters, which is supported by the NSF (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

Author information

Corresponding author

Correspondence to Tandy Warnow.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Molloy, E.K., Warnow, T. (2018). NJMerge: A Generic Technique for Scaling Phylogeny Estimation Methods and Its Application to Species Trees. In: Blanchette, M., Ouangraoua, A. (eds) Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science, vol 11183. Springer, Cham. https://doi.org/10.1007/978-3-030-00834-5_15

  • DOI: https://doi.org/10.1007/978-3-030-00834-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00833-8

  • Online ISBN: 978-3-030-00834-5

  • eBook Packages: Computer Science (R0)
