Abstract
Phylogenetic placement, the problem of placing a sequence into a precomputed phylogenetic “backbone” tree, is useful for constructing large trees, performing taxon identification of newly obtained sequences, and other applications. The most accurate current method, pplacer, performs the placement using maximum likelihood but fails frequently on backbone trees with 5000 sequences. We present a simple technique, pplacer-XR (pplacer-eXtra Range), that extends pplacer to large datasets. Using challenging large datasets, we show that pplacer-XR provides the accuracy of pplacer together with the scalability to ultra-large datasets of APPLES, a leading fast phylogenetic placement method. pplacer-XR is available in open source form on GitHub.
Y. Cai and E. Wedell contributed equally to this work.
References
Balaban, M., Roush, D., Zhu, Q., Mirarab, S.: APPLES-2: faster and more accurate distance-based phylogenetic placement using divide and conquer. bioRxiv (2021). https://doi.org/10.1101/2021.02.14.431150
Balaban, M., Sarmashghi, S., Mirarab, S.: APPLES: scalable distance-based phylogenetic placement with or without alignments. Syst. Biol. 69(3), 566–578 (2020)
Barbera, P., et al.: EPA-NG: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68(2), 365–369 (2019)
Berger, S.A., Krompass, D., Stamatakis, A.: Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol. 60(3), 291–302 (2011)
Bik, H.M., Porazinska, D.L., Creer, S., Caporaso, J.G., Knight, R., Thomas, W.K.: Sequencing our way towards understanding global eukaryotic biodiversity. Trends Ecol. Evol. 27(4), 233–243 (2012)
Chaumeil, P.A., Mussig, A.J., Hugenholtz, P., Parks, D.H.: GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics 36(6), 1925–1927 (2020)
Conlan, S., Kong, H.H., Segre, J.A.: Species-level analysis of DNA sequence data from the NIH Human Microbiome Project. PLoS ONE 7(10), e47075 (2012)
Matsen, F.A., Kodner, R.B., Armbrust, E.V.: pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinform. 11(1), 538 (2010)
McCoy, C.O., Matsen IV, F.A.: Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth. PeerJ 1, e157 (2013)
Mirarab, S., Nguyen, N., Guo, S., Wang, L.S., Kim, J., Warnow, T.: PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22(5), 377–386 (2015)
Mirarab, S., Nguyen, N., Warnow, T.: SEPP: SATé-enabled phylogenetic placement. In: Biocomputing 2012, pp. 247–258. World Scientific (2012)
Nguyen, N.P., Mirarab, S., Liu, B., Pop, M., Warnow, T.: TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics 30(24), 3548–3555 (2014)
Price, M.N., Dehal, P.S., Arkin, A.P.: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE 5(3), e9490 (2010)
Shah, N., Molloy, E.K., Pop, M., Warnow, T.: TIPP2: metagenomic taxonomic profiling using phylogenetic markers. Bioinformatics (2021)
Stamatakis, A.: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21), 2688–2690 (2006)
Tavaré, S.: Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17(2), 57–86 (1986)
Yang, Z.: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39(3), 306–314 (1994)
Acknowledgments
The research presented here is the result of a course project by EW and YC for the Spring 2020 course CS 581: Algorithmic Genomic Biology, at the University of Illinois, taught by TW. This work was supported in part by the National Science Foundation grant ABI-1458652 to TW.
Appendix
Commands to create backbone trees: All placement methods use the same backbone tree topologies, but the branch lengths differ by method (following the protocol in [2]). We downloaded the backbone trees, with branch lengths optimized for each phylogenetic placement method, from the APPLES repository.
The backbone trees for replicates 1 and 2 of the 100,000-leaf backbone condition and replicate 0 of the 200,000-leaf backbone condition, each of which contains more than two identical sequences, had polytomies. We randomly resolved these polytomies so that RAxML could be run on the trees.
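The idea of randomly resolving a polytomy can be sketched as follows. This is an illustration only and not necessarily the procedure or tool used in this study: it assumes a rooted tree stored as nested Python lists with leaf names as strings, ignores branch lengths, and does not handle nodes with a single child. The helper names (`resolve_polytomies`, `to_newick`, `leaves`, `is_binary`) and the toy tree are hypothetical.

```python
import random


def resolve_polytomies(node, rng):
    """Return a binary copy of `node` (nested lists; leaves are strings)."""
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child, rng) for child in node]
    # While the node has more than two children, join two randomly chosen
    # children under a new internal node.
    while len(children) > 2:
        i, j = rng.sample(range(len(children)), 2)
        merged = [children[i], children[j]]
        children = [c for k, c in enumerate(children) if k not in (i, j)]
        children.append(merged)
    return children


def to_newick(node):
    """Write the topology (no branch lengths) as a Newick string body."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def leaves(node):
    """List the leaf names below `node`."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def is_binary(node):
    """True if every internal node has exactly two children."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)


# Toy tree (hypothetical): one four-way polytomy and a three-way split at the root.
tree = [["a", "b", "c", "d"], "e", ["f", "g"]]
resolved = resolve_polytomies(tree, random.Random(0))  # arbitrary seed
print(to_newick(resolved) + ";")
```

In a tree with branch lengths, the internal branches created this way would usually be given zero or near-zero length, since the data carry no signal for them. Note also that this sketch resolves the root to two children, whereas an unrooted tree would normally keep a trifurcation at the root.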
Random tree refinement: raxmlHPC-PTHREADS -f e -t res_true.fasttree -m GTRGAMMA -s aln_dna.phy -n REF -p 1984 -T 16
APPLES command: run_apples.py -t backbone.tree -s ref.fa -q query.fa -T 16 -o apples.jplace
EPA-ng command: epa-ng --ref-msa ref.fa --tree backbone.tree --query query.fa --outdir $query --model RAxML_info.REF8 --redo -T 16
pplacer-XR command: python3 pplacer-XR.py GTR RAxML_info.REF backbone.tree output_dir aln.fa query.txt 2000
GitHub site: https://github.com/chry04/pplacer_plusplus
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wedell, E., Cai, Y., Warnow, T. (2021). Scalable and Accurate Phylogenetic Placement Using pplacer-XR. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. https://doi.org/10.1007/978-3-030-74432-8_7
DOI: https://doi.org/10.1007/978-3-030-74432-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74431-1
Online ISBN: 978-3-030-74432-8
eBook Packages: Computer Science, Computer Science (R0)