On Clustering Validation in Metagenomics Sequence Binning

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2019)

Abstract

In clustering, one of the most challenging aspects is validation, whose objective is to evaluate how good a clustering solution is. Sequence binning is a clustering task in metagenomic data analysis: the challenge is essentially to group together sequences belonging to the same genome. As a clustering problem, it requires proper use of validation criteria for the discovered partitions. In sequence binning, precision, recall, and the F-measure index (external validation) are normally used as benchmarks. However, in practice, the (sub)optimal number of clusters is unknown, so these metrics might be biased toward an overestimated “ground truth”. When reference information about genomes is not available, how can the quality of the bins resulting from a clustering solution be evaluated? To answer this question, we empirically study both quantitative (internal indexes) and qualitative (biological soundness) aspects of evaluating clustering solutions for the sequence binning problem. Our experimental study indicates that the number of clusters estimated by binning algorithms does not have much impact on the quality of bins, as measured by the biological soundness of the discovered clusters. Sub-optimal bins (quality greater than 90%) were identified in both rich and poor clustering partitions. Qualitative validation is essential for proper evaluation of a sequence binning solution that generates bins with sub-optimal quality. Internal indexes can only be used in conjunction with qualitative ones, as a trade-off between the number of partitions and the biological soundness of their respective bins.
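To make the external-validation measures mentioned above concrete, the following is a minimal sketch of precision, recall, and F-measure for a binning solution. It assumes one commonly used definition (each bin is credited with its majority genome, each genome with the bin that holds most of its sequences), which is not necessarily the exact formulation used in the paper. The function name `binning_scores` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def binning_scores(bin_labels, genome_labels):
    """Return (precision, recall, f_measure) for one binning solution.

    bin_labels[i] is the bin assigned to sequence i, or None if unbinned.
    genome_labels[i] is the reference genome of sequence i.
    Assumed definition (not taken from the paper): precision counts, per bin,
    the sequences of its majority genome over all binned sequences; recall
    counts, per genome, the sequences in its best bin over all sequences.
    """
    if len(bin_labels) != len(genome_labels):
        raise ValueError("label lists must have the same length")

    per_bin = defaultdict(Counter)      # bin -> genome -> count
    per_genome = defaultdict(Counter)   # genome -> bin -> count
    for b, g in zip(bin_labels, genome_labels):
        if b is None:                   # unbinned sequences only hurt recall
            continue
        per_bin[b][g] += 1
        per_genome[g][b] += 1

    binned = sum(sum(c.values()) for c in per_bin.values())
    total = len(genome_labels)
    if binned == 0 or total == 0:
        return 0.0, 0.0, 0.0

    precision = sum(max(c.values()) for c in per_bin.values()) / binned
    recall = sum(max(c.values()) for c in per_genome.values()) / total
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data: six sequences from two genomes, one sequence left unbinned.
genomes = ["A", "A", "A", "B", "B", "B"]
bins = [0, 0, 1, 1, 1, None]
precision, recall, f_measure = binning_scores(bins, genomes)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Under this definition, placing every sequence in its own bin yields a precision of 1 while recall drops, which illustrates why such scores can be misleading when the true number of clusters is unknown.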

Author information

Corresponding author

Correspondence to Ronnie Alves.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Oliveira, P., Padovani, K., Alves, R. (2020). On Clustering Validation in Metagenomics Sequence Binning. In: Kowada, L., de Oliveira, D. (eds) Advances in Bioinformatics and Computational Biology. BSB 2019. Lecture Notes in Computer Science, vol 11347. Springer, Cham. https://doi.org/10.1007/978-3-030-46417-2_1

  • DOI: https://doi.org/10.1007/978-3-030-46417-2_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46416-5

  • Online ISBN: 978-3-030-46417-2

  • eBook Packages: Computer Science, Computer Science (R0)
