Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods

  • Conference paper
Pattern Recognition and Image Analysis (IbPRIA 2022)

Abstract

Archaea are single-celled organisms found in practically every habitat, where they serve essential ecosystem functions such as carbon fixation and nitrogen cycling. Classifying these organisms is challenging because most have not been isolated in a laboratory and are known only from gene sequences found in environmental samples. This paper presents an automated classification approach, applicable at any taxonomic level, based on an ensemble method that uses non-comparative features. The methodology avoids the problems of reference-based classification because it classifies sequences from features of the biological sequences themselves rather than by direct comparison with reference genomes. Overall, we obtained high accuracy at different taxonomic levels: for example, 96% in the phylum classification task and 91% in the genus identification task over a pool of 55 different genera. These results show that the proposed methodology is a fast, highly accurate solution for archaea identification and classification, which is particularly relevant given how difficult these organisms are to classify. The method and complete study are freely available, under the GPLv3 license, at https://github.com/jorgeMFS/Archaea2.
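As a rough illustration of what a non-comparative, compression-based feature can look like, the sketch below computes three per-sequence values without consulting any reference genome: sequence length, GC content, and a normalized compression estimate. It is not the authors' implementation. The general-purpose `zlib` compressor is swapped in here for a specialized DNA compressor, and the function names (`gc_content`, `normalized_compression`, `sequence_features`) are hypothetical names chosen for this example.

```python
import math
import zlib


def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T symbols of the sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def normalized_compression(seq: str, alphabet_size: int = 4) -> float:
    """Compressed size in bits divided by len(seq) * log2(alphabet_size).

    Assumption: zlib stands in for a dedicated DNA compressor, so the
    values are only a coarse proxy for sequence complexity.
    """
    if not seq:
        return 0.0
    compressed_bits = 8 * len(zlib.compress(seq.upper().encode("ascii"), 9))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


def sequence_features(seq: str) -> list[float]:
    """Feature vector computed from the sequence alone (no reference)."""
    return [float(len(seq)), gc_content(seq), normalized_compression(seq)]


if __name__ == "__main__":
    # A highly repetitive toy sequence compresses far better than a varied one.
    print(sequence_features("ACGT" * 500))
```

Feature vectors of this kind could then be passed to any standard classifier or ensemble. Lower normalized compression values indicate more repetitive, more compressible sequences.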

Author information

Corresponding author

Correspondence to Jorge Miguel Silva.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Silva, J.M., Pratas, D., Caetano, T., Matos, S. (2022). Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_25

  • DOI: https://doi.org/10.1007/978-3-031-04881-4_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04880-7

  • Online ISBN: 978-3-031-04881-4

  • eBook Packages: Computer Science (R0)
