Deep Clustering for Metagenomics

Bonet, Isis; Pena, Alejandro; Lochmuller, Christian; Patino, Alejandro; Gongora, Mario

doi:10.1007/978-3-030-63061-4_29

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12313)

Included in the following conference series:

International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

Abstract

Metagenomics, supported by modern next-generation sequencing technology, investigates microorganisms obtained directly from environmental samples, without the need to isolate them. This type of sequencing yields a large number of DNA fragments from different organisms, so the challenge is to identify groups of DNA sequences that belong to the same organism. Although large databases of species sequences are available, supervised methods for this problem are limited by the small number of known species and by the computational time required to analyse segments against species sequences. To overcome these problems, a binning process can be used for the reconstruction and identification of a set of metagenomic fragments. Binning serves as a pre-processing step that joins fragments into groups at the same taxonomic level. In this work, we propose a clustering model whose feature extraction process uses an autoencoder neural network. Clustering is performed with k-means, starting from a value of k large enough to obtain very pure clusters. The number of clusters is then reduced through a process that combines various distance functions. The results show that the proposed method outperforms plain k-means and classical feature extraction methods such as PCA, reaching a purity of 90%.
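
The two ideas the abstract leans on, cluster purity and the reduction of a deliberately over-clustered partition, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `purity` and `merge_closest_clusters`, the toy centroids, and the use of a single Euclidean distance are assumptions made for illustration, whereas the paper combines several distance functions that the abstract does not specify.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def purity(cluster_ids, true_labels):
    """Fraction of items that carry the majority true label of their cluster."""
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in counts_by_cluster.values())
    return majority_total / len(true_labels)


def merge_closest_clusters(centroids, sizes, target_k):
    """Greedily merge the two closest centroids until target_k groups remain.

    Hypothetical stand-in for the paper's reduction step: only Euclidean
    distance is used here. Returns {original cluster index: merged index}.
    """
    groups = [{"center": list(c), "size": s, "members": [i]}
              for i, (c, s) in enumerate(zip(centroids, sizes))]
    while len(groups) > target_k:
        # Find the pair of groups (i < j) whose centers are closest.
        i, j = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda p: dist(groups[p[0]]["center"], groups[p[1]]["center"]),
        )
        gi, gj = groups[i], groups.pop(j)
        total = gi["size"] + gj["size"]
        # The merged center is the size-weighted mean of the two centers.
        gi["center"] = [(x * gi["size"] + y * gj["size"]) / total
                        for x, y in zip(gi["center"], gj["center"])]
        gi["size"] = total
        gi["members"] += gj["members"]
    return {old: new for new, g in enumerate(groups) for old in g["members"]}


# Toy data: 10 fragments, 4 fine-grained clusters, 2 true organisms.
truth = ["A", "A", "A", "B", "B", "B", "B", "B", "B", "B"]
fine = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
centroids = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
sizes = [3, 2, 4, 1]

mapping = merge_closest_clusters(centroids, sizes, target_k=2)
merged = [mapping[c] for c in fine]
print(purity(fine, truth), purity(merged, truth))
```

In this toy case the four fine-grained clusters are perfectly pure, while merging down to two clusters lowers purity because one small cluster of organism B lies close to the cluster of organism A. It shows the trade-off that any reduction step has to manage: fewer clusters at the possible cost of purity.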

Author information

Authors and Affiliations

EIA University, km 2 + 200 Vía al Aeropuerto José María Córdova, Antioquia, Envigado, Colombia
Isis Bonet, Alejandro Pena, Christian Lochmuller & Alejandro Patino
De Montfort University, The Gateway, Leicester, LE1 9BH, UK
Mario Gongora

Corresponding author

Correspondence to Isis Bonet.

Editor information

Editors and Affiliations

University of Bergamo, Bergamo, Italy
Paolo Cazzaniga
University of Milano-Bicocca, Milan, Italy
Daniela Besozzi
National Research Council, Segrate, Italy
Ivan Merelli
Università degli Studi di Trieste, Trieste, Italy
Luca Manzoni

Cite this paper

Bonet, I., Pena, A., Lochmuller, C., Patino, A., Gongora, M. (2020). Deep Clustering for Metagenomics. In: Cazzaniga, P., Besozzi, D., Merelli, I., Manzoni, L. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2019. Lecture Notes in Computer Science, vol 12313. Springer, Cham. https://doi.org/10.1007/978-3-030-63061-4_29

DOI: https://doi.org/10.1007/978-3-030-63061-4_29
Published: 10 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63060-7
Online ISBN: 978-3-030-63061-4
eBook Packages: Computer Science, Computer Science (R0)
