An Efficient Greedy Incremental Sequence Clustering Algorithm

Ju, Zhen; Zhang, Huiling; Meng, Jingtao; Zhang, Jingjing; Li, Xuelei; Fan, Jianping; Pan, Yi; Liu, Weiguo; Wei, Yanjie

doi:10.1007/978-3-030-91415-8_50

Conference paper
First Online: 18 November 2021

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13064)

Abstract

Gene sequence clustering is a fundamental task in computational biology and bioinformatics, supporting applications such as the study of phylogenetic relationships and gene function prediction. With the rapid growth of biological data (gene and protein sequences), clustering faces growing challenges in both efficiency and precision. For example, gene databases contain many redundant sequences that provide no useful information but consume computing resources. Widely used greedy incremental clustering tools improve efficiency at the cost of precision. To design a balanced gene clustering algorithm that is both fast and precise, we propose a modified greedy incremental sequence clustering tool that introduces a pre-filter, a modified short word filter, a new data packing strategy, and GPU acceleration. Experimental evaluations on four independent datasets show that the proposed tool clusters the datasets with a precision of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the number of redundant sequences left by the proposed method is four orders of magnitude lower. In addition, on the same hardware platform, our tool is 40% faster than the second-fastest tool. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.
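
To make the general idea concrete, the following is a minimal Python sketch of generic greedy incremental clustering with a short word (k-mer) filter, in the spirit of CD-HIT-style tools. It is not the authors' implementation: the pre-filter, the data packing strategy, and the GPU acceleration named in the abstract are omitted, and all function names (`levenshtein`, `kmers`, `passes_word_filter`, `greedy_cluster`) are hypothetical. As a simplifying assumption, two sequences count as similar when their global edit distance is at most (1 - threshold) times the representative's length; real tools typically use alignment-based identity instead.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def kmers(seq: str, k: int) -> set:
    """Set of all length-k words in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_word_filter(query: str, rep_kmers: set, max_edits: int, k: int) -> bool:
    """Cheap test that can only reject pairs the full comparison would reject.

    Each edit can disturb at most k of the query's k-mer positions, so a pair
    within max_edits must share at least positions - k * max_edits words.
    """
    positions = len(query) - k + 1
    required = positions - k * max_edits
    if required <= 0:
        return True  # bound is vacuous, fall through to the full comparison
    shared = sum(1 for i in range(positions) if query[i:i + k] in rep_kmers)
    return shared >= required


def greedy_cluster(seqs, threshold=0.9, k=4):
    """Map each representative's index to the indices of its cluster members."""
    # Longest sequences first, so representatives are never shorter than members.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    reps = []      # (representative index, its k-mer set)
    clusters = {}
    for i in order:
        for r, rep_kmers in reps:
            # Assumed similarity rule; the epsilon guards against float rounding.
            max_edits = int((1.0 - threshold) * len(seqs[r]) + 1e-9)
            if not passes_word_filter(seqs[i], rep_kmers, max_edits, k):
                continue  # filtered out without running the expensive comparison
            if levenshtein(seqs[i], seqs[r]) <= max_edits:
                clusters[r].append(i)
                break
        else:
            # No representative was close enough: start a new cluster.
            reps.append((i, kmers(seqs[i], k)))
            clusters[i] = [i]
    return clusters


if __name__ == "__main__":
    demo = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA",
            "TTTTGGGGCCCCAAAATTTT", "ACGTACGTAC"]
    print(greedy_cluster(demo, threshold=0.9, k=4))
```

The filter is the part that saves time: counting shared words is linear in the sequence length, while the full comparison is quadratic, so most dissimilar pairs are discarded cheaply. Because the same `max_edits` value drives both the filter and the final check, the filter in this sketch never rejects a pair that the full comparison would accept.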

References

Ahmed, N., Lévy, J., Ren, S., Mushtaq, H., Bertels, K., Al-Ars, Z.: Gasal2: a GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinf. 20(1), 1–20 (2019)
Alser, M., Hassan, H., Kumar, A., Mutlu, O., Alkan, C.: Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 35(21), 4255–4263 (2019)
Chan, Y., Xu, K., Lan, H., Schmidt, B., Peng, S., Liu, W.: Myphi: efficient levenshtein distance computation on xeon phi based architectures. Current Bioinf. 13(5), 479–486 (2018)
Edgar, R.C.: Search and clustering orders of magnitude faster than blast. Bioinformatics 26(19), 2460–2461 (2010)
Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)
Holm, L., Sander, C.: Removing near-neighbour redundancy from large protein sequence collections. Bioinf. (Oxford, England) 14(5), 423–429 (1998)
James, B.T., Luczak, B.B., Girgis, H.Z.: Meshclust: an intelligent tool for clustering DNA sequences. Nucleic acids Res. 46(14), e83–e83 (2018)
Karim, M.R., et al.: Deep learning-based clustering approaches for bioinformatics. Briefings Bioinf. 22(1), 393–415 (2021)
Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Loving, J., Hernandez, Y., Benson, G.: Bitpal: a bit-parallel, general integer-scoring sequence alignment algorithm. Bioinformatics 30(22), 3166–3173 (2014)
Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F.: Vsearch: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016)
Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 1–8 (2018)
Wei, D., Jiang, Q., Wei, Y., Wang, S.: A novel hierarchical clustering algorithm for gene sequences. BMC Bioinf. 13(1), 1–15 (2012)
Xin, H., et al.: Shifted hamming distance: a fast and accurate simd-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 31(10), 1553–1560 (2015)
Zou, Q., Lin, G., Jiang, X., Liu, X., Zeng, X.: Sequence clustering in bioinformatics: an empirical study. Briefings Bioinf. 21(1), 1–10 (2020)

Acknowledgment

This work was partly supported by the National Key Research and Development Program of China under grant no. 2018YFB0204403, Strategic Priority CAS Project XDB38050100, National Science Foundation of China under grant no. U1813203, the Shenzhen Basic Research Fund under grant nos. RCYX2020071411473419, KQTD20200820113106007 and JCYJ20180507182818013, and the CAS Key Lab under grant no. 2011DP173015. We thank Intel for technical support and for resources such as the oneAPI DevCloud used in this study.

Author information

Authors and Affiliations

University of Chinese Academy of Sciences, Beijing, 100049, China
Zhen Ju, Huiling Zhang & Jingjing Zhang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518005, China
Zhen Ju, Huiling Zhang, Jingtao Meng, Jingjing Zhang, Xuelei Li, Jianping Fan, Yi Pan & Yanjie Wei
Shandong University, Jinan, 250100, China
Weiguo Liu

Corresponding author

Correspondence to Yanjie Wei.

Editor information

Editors and Affiliations

Shenzhen Institutes of Advanced Technology, Shenzhen, China
Yanjie Wei
Central South University, Changsha, China
Min Li
Georgia State University, Atlanta, GA, USA
Pavel Skums
Georgia State University, Atlanta, GA, USA
Zhipeng Cai

About this paper

Cite this paper

Ju, Z. et al. (2021). An Efficient Greedy Incremental Sequence Clustering Algorithm. In: Wei, Y., Li, M., Skums, P., Cai, Z. (eds) Bioinformatics Research and Applications. ISBRA 2021. Lecture Notes in Computer Science, vol 13064. Springer, Cham. https://doi.org/10.1007/978-3-030-91415-8_50

DOI: https://doi.org/10.1007/978-3-030-91415-8_50
Published: 18 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91414-1
Online ISBN: 978-3-030-91415-8
eBook Packages: Computer Science, Computer Science (R0)
