Skip to main content

An Efficient Greedy Incremental Sequence Clustering Algorithm

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 13064))

Abstract

Gene sequence clustering is very basic and important in computational biology and bioinformatics for the study of phylogenetic relationships and gene function prediction, etc. With the rapid growth of the amount of biological data (gene/protein sequences), clustering faces more challenges in low efficiency and precision. For example, there are many redundant sequences in gene databases that do not provide valid information but consume computing resources. Widely used greedy incremental clustering tools improve the efficiency at the cost of precision. To design a balanced gene clustering algorithm, which is both fast and precise, we propose a modified greedy incremental sequence clustering tool, via introducing a pre-filter, a modified short word filter, a new data packing strategy, and GPU accelerates. The experimental evaluations on four independent datasets show that the proposed tool can cluster datasets with precisions of 99.99%. Compared with the results of CD-HIT, Uclust, and Vsearch, the number of redundant sequences by the proposed method is four orders of magnitude less. In addition, on the same hardware platform, our tool is 40% faster than the second-place. The software is available at https://github.com/SIAT-HPCC/gene-sequence-clustering.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Ahmed, N., Lévy, J., Ren, S., Mushtaq, H., Bertels, K., Al-Ars, Z.: Gasal2: a GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinf. 20(1), 1–20 (2019)

    Article  CAS  Google Scholar 

  2. Alser, M., Hassan, H., Kumar, A., Mutlu, O., Alkan, C.: Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 35(21), 4255–4263 (2019)

    Article  CAS  Google Scholar 

  3. Chan, Y., Xu, K., Lan, H., Schmidt, B., Peng, S., Liu, W.: Myphi: efficient levenshtein distance computation on xeon phi based architectures. Current Bioinf. 13(5), 479–486 (2018)

    Article  CAS  Google Scholar 

  4. Edgar, R.C.: Search and clustering orders of magnitude faster than blast. Bioinformatics 26(19), 2460–2461 (2010)

    Article  CAS  Google Scholar 

  5. Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W.: Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23), 3150–3152 (2012)

    Article  CAS  Google Scholar 

  6. Holm, L., Sander, C.: Removing near-neighbour redundancy from large protein sequence collections. Bioinf. (Oxford, England) 14(5), 423–429 (1998)

    Article  CAS  Google Scholar 

  7. James, B.T., Luczak, B.B., Girgis, H.Z.: Meshclust: an intelligent tool for clustering DNA sequences. Nucleic acids Res. 46(14), e83–e83 (2018)

    Article  Google Scholar 

  8. Karim, M.R., et al.: Deep learning-based clustering approaches for bioinformatics. Briefings Bioinf. 22(1), 393–415 (2021)

    Article  Google Scholar 

  9. Li, W., Godzik, A.: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)

    Article  CAS  Google Scholar 

  10. Loving, J., Hernandez, Y., Benson, G.: Bitpal: a bit-parallel, general integer-scoring sequence alignment algorithm. Bioinformatics 30(22), 3166–3173 (2014)

    Article  CAS  Google Scholar 

  11. Rognes, T., Flouri, T., Nichols, B., Quince, C., Mahé, F.: Vsearch: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016)

    Article  Google Scholar 

  12. Steinegger, M., Söding, J.: Clustering huge protein sequence sets in linear time. Nat. Commun. 9(1), 1–8 (2018)

    Article  CAS  Google Scholar 

  13. Wei, D., Jiang, Q., Wei, Y., Wang, S.: A novel hierarchical clustering algorithm for gene sequences. BMC Bioinf. 13(1), 1–15 (2012)

    Article  Google Scholar 

  14. Xin, H., et al.: Shifted hamming distance: a fast and accurate simd-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 31(10), 1553–1560 (2015)

    Article  CAS  Google Scholar 

  15. Zou, Q., Lin, G., Jiang, X., Liu, X., Zeng, X.: Sequence clustering in bioinformatics: an empirical study. Briefings Bioinf. 21(1), 1–10 (2020)

    CAS  Google Scholar 

Download references

Acknowledgment

This work was partly supported by the National Key Research and Development Program of China under Grant No. 2018YFB0204403, Strategic Priority CAS Project XDB38050100, National Science Foundation of China under grant no. U1813203, the Shenzhen Basic Research Fund under grant no. RCYX2020071411473419, KQTD20200820113106007 and JCYJ20180507182818013, CAS Key Lab under grant no. 2011DP173015. We would like to thank Intel for the tech support and resources such as oneAPI DevCloud in this study.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yanjie Wei .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ju, Z. et al. (2021). An Efficient Greedy Incremental Sequence Clustering Algorithm. In: Wei, Y., Li, M., Skums, P., Cai, Z. (eds) Bioinformatics Research and Applications. ISBRA 2021. Lecture Notes in Computer Science(), vol 13064. Springer, Cham. https://doi.org/10.1007/978-3-030-91415-8_50

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-91415-8_50

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91414-1

  • Online ISBN: 978-3-030-91415-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics