Using alignment-free and pattern mining methods for SARS-CoV-2 genome analysis

Applied Intelligence

Abstract

Examining the genome sequences of the SARS-CoV-2 virus, which causes the respiratory disease known as coronavirus disease 2019 (COVID-19), plays an important role in understanding this virus, its main characteristics, and its functionalities. This paper investigates the use of alignment-free (AF) sequence analysis and sequential pattern mining (SPM) to analyze SARS-CoV-2 genome sequences and to extract interesting information from them. AF methods are used to find (dis)similarity among SARS-CoV-2 genome sequences using various distance measures, to compare the performance of these measures, and to construct phylogenetic trees. SPM algorithms are used to discover frequent amino acid patterns and the relationships between them, and to predict amino acid(s) using various sequence-based prediction models. Finally, an algorithm is proposed to analyze mutations in genome sequences. The algorithm finds the locations of changed amino acid(s) in the genome sequences and computes the mutation rate. The results show that both AF and SPM methods can be used to discover interesting information and patterns in SARS-CoV-2 genome sequences for examining the variations and evolution among strains.
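To make two of the ideas in the abstract concrete, the following minimal Python sketch shows (a) one common alignment-free distance, the Euclidean distance between k-mer count vectors, and (b) a position-by-position comparison that lists changed residues and computes a mutation rate. It is an illustration only and not the paper's Algorithm 1: the function names, the choice of Euclidean distance, the default k = 3, the equal-length assumption, and the toy sequences are all assumptions made here.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=3):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean_distance(a, b, k=3):
    """Alignment-free distance: Euclidean distance between k-mer count vectors."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    # A Counter returns 0 for k-mers that are absent from a sequence.
    return sqrt(sum((pa[w] - pb[w]) ** 2 for w in set(pa) | set(pb)))


def mutation_sites(reference, strain):
    """Return (1-based position, reference residue, strain residue) per mismatch.

    Assumption: both sequences have equal length, so residues can be
    compared position by position without alignment.
    """
    if len(reference) != len(strain):
        raise ValueError("sequences must have equal length")
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference, strain))
            if r != s]


def mutation_rate(reference, strain):
    """Fraction of positions at which the strain differs from the reference."""
    return len(mutation_sites(reference, strain)) / len(reference)


if __name__ == "__main__":
    # Toy amino acid strings, invented for illustration (not real data).
    ref = "MFVFLVLLPLVSSQ"
    var = "MFVFFVLLPLVSSH"
    print(euclidean_distance(ref, var, k=2))
    print(mutation_sites(ref, var))   # [(5, 'L', 'F'), (14, 'Q', 'H')]
    print(mutation_rate(ref, var))    # 2 changed positions out of 14
```

Sequences of different lengths (for example, those with insertions or deletions) would need a different treatment than the equal-length comparison assumed in `mutation_sites`; the k-mer distance has no such restriction.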

Data Availability

The code for Algorithm 1 in Python and the genome sequences used in the experiments are available at: https://github.com/saqibdola/SPM-MA4GSA/tree/master/MAP.

Notes

  1. https://www.ncbi.nlm.nih.gov/sars-cov-2/

Acknowledgements

This work was supported by the Natural Science Foundation of Guangdong Province (2023A1515011667) and the Basic Research Foundations of Shenzhen (JCYJ20210324093609026).

Author information

Corresponding author

Correspondence to Philippe Fournier-Viger.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

About this article

Cite this article

Nawaz, M.S., Fournier-Viger, P., Aslam, M. et al. Using alignment-free and pattern mining methods for SARS-CoV-2 genome analysis. Appl Intell 53, 21920–21943 (2023). https://doi.org/10.1007/s10489-023-04618-0

  • DOI: https://doi.org/10.1007/s10489-023-04618-0
