
Weighted Chaos Game Representation for Molecular Sequence Classification

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14648)


Abstract

Molecular sequence analysis is a crucial task in bioinformatics with applications in drug discovery and disease diagnosis. However, traditional methods for molecular sequence classification are based on sequence alignment, which can be computationally expensive and inaccurate. Although alignment-free methods exist, they usually do not take full advantage of deep learning (DL) models, since DL models traditionally underperform on tabular data compared to their effectiveness on image-based data. To address this, we propose a novel Chaos Game Representation (CGR)-based approach to classify molecular sequences. We use k-mer-based frequency chaos game representation (FCGR) to generate 2D images for molecular sequences. Additionally, we incorporate scaling features for the sliding windows, including the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the hydrophilicity scale, the flexibility of the characters, and the hydropathy scale, to assign weights to the k-mers. By selecting multiple features, we aim to improve the accuracy of molecular sequence classification models. The motivation for weighting the k-mers is twofold: different k-mers may have different levels of importance or relevance to the classification task at hand, and incorporating additional information, such as hydropathy scales, could improve the accuracy of classification models. The proposed method shows promising results in molecular sequence classification by outperforming the baseline methods and provides a new direction for analyzing sequences using image classification techniques.
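The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the abstract does not specify the geometry or the weighting formula, so the following are assumptions made for illustration only: the 20 amino acids are placed on the vertices of a regular 20-gon, the chaos game moves halfway toward the vertex of each residue, each k-mer window contributes its mean Kyte-Doolittle value shifted by +4.5 (so the weight is non-negative), and the function name `weighted_fcgr` and its parameters are hypothetical.

```python
import math
import numpy as np

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: one vertex per amino acid on the unit circle (regular 20-gon).
AMINO = sorted(KD)
VERTS = {
    a: (math.cos(2 * math.pi * i / len(AMINO)),
        math.sin(2 * math.pi * i / len(AMINO)))
    for i, a in enumerate(AMINO)
}


def weighted_fcgr(seq, k=3, size=32, step=0.5):
    """Return a size x size image of hydropathy-weighted chaos-game points.

    Hypothetical helper: each k-mer window adds its mean KD value (+4.5)
    to the pixel reached by the chaos game at the window's last residue.
    """
    # Drop characters without a KD value (e.g. 'X' or gaps).
    residues = [c for c in seq.upper() if c in KD]
    img = np.zeros((size, size))
    x, y = 0.0, 0.0  # start at the centre of the polygon
    for i, aa in enumerate(residues):
        vx, vy = VERTS[aa]
        # Chaos-game step: move a fraction of the way toward the vertex.
        x += step * (vx - x)
        y += step * (vy - y)
        if i + 1 < k:
            continue  # no full k-mer ends here yet
        kmer = residues[i + 1 - k:i + 1]
        weight = sum(KD[c] for c in kmer) / k + 4.5
        # Map coordinates in [-1, 1] to pixel indices in [0, size - 1].
        col = min(max(int((x + 1) / 2 * size), 0), size - 1)
        row = min(max(int((y + 1) / 2 * size), 0), size - 1)
        img[row, col] += weight
    return img


if __name__ == "__main__":
    image = weighted_fcgr("MKTAYIAKQRQISFVKSHFSRQ", k=3, size=16)
    print(image.shape, float(image.sum()))
```

The resulting array can be fed to any image classifier. Swapping `KD` for another per-residue scale (for example a hydrophilicity scale) changes the weights without touching the rest of the sketch.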

T. Murad and S. Ali—Joint First Authors.



Author information


Corresponding author

Correspondence to Sarwan Ali.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Murad, T., Ali, S., Patterson, M. (2024). Weighted Chaos Game Representation for Molecular Sequence Classification. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14648. Springer, Singapore. https://doi.org/10.1007/978-981-97-2238-9_18


  • DOI: https://doi.org/10.1007/978-981-97-2238-9_18


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2240-2

  • Online ISBN: 978-981-97-2238-9

  • eBook Packages: Computer Science, Computer Science (R0)
