Weighted Chaos Game Representation for Molecular Sequence Classification

Murad, Taslim; Ali, Sarwan; Patterson, Murray

doi:10.1007/978-981-97-2238-9_18


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14648)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Molecular sequence analysis is a crucial task in bioinformatics, with applications in drug discovery and disease diagnosis. However, traditional methods for molecular sequence classification rely on sequence alignment, which can be computationally expensive and can lack accuracy. Although alignment-free methods exist, they usually do not take full advantage of deep learning (DL) models, since DL models have traditionally underperformed on tabular data compared to their effectiveness on image data. To address this, we propose a novel approach that classifies molecular sequences using Chaos Game Representation (CGR). We use k-mer-based frequency chaos game representation (FCGR) to generate 2D images of molecular sequences. Additionally, we incorporate scaling features for the sliding windows, including the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, the Hydrophilicity scale, the Flexibility of the characters, and the Hydropathy scale, to assign weights to the k-mers. By selecting multiple features, we aim to improve the accuracy of molecular sequence classification models. Two observations motivate weighting the k-mers: different k-mers may have different levels of importance or relevance to the classification task at hand, and additional information, such as hydropathy scales, could improve the accuracy of classification models. The proposed method shows promising results in molecular sequence classification, outperforming the baseline methods, and provides a new direction for analyzing sequences with image classification techniques.
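
To make the idea of a weighted FCGR image concrete, the following Python sketch maps each k-mer of a protein sequence to a pixel and adds a hydropathy-derived weight to that pixel instead of a plain count. It is an illustration only, not the authors' implementation: the abstract does not specify the construction, so the 20-vertex circular layout, the contraction ratio of 0.5, the image size, the use of the mean Kyte and Doolittle value as the k-mer weight, the positive shift applied to that mean, and the function name `weighted_fcgr` are all assumptions. With these settings distinct k-mers can fall into the same pixel.

```python
import math

# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumption: one vertex per amino acid, evenly spaced on the unit circle,
# in alphabetical order of the one-letter codes.
AMINO_ACIDS = sorted(KD)
VERTICES = {
    aa: (math.cos(2 * math.pi * i / len(AMINO_ACIDS)),
         math.sin(2 * math.pi * i / len(AMINO_ACIDS)))
    for i, aa in enumerate(AMINO_ACIDS)
}


def weighted_fcgr(seq, k=3, size=32, ratio=0.5, scale=KD):
    """Return a size x size grid (list of lists) of accumulated k-mer weights."""
    image = [[0.0] * size for _ in range(size)]
    lowest = min(scale.values())
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(aa not in scale for aa in kmer):
            continue  # skip windows containing unknown or ambiguous characters
        # Chaos game restricted to this k-mer: start at the centre and move
        # a fraction `ratio` of the way toward each residue's vertex in turn.
        x, y = 0.0, 0.0
        for aa in kmer:
            vx, vy = VERTICES[aa]
            x += ratio * (vx - x)
            y += ratio * (vy - y)
        # Assumed weighting: mean scale value of the k-mer, shifted so that
        # every weight is at least 1 and pixels never receive negative mass.
        weight = sum(scale[aa] for aa in kmer) / k - lowest + 1.0
        # Map coordinates in [-1, 1] to pixel indices in [0, size - 1].
        col = min(int((x + 1) / 2 * size), size - 1)
        row = min(int((y + 1) / 2 * size), size - 1)
        image[row][col] += weight
    return image


if __name__ == "__main__":
    img = weighted_fcgr("MKVLAAGIVLLA", k=3, size=16)
    print(sum(1 for r in img for v in r if v > 0), "non-empty pixels")
```

Swapping `scale` for another per-residue table (for example, a hydrophilicity scale) yields a different weighted image of the same sequence, and several such images could be stacked as channels for an image classifier; that combination strategy is likewise an assumption rather than something stated in the abstract.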

T. Murad and S. Ali—Joint First Authors.

Author information

Authors and Affiliations

Georgia State University, Atlanta, GA, USA
Taslim Murad, Sarwan Ali & Murray Patterson

Corresponding author

Correspondence to Sarwan Ali.

Editor information

Editors and Affiliations

Academia Sinica, Taipei, Taiwan
De-Nian Yang
Microsoft Research Asia, Beijing, China
Xing Xie
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Vincent S. Tseng
Duke University, Durham, NC, USA
Jian Pei
National Cheng Kung University, Tainan, Taiwan
Jen-Wei Huang
Silesian University of Technology, Gliwice, Poland
Jerry Chun-Wei Lin

About this paper

Cite this paper

Murad, T., Ali, S., Patterson, M. (2024). Weighted Chaos Game Representation for Molecular Sequence Classification. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14648. Springer, Singapore. https://doi.org/10.1007/978-981-97-2238-9_18

DOI: https://doi.org/10.1007/978-981-97-2238-9_18
Published: 01 May 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2240-2
Online ISBN: 978-981-97-2238-9
eBook Packages: Computer Science, Computer Science (R0)
