Abstract
Molecular sequence analysis is a crucial task in bioinformatics, with applications in drug discovery and disease diagnosis. Traditional methods for molecular sequence classification rely on sequence alignment, which can be computationally expensive and can lack accuracy. Alignment-free methods exist, but they usually do not take full advantage of deep learning (DL) models, because DL models have traditionally underperformed on tabular data relative to their effectiveness on image data. To address this, we propose a novel Chaos Game Representation (CGR)-based approach to classifying molecular sequences. We use k-mer-based frequency chaos game representation (FCGR) to generate 2D images of molecular sequences. In addition, we assign weights to the k-mers using scaling features computed over sliding windows: the Kyte and Doolittle (KD) hydropathy scale, the Eisenberg hydrophobicity scale, a hydrophilicity scale, the flexibility of the characters, and a hydropathy scale. Weighting the k-mers is motivated by two observations: different k-mers may differ in their importance or relevance to the classification task at hand, and additional information such as hydropathy scales could improve the accuracy of classification models. By combining multiple features, we aim to improve the accuracy of molecular sequence classification models. The proposed method shows promising results, outperforming the baseline methods, and provides a new direction for analyzing sequences using image classification techniques.
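To make the idea of a weighted, image-like representation concrete, the following is a minimal sketch and not the authors' implementation, since the abstract does not specify the construction. It assumes a 20-vertex polygon CGR for amino acids, a contraction ratio of 0.5, a weight equal to the mean Kyte and Doolittle value of each k-mer rescaled to [0, 1], and simple pixel binning. The function names (`polygon_vertices`, `kmer_weight`, `weighted_cgr_image`) and all parameters are hypothetical choices made for illustration.

```python
import math

# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
ALPHABET = sorted(KD)  # fixed ordering of the 20 amino acids


def polygon_vertices(alphabet):
    """Place one vertex per character on the unit circle (assumed layout)."""
    n = len(alphabet)
    return {
        aa: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, aa in enumerate(alphabet)
    }


def kmer_weight(kmer):
    """Mean KD value of the k-mer, rescaled from [-4.5, 4.5] to [0, 1].

    The rescaling is an illustrative assumption, not taken from the paper.
    """
    mean_kd = sum(KD[aa] for aa in kmer) / len(kmer)
    return (mean_kd + 4.5) / 9.0


def weighted_cgr_image(seq, k=3, size=32, ratio=0.5):
    """Accumulate weighted k-mer positions into a size x size grid."""
    verts = polygon_vertices(ALPHABET)
    image = [[0.0] * size for _ in range(size)]
    for start in range(len(seq) - k + 1):  # sliding window over the sequence
        kmer = seq[start:start + k]
        if any(aa not in KD for aa in kmer):
            continue  # skip windows containing non-standard characters
        x = y = 0.0  # each k-mer starts from the polygon centre
        for aa in kmer:
            vx, vy = verts[aa]
            x += ratio * (vx - x)  # move part of the way toward the vertex
            y += ratio * (vy - y)
        # Map coordinates in [-1, 1] to pixel indices in [0, size - 1].
        col = min(int((x + 1) / 2 * size), size - 1)
        row = min(int((y + 1) / 2 * size), size - 1)
        image[row][col] += kmer_weight(kmer)
    return image
```

Calling `weighted_cgr_image(sequence, k=3, size=32)` returns a nested list that could be converted to an array and passed to an image classifier. With 20 vertices and a ratio of 0.5, distinct k-mers can land in the same pixel, so a real pipeline would need to choose the polygon layout, ratio, and resolution with more care than this sketch does.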
T. Murad and S. Ali—Joint First Authors.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Murad, T., Ali, S., Patterson, M. (2024). Weighted Chaos Game Representation for Molecular Sequence Classification. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14648. Springer, Singapore. https://doi.org/10.1007/978-981-97-2238-9_18
DOI: https://doi.org/10.1007/978-981-97-2238-9_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2240-2
Online ISBN: 978-981-97-2238-9
eBook Packages: Computer Science, Computer Science (R0)