Abstract
The aim of this paper is to evaluate improvement in the classification of protein sequence data by introducing clustering as a prepossessing step. Clustering analysis was introduced to discover any possible sub-clusters that might have different patterns within the same protein class. A classification learning algorithm is then applied to each cluster to enhance the classification accuracy. Two standard benchmark datasets: caspase 3 human substrates that include cleaved and non-cleaved peptides, and the membrane proteins inner and \(\alpha\)-helical proteins were used to examine the proposed approach. Different descriptors based on the physicochemical properties of amino acids were extracted from the protein sequence data and two encoding methods were used to represent the protein sequences using the descriptors. The results show that applying clustering process prior to classification gives higher prediction accuracy than using classification alone. In addition, the result of time performance shows that the proposed approach succeeded in reducing the training time of the classification process significantly while maintaining the accuracy of prediction.
Similar content being viewed by others
References
Abeel T, Peer Y, Saeys Y (2009) Java-ml: a machine learning library. J Mach Learn Res 10:931–934
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Awad M, Khan L, Bastani F, Yen I (2004) An effective support vector machines (svm) performance using hierarchical clustering. In: Proceedings of the 16th IEEE international conference on tools with artificial intelligence. pp 663–667
Ayyash M, Tamimi H, Ashhab Y (2012) Developing a powerful in silico tool for the discovery of novel caspase-3 substrates: a preliminary screening of the human proteome. BMC Bioinformatics
Bánhalmi A, Busa-Fekete R, Kégl B (2009) A one-class classification approach for protein sequences and structures. In: International symposium on bioinformatics research and applications. Springer, pp 310–322
Cervantes J, Li X, Yu W (2006) Support vector machine classication based on fuzzy clustering for large data sets. In: MICAI’06 proceedings of the 5th Mexican international conference on artificial intelligence. pp 572–582
Chou C (2001) Prediction of protein cellular attributes using pseudo-amino-acid composition. Proteins Struct Funct Genet 24:246–255
Das S, Dawson NL, Orengo CA (2015) Diversity in protein domain superfamilies. Curr Opin Genet Dev 35:40–49
Fawcett T (2006) An introduction to roc analysis. Pattern Recognit Lett 27:861–874
Gaddam Shekhar, Phoha Vir, Balagani Kiran (2007) K-means+id3: a novel method for supervised anomaly detection by cascading k-means clustering and id3 decision tree learning methods. Knowl Data Eng IEEE Trans 19:345–354
Gao Q, Ye X, Jin Z, He J (2010) Improving discrimination of outer membrane proteins by fusing different forms of pseudo amino acid composition. Anal Biochem 398:52–59
Georgiev A (2009) Interpretable numerical descriptors of amino acid space. J Comput Biol 16(5):703–23
Gunn S (1998) Support vector machines for classification and regression. Tech Rep 14:5–16
Hellberg S, Sjostrom M, Wold S (1986) The prediction of bradykinin potentiating potency of pentapeptides. an example of a peptide quantitative structure-activity relationship. Acta Chem Scand 40:135–140
Huang Y, Kechadi T (2013) An effective hybrid learning system for telecommunication churn prediction. Expert Syst Appl 40:5635–5647
Jain P, Hirst JD (2010) Automatic structure classification of small proteins using random forest. In: BMCBI
Kawashima S, Kanehisa M (1999) Aaindex: amino acid index database. Nucleic Acids Res 27:27–36
Kyriakopoulou A, Kalamboukis T (2006) Text classification using clustering. In: Proceedings of the ECML-PKDD discovery challenge workshop
Kyriakopoulou Antonia, Kalamboukis Theodore (2008) Combining clustering with classification for spam detection in social bookmarking systems. RSDC
Laskowski Roman A, Thornton Janet M, Sternberg Michael JE (2009) The fine details of evolution. Biochem Soc Trans 374:723–726
Lingras P, West C (2004) Interval set clustering of web users with rough k-means. J Intell Inf Syst 23:5–16
Mathura V, Kolippakkam D (2005) Apdbase: amino acid physicochemical properties database. Bioinformation 1
McKee M, McKee J (2011) Biochemistry: the molecular basis of life, 5th edn. Oxford University Press, Oxford
Nanni L, Brahnam S, Lumini A (2010) High performance set of pseAAC and sequence based descriptors for protein classification. J Theoret Biol 266:1
Ohta T (2008) Gene families: multigene families and superfamilies. eLS
Ong S, Lin H, Chen Y, Li Z, Cao Z (2007) Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinf 8:1–4
Park K, Gromiha M, Horton P, Suwa M (2005) Discrimination of outer membrane proteins using support vector machines. Bioinformatics 21:223–229
Prlic A, Yates A, Bliven S et al (2012) Biojava: an open-source framework for bioinformatics. Bioinformatics 28:2693–2695
Rahideh A, Shaheed M (2011) Cancer classification using clustering based gene selection and artificial neural networks. In: 2nd International conference on control, instrumentation and automation (ICCIA)
Rajamohamed R, Manokaran J (2018) Improved credit card churn prediction based on rough clustering and supervised learning techniques. Cluster Comput 21:1–13
Ray S, Kepler T (2007) Amino acid biophysical properties in the statistical prediction of peptide-MHC class i binding. Immunome Res 3:1–10
Rojas R (1996) Neural networks: a systematic introduction. Springer, Berlin
Saidi R, Maddouri M, Nguifo E (2010) Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinf 11:1–3
Sneath P (1996) Relations between chemical structure and biological activity in peptides. J Theor Biol 12:157–195
The Mathworks (2021) Statistical toolbox 7.0. http://www.mathworks.com/help/stats/index.html
Tseng Yan Y, Li WH (2012) Classification of protein functional surfaces using structural characteristics. Proc Natl Acad Sci 1094:1170–1175
Xiao J, Tian Y, Xie L, Huang J (2019) A hybrid classification framework based on clustering. IEEE Tran Ind Inf 8:1
Xiong Y, Liu J, Zhang W, Zeng T (2012) Prediction of heme binding residues from protein sequences with integrative sequence profiles. Proteome Sci 10:1–8
Yu H, Yang J, Han J (2003) Classifying large data sets using svms with hierarchical clusters. ACM, Knowledge Discovery and Data Mining conference
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Altartouri, H., Tamimi, H. & Ashhab, Y. The impact of pre-clustering on classification of heterogeneous protein data. Netw Model Anal Health Inform Bioinforma 11, 3 (2022). https://doi.org/10.1007/s13721-021-00336-0
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s13721-021-00336-0