Abstract:
Protein secondary structure (PSS) describes the local folded structures that form within a polypeptide due to interactions among atoms of the backbone. Generally, globular proteins are divided into four classes, namely all-α, all-β, α + β, and α/β. As nearly 90% of proteins fall into these four classes, they are the ones mostly considered for the computational classification of proteins. Classification of PSS is important for several biological tasks, including protein fold recognition, tertiary structure prediction, prediction of DNA-binding sites, and reduction of the conformation search space. In this paper, we propose a machine learning–based model for secondary structure classification of proteins into four classes: all-α, all-β, α + β, and α/β. In doing so, we consider both sequence-based and structure-based features. First, mutual information (MI), a filter-based feature selection method, is used to remove redundant features, and the selected features are then used to train three different classifiers: random forest, K-nearest neighbor (KNN), and multi-layer perceptron (MLP). After that, some standard classifier combination approaches are applied to integrate the decisions made by these classifiers, and the weighted product rule is found to perform best among them. The overall accuracies obtained using the proposed model on the four standard datasets, namely 640, 1189, 25pdb, and fc699, are 86.89%, 92.93%, 91.38%, and 94.87%, respectively. The proposed model outperforms some state-of-the-art methods considered here for comparison. The high classification accuracy produced by our proposed model on the four datasets is attributed to the development of a comprehensive feature set (with redundant features eliminated through a feature selection technique), which is then passed through an ensemble consisting of three different classifiers. Assigning different weights to the outcomes of the different classifiers thus proved useful in designing a model for predicting the secondary structure class of proteins based on their sequence-based and structure-based features.
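The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the discretized-feature MI estimate, the example probabilities, and the weights are all assumptions made for illustration, since the abstract does not state how MI was estimated or how the classifier weights were chosen. The first function scores a feature by its mutual information with the class labels; the second combines per-classifier class probabilities with a weighted product rule, taken here as the product over classifiers of p_k(c) raised to the weight w_k, followed by renormalization.

```python
import math
from collections import Counter

CLASSES = ["all-alpha", "all-beta", "alpha+beta", "alpha/beta"]


def mutual_information(feature, labels):
    """MI (in nats) between one discretized feature column and the class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def weighted_product(prob_vectors, weights, eps=1e-12):
    """Combine classifier outputs as prod_k p_k(c) ** w_k, then renormalize."""
    n_classes = len(prob_vectors[0])
    # Work in log space to avoid underflow; eps guards against log(0).
    log_scores = [
        sum(w * math.log(max(p[c], eps)) for p, w in zip(prob_vectors, weights))
        for c in range(n_classes)
    ]
    top = max(log_scores)
    scores = [math.exp(s - top) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical class-probability outputs for one protein (not from the paper).
rf_probs = [0.6, 0.2, 0.1, 0.1]
knn_probs = [0.3, 0.4, 0.2, 0.1]
mlp_probs = [0.5, 0.3, 0.1, 0.1]
weights = [0.4, 0.2, 0.4]  # assumed weights, e.g. derived from validation accuracy

combined = weighted_product([rf_probs, knn_probs, mlp_probs], weights)
predicted = CLASSES[combined.index(max(combined))]
print(predicted, [round(p, 3) for p in combined])
```

In a filter-based setup such as the one the abstract describes, features would be ranked by their MI score and only the top-ranked ones passed to the classifiers. Continuous features would need to be binned before using the counting estimate shown here.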
Cite this article
Ghosh, K.K., Ghosh, S., Sen, S. et al. A two-stage approach towards protein secondary structure classification. Med Biol Eng Comput 58, 1723–1737 (2020). https://doi.org/10.1007/s11517-020-02194-w
DOI: https://doi.org/10.1007/s11517-020-02194-w