A two-stage approach towards protein secondary structure classification

  • Original Article
Medical & Biological Engineering & Computing

Abstract:

Protein secondary structure (PSS) describes the local folded structures that form within a polypeptide through interactions among backbone atoms. Globular proteins are generally divided into four classes: all-α, all-β, α + β, and α/β. Because nearly 90% of proteins fall into these four classes, they are the ones most often considered in computational classification of proteins. PSS classification supports several downstream tasks, including protein fold recognition, tertiary structure prediction, prediction of DNA-binding sites, and reduction of the conformation search space. In this paper, we propose a machine learning-based model that classifies proteins into these four classes using both sequence-based and structure-based features. First, mutual information (MI), a filter-based feature selection method, is used to remove redundant features. The selected features are then used to train three classifiers: random forest, K-nearest neighbor (KNN), and multi-layer perceptron (MLP). Next, several standard classifier combination approaches are applied to integrate the decisions of these classifiers; the weighted product rule is found to perform best among them. The overall accuracies obtained with the proposed model on four standard datasets, namely 640, 1189, 25pdb, and fc699, are 86.89%, 92.93%, 91.38%, and 94.87%, respectively. The proposed model outperforms the state-of-the-art methods considered here for comparison. We attribute its high classification accuracy on the four datasets to a comprehensive feature set, from which redundant features are eliminated by feature selection, and which is then passed through an ensemble of three different classifiers. Assigning different weights to the outputs of the different classifiers thus proves useful in designing a model that classifies the secondary structure of proteins from their sequence-based and structure-based features.
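The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how features are discretized, how many are kept, or how classifier weights are chosen, so the function names `mutual_information` and `weighted_product`, the toy feature values, the probability vectors, and the weights below are all hypothetical. The first function scores a discrete feature against the class labels by MI; the second combines per-classifier class probabilities by raising each to its classifier's weight and multiplying.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y)
        mi += p_xy * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi


def weighted_product(probabilities, weights):
    """Weighted product rule: score(c) is the product over classifiers of p_k(c) ** w_k.

    probabilities: one list of class probabilities per classifier.
    weights: one non-negative weight per classifier.
    Returns the combined scores, normalized to sum to 1.
    """
    n_classes = len(probabilities[0])
    eps = 1e-12  # guards against log(0) when a classifier outputs probability 0
    # Work in log space so that many small probabilities do not underflow.
    log_scores = [
        sum(w * math.log(max(p[c], eps)) for p, w in zip(probabilities, weights))
        for c in range(n_classes)
    ]
    top = max(log_scores)
    scores = [math.exp(s - top) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]


CLASSES = ["all-α", "all-β", "α+β", "α/β"]

# Stage 1 (toy data, hypothetical): rank two discrete features by MI with the labels.
labels = ["all-α", "all-α", "all-β", "all-β"]
informative = [0, 0, 1, 1]    # tracks the label exactly
uninformative = [0, 1, 0, 1]  # independent of the label
mi_scores = {
    "informative": mutual_information(informative, labels),
    "uninformative": mutual_information(uninformative, labels),
}

# Stage 2 (hypothetical classifier outputs for one protein, one row per classifier).
probs = [
    [0.60, 0.20, 0.10, 0.10],  # stand-in for random forest
    [0.30, 0.40, 0.20, 0.10],  # stand-in for KNN
    [0.50, 0.30, 0.10, 0.10],  # stand-in for MLP
]
weights = [0.5, 0.2, 0.3]  # hypothetical; the abstract does not give the weights
combined = weighted_product(probs, weights)
predicted = CLASSES[combined.index(max(combined))]
```

In this toy run the second classifier alone would pick all-β, but the weighted product favours all-α because the two more heavily weighted classifiers agree on it. Continuous features would need to be discretized, or scored with a density-based MI estimator, before the first function applies.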


Author information

Corresponding author

Correspondence to Kushal Kanti Ghosh.

About this article

Cite this article

Ghosh, K.K., Ghosh, S., Sen, S. et al. A two-stage approach towards protein secondary structure classification. Med Biol Eng Comput 58, 1723–1737 (2020). https://doi.org/10.1007/s11517-020-02194-w

  • DOI: https://doi.org/10.1007/s11517-020-02194-w
