Research article
A highly accurate protein structural class prediction approach using auto cross covariance transformation and recursive feature elimination

https://doi.org/10.1016/j.compbiolchem.2015.08.012

Highlights

  • Prediction performance of protein structural class has been improved.

  • A high-quality feature extraction technique has been designed.

  • A recursive feature selection has been used to reduce feature abundance.

Abstract

Structural class characterizes the overall folding type of a protein or its domain. Many methods have been proposed in recent years to improve the prediction accuracy of protein structural class, but prediction remains challenging for low-similarity sequences. In this study, we introduce a feature extraction technique based on the auto cross covariance (ACC) transformation of the position-specific score matrix (PSSM) to represent a protein sequence. Support vector machine-recursive feature elimination (SVM-RFE) is then adopted to select the top K features according to their importance, and these features are input to a support vector machine (SVM) to conduct the prediction. The proposed method is evaluated using the jackknife test on three low-similarity datasets, i.e., D640, 1189 and 25PDB. The method achieves overall accuracies of 97.2%, 96.2%, and 93.3% on these three datasets, respectively, which are higher than those of most existing methods. This suggests that the proposed method could serve as a very cost-effective tool for predicting protein structural class, especially for low-similarity datasets.

Introduction

Knowledge of protein structural class plays an important role in the prediction of secondary structure and in function analysis from amino acid sequence information (Anand et al., 2008). Nowadays, the most frequently used classification of protein structural classes can be found in the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995). There are 110,800 protein domains with known structural class in SCOP, and about 90% of them belong to the four major classes: all-α, all-β, α/β and α+β. With the rapid development of genomics and proteomics, the number of newly discovered protein sequences is growing exponentially, which has created a large gap between the number of sequence-known and structure-known proteins. Experimental determination of protein structure is currently costly and time-consuming and thus cannot cope with the demand for rapid classification. Hence, developing reliable and accurate computational methods to determine protein structural class remains a great challenge.

As a typical pattern recognition problem, computational prediction of protein structural class generally consists of two main steps: protein feature representation and algorithm design for classification. For the first step, previous studies have shown that sequence features can be represented in many different ways, including amino acid composition (Chou, 1999, Zhou, 1998), pseudo amino acid (PseAA) composition (Chou, 2001, Li et al., 2009), polypeptide composition (Luo et al., 2002, Sun and Huang, 2006), functional domain composition (Chou and Cai, 2004), amino acid sequence reverse encoding (Yang et al., 2009), position-specific score matrix (PSSM) (Chen et al., 2008, Ding et al., 2014, Liu et al., 2010, Liu et al., 2012), and predicted secondary structure information (Dai et al., 2013, Dehzangi et al., 2014, Kong et al., 2014, Kurgan et al., 2008, Mizianty and Kurgan, 2009, Yang et al., 2010). It is worth mentioning that, through quantitative analysis, Dai and coauthors verified that exploiting the position information of predicted secondary structural elements is a promising way to improve protein structural class prediction (Dai et al., 2013). For the latter step, a wide range of classification algorithms have been used to perform the prediction, such as neural networks (Cai and Zhou, 2000), support vector machine (SVM) (Cai et al., 2001, Kong and Zhang, 2014, Li et al., 2008, Nanni et al., 2014), fuzzy clustering (Shen et al., 2005), fuzzy k-nearest neighbor (Zhang et al., 2008, Zheng et al., 2010), Bayesian classification (Wang and Yuan, 2000), logistic regression (Jahandideh et al., 2007, Kurgan and Chen, 2007, Kurgan and Homaeian, 2006), rough sets (Cao et al., 2006), and classifier fusion techniques (Cai et al., 2006, Chen et al., 2006, Chen et al., 2009, Dehzangi et al., 2013). Early methods can achieve prediction accuracies above 90% when tested on datasets with high sequence identities. However, they perform poorly on low-similarity datasets, with accuracies between 50% and 70% (Kurgan et al., 2008). To address this problem, several methods have been proposed that incorporate various features, such as PSSM, predicted secondary structure and physicochemical properties, to improve prediction accuracies on low-similarity datasets (Dehzangi et al., 2013, Dehzangi et al., 2014, Kurgan et al., 2008, Liu et al., 2010, Wang et al., 2015, Yang et al., 2010). Nevertheless, most studies that rely only on predicted secondary structure to enhance the accuracy could not achieve results much better than 80% (Kurgan et al., 2008, Yang et al., 2010). This may be due to the limited accuracy (about 80%) of protein secondary structure prediction by PSIPRED (Jones, 1999). On the other hand, since the performance of the PSIPRED algorithm relies mainly on PSSM, the PSSM profile provides more important and original discriminatory information for protein structural class prediction. In our previous study, we extracted auto-covariance variables from the PSSM profile and obtained favorable prediction accuracy even when the predicted secondary structure was not utilized (Liu et al., 2012).

In this study, to further improve the prediction accuracy of protein structural class, we extract both auto-covariance variables and cross-covariance variables from the PSSM profile by auto cross covariance (ACC) transformation. The flowchart of the proposed method is depicted in Fig. 1, which presents the pipeline from the query sequence to the final output, including the intermediate steps. Firstly, the PSSM profile generated by the PSI-BLAST program (Altschul et al., 1997) is transformed into a fixed-length feature vector by ACC transformation. Secondly, support vector machine-recursive feature elimination (SVM-RFE) is applied for feature selection, and the reduced vectors are input to an SVM classifier to perform the prediction. Finally, results of the jackknife test on three widely used benchmark datasets suggest that the proposed method yields substantial improvements in prediction accuracy compared with most published results.
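The ACC step can be illustrated with a minimal sketch. The exact formula and normalization used in the paper are not shown in this excerpt, so the code below assumes the commonly used definition: each PSSM column is centered by its mean over the sequence, and for every lag g from 1 to G the averaged product between column i1 at position j and column i2 at position j + g is computed. Entries with i1 = i2 are the auto-covariance variables and the remaining entries are the cross-covariance variables. The function name `acc_transform` and the argument name `max_lag` (standing in for G) are hypothetical.

```python
import numpy as np

def acc_transform(pssm, max_lag):
    """Map an L x 20 PSSM to a fixed-length vector (assumed ACC definition).

    For each lag g in 1..max_lag, a 20 x 20 matrix of lagged covariances is
    computed; its diagonal holds auto-covariance (AC) terms and its
    off-diagonal entries hold cross-covariance (CC) terms.
    """
    pssm = np.asarray(pssm, dtype=float)
    length, n_cols = pssm.shape
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must be in [1, sequence length - 1]")

    # Center every column by its average score along the sequence.
    centered = pssm - pssm.mean(axis=0)

    blocks = []
    for lag in range(1, max_lag + 1):
        # cov[i1, i2] = mean over j of centered[j, i1] * centered[j + lag, i2]
        cov = centered[:-lag].T @ centered[lag:] / (length - lag)
        blocks.append(cov.ravel())

    # Output length is n_cols * n_cols * max_lag, independent of L.
    return np.concatenate(blocks)


# Usage with a random stand-in for a real PSI-BLAST profile (illustrative only).
rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(50, 20))
features = acc_transform(toy_pssm, max_lag=4)
print(features.shape)
```

Under this assumed definition, a 20-column PSSM yields 400 values per lag, so proteins of different lengths are mapped to vectors of the same dimension, which is what allows them to be fed to an SVM.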

Datasets

To evaluate the prediction accuracy of the proposed method and compare it with those of existing methods, three widely used datasets are adopted in our work: D640 (Chen et al., 2008), 1189 (Wang and Yuan, 2000) and 25PDB (Kurgan and Homaeian, 2006), with sequence similarity lower than 25%, 40% and 25%, respectively. The D640 dataset contains 640 protein sequences, comprising 138 all-α proteins, 154 all-β proteins, 177 α/β proteins and 171 α+β proteins. The 1189 dataset includes

Results and discussion

To evaluate the proposed method, we first discuss the selection of the parameter G and the top K features, then calculate the prediction accuracies of our method on three low-similarity datasets by the jackknife test, and finally conduct an extensive performance comparison between the current method and major existing methods.
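The evaluation protocol can be sketched with scikit-learn, where the jackknife test corresponds to leave-one-out cross-validation. This is an illustrative sketch under several assumptions, not the authors' implementation: the data are synthetic (generated with `make_classification`), K = 10 is arbitrary, the ranking SVM uses a linear kernel because `RFE` needs feature weights, the RBF kernel of the final classifier is a guess, and feature selection is placed inside the pipeline so that it is refitted for every held-out sample, which may differ from the procedure in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for ACC feature vectors with four structural classes.
X, y = make_classification(
    n_samples=60,
    n_features=30,
    n_informative=8,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

TOP_K = 10  # hypothetical value; the paper tunes K per dataset

pipeline = Pipeline([
    # SVM-RFE: repeatedly fit a linear SVM and drop the lowest-weight features.
    ("rfe", RFE(estimator=SVC(kernel="linear"),
                n_features_to_select=TOP_K, step=2)),
    # Final classifier trained on the K retained features (kernel assumed).
    ("svm", SVC(kernel="rbf")),
])

# Jackknife test: each sample is held out once and predicted by a model
# trained on all remaining samples; every score is therefore 0 or 1.
scores = cross_val_score(pipeline, X, y, cv=LeaveOneOut())
accuracy = scores.mean()
print(f"jackknife overall accuracy: {accuracy:.3f}")
```

Keeping the selection step inside the pipeline avoids using the held-out sample when ranking features; choosing G and K would then amount to repeating this loop over a grid of candidate values.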

Conclusions

Although many efforts have been made so far, prediction of protein structural class for low-similarity sequences remains a challenging problem in bioinformatics. In this regard, we apply ACC and SVM-RFE to further improve prediction accuracy. In this paper, ACC transformation is used to convert the PSSM profile into a fixed-length feature vector. Then, these features are ranked by SVM-RFE based on their importance and the optimal top K features are input to an SVM classifier to perform the

Acknowledgements

This work was supported by the National Natural Science Foundation of China (41376135), the Doctoral Fund of Ministry of Education of China (20133104110006), the Innovation Program of Shanghai Municipal Education Commission (No. 13YZ098), the Foundation for University Youth Teachers of Shanghai (No. ZZhy12028) and the Doctoral Fund of Shanghai Ocean University.

References (56)

  • S. Jahandideh et al. Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. Biophys. Chem. (2007)
  • D.T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999)
  • L. Kong et al. Novel structure-driven features for accurate prediction of protein structural class. Genomics (2014)
  • L. Kong et al. Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou's pseudo amino acid composition. J. Theor. Biol. (2014)
  • L. Kurgan et al. Prediction of protein structural class for the twilight zone sequences. Biochem. Biophys. Res. Commun. (2007)
  • L.A. Kurgan et al. Prediction of structural classes for protein sequences and domains—impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognit. (2006)
  • T. Liu et al. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie (2010)
  • X. Liu et al. Protein remote homology detection based on auto-cross covariance transformation. Comput. Biol. Med. (2011)
  • A.G. Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
  • L. Nanni et al. Prediction of protein structure classes by incorporating different protein descriptors into general Chou's pseudo amino acid composition. J. Theor. Biol. (2014)
  • H.B. Shen et al. Using supervised fuzzy clustering to predict protein structural classes. Biochem. Biophys. Res. Commun. (2005)
  • J. Wang et al. High-accuracy prediction of protein structural classes using PseAA structural properties and secondary structural patterns. Biochimie (2014)
  • J.R. Wang et al. Prediction of protein structural classes for low-similarity sequences using reduced PSSM and position-based secondary structural features. Gene (2015)
  • S. Wold et al. DNA and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least-squares projections to latent structures. Anal. Chim. Acta (1993)
  • J.Y. Yang et al. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J. Theor. Biol. (2009)
  • T.L. Zhang et al. Prediction protein structural classes with pseudo-amino acid composition: Approximate entropy and hydrophobicity pattern. J. Theor. Biol. (2008)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. (1997)
  • Y.D. Cai et al. Support vector machines for predicting protein structural class. BMC Bioinform. (2001)