A highly accurate protein structural class prediction approach using auto cross covariance transformation and recursive feature elimination

doi:10.1016/j.compbiolchem.2015.08.012

Computational Biology and Chemistry

Volume 59, Part A, December 2015, Pages 95-100

Highlights

• Prediction performance of protein structural class has been improved.
• A high-quality feature extraction technique has been designed.
• A recursive feature selection has been used to reduce feature abundance.

Abstract

Structural class characterizes the overall folding type of a protein or its domain. Many methods have been proposed in recent years to improve the prediction accuracy of protein structural class, but prediction remains a challenge for low-similarity sequences. In this study, we introduce a feature extraction technique based on the auto cross covariance (ACC) transformation of the position-specific score matrix (PSSM) to represent a protein sequence. Support vector machine-recursive feature elimination (SVM-RFE) is then adopted to select the top K features according to their importance, and these features are input to a support vector machine (SVM) to conduct the prediction. The proposed method is evaluated using the jackknife test on three low-similarity datasets, i.e., D640, 1189 and 25PDB. The method achieves overall accuracies of 97.2%, 96.2%, and 93.3% on these three datasets, respectively, which are higher than those of most existing methods. This suggests that the proposed method could serve as a very cost-effective tool for predicting protein structural class, especially for low-similarity datasets.

Introduction

Knowledge of protein structural class plays an important role in the prediction of secondary structure and in function analysis from amino acid sequence information (Anand et al., 2008). Nowadays, the most frequently used classification of protein structural classes can be found in the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995). There are 110,800 protein domains with known structural class in SCOP, and about 90% of them belong to the four major classes: all-α, all-β, α/β and α+β. With the rapid development of genomics and proteomics, the number of newly discovered protein sequences is growing exponentially, which has opened a large gap between the number of sequence-known and structure-known proteins. Experimental determination of protein structure is currently costly and time-consuming and thus cannot cope with the demand for rapid classification. Hence, developing reliable and accurate computational methods to determine protein structural class remains a great challenge.

As a typical pattern recognition problem, computational prediction of protein structural class generally consists of two main steps: protein feature representation and algorithm design for classification. For the first step, previous studies have shown that sequence features can be represented in many different ways, including amino acid composition (Chou, 1999, Zhou, 1998), pseudo amino acid (PseAA) composition (Chou, 2001, Li et al., 2009), polypeptide composition (Luo et al., 2002, Sun and Huang, 2006), functional domain composition (Chou and Cai, 2004), amino acid sequence reverse encoding (Yang et al., 2009), position-specific score matrix (PSSM) (Chen et al., 2008, Ding et al., 2014, Liu et al., 2010, Liu et al., 2012), and predicted secondary structure information (Dai et al., 2013, Dehzangi et al., 2014, Kong et al., 2014, Kurgan et al., 2008, Mizianty and Kurgan, 2009, Yang et al., 2010). It is worth mentioning that, through quantitative analysis, Dai and coauthors verified that exploring the position information of predicted secondary structural elements is a promising way to improve protein structural class prediction (Dai et al., 2013). For the latter step, a wide range of classification algorithms have been used to perform the prediction, such as neural networks (Cai and Zhou, 2000), support vector machine (SVM) (Cai et al., 2001, Kong and Zhang, 2014, Li et al., 2008, Nanni et al., 2014), fuzzy clustering (Shen et al., 2005), fuzzy k-nearest neighbor (Zhang et al., 2008, Zheng et al., 2010), Bayesian classification (Wang and Yuan, 2000), logistic regression (Jahandideh et al., 2007, Kurgan and Chen, 2007, Kurgan and Homaeian, 2006), rough sets (Cao et al., 2006), and classifier fusion techniques (Cai et al., 2006, Chen et al., 2006, Chen et al., 2009, Dehzangi et al., 2013). Early methods can achieve prediction accuracies above 90% when tested on datasets with high sequence identities. However, they perform poorly on low-similarity datasets, with accuracies between 50% and 70% (Kurgan et al., 2008). To address this problem, several methods that incorporate various features such as PSSM, predicted secondary structure and physical-chemical properties have been proposed to improve prediction accuracies on low-similarity datasets (Dehzangi et al., 2013, Dehzangi et al., 2014, Kurgan et al., 2008, Liu et al., 2010, Wang et al., 2015, Yang et al., 2010). Nevertheless, most studies that rely only on predicted secondary structure have not achieved accuracies much above 80% (Kurgan et al., 2008, Yang et al., 2010). This may be due to the limited accuracy (about 80%) of protein secondary structure prediction by PSIPRED (Jones, 1999). On the other hand, since the performance of the PSIPRED algorithm relies mainly on the PSSM, the PSSM profile provides more important and original discriminatory information for protein structural class prediction. In our previous study, we extracted auto-covariance variables from the PSSM profile and obtained favorable prediction accuracy even though the predicted secondary structure was not utilized (Liu et al., 2012).

In this study, in order to further improve the prediction accuracy of protein structural class, we extract both auto-covariance variables and cross-covariance variables from the PSSM profile by auto cross covariance (ACC) transformation. The flowchart of the proposed method is depicted in Fig. 1, which presents the pipeline that goes from the query sequence to the final output as well as intermediate steps. Firstly, the PSSM profile generated by PSI-BLAST program (Altschul et al., 1997) is transformed into a fixed-length feature vector by ACC transformation. Secondly, support vector machine-recursive feature elimination (SVM-RFE) is applied for feature selection and reduced vectors are input to an SVM classifier to perform the prediction. Finally, results by the jackknife test on three widely used benchmark datasets suggest that the proposed method yields substantial improvements in prediction accuracies compared with most published results.
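The first step can be illustrated with a minimal sketch. The exact formulas are not reproduced in this excerpt, so the code below assumes the definition of ACC commonly used in profile-based work: for an L × 20 PSSM with column means taken over the sequence, the auto-covariance of column i at lag g averages the products of centered scores at positions j and j + g within that column, and the cross-covariance does the same across two different columns. The function name `acc_transform` and the parameter `max_lag` are hypothetical; `max_lag` presumably corresponds to the parameter G discussed in the Results section. Any rescaling of the raw PSSM scores is omitted here.

```python
from typing import List


def acc_transform(pssm: List[List[float]], max_lag: int) -> List[float]:
    """Turn an L x C score matrix into a fixed-length ACC feature vector.

    Assumed (standard) definition, not taken from this excerpt:
      AC(c, g)      = mean over j of  s[j][c]  * s[j+g][c]
      CC(c1, c2, g) = mean over j of  s[j][c1] * s[j+g][c2],  c1 != c2
    where s holds column-centered scores and j runs over the L - g valid positions.
    The result has C*C*max_lag entries (400*max_lag for a 20-column PSSM).
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")

    # Center every column by its mean over the whole sequence.
    means = [sum(row[c] for row in pssm) / length for c in range(n_cols)]
    centered = [[row[c] - means[c] for c in range(n_cols)] for row in pssm]

    def covariance(c1: int, c2: int, lag: int) -> float:
        span = length - lag
        total = sum(centered[j][c1] * centered[j + lag][c2] for j in range(span))
        return total / span

    features: List[float] = []
    # Auto-covariance block: same column at both positions.
    for lag in range(1, max_lag + 1):
        for c in range(n_cols):
            features.append(covariance(c, c, lag))
    # Cross-covariance block: every ordered pair of distinct columns.
    for lag in range(1, max_lag + 1):
        for c1 in range(n_cols):
            for c2 in range(n_cols):
                if c1 != c2:
                    features.append(covariance(c1, c2, lag))
    return features
```

Because the output length depends only on the number of columns and `max_lag`, sequences of different lengths map to vectors of the same size, which is what allows them to be passed to a feature selector and an SVM.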

Datasets

To evaluate the prediction accuracy of the proposed method and compare it with those of existing methods, three widely used datasets are adopted in our work: D640 (Chen et al., 2008), 1189 (Wang and Yuan, 2000) and 25PDB (Kurgan and Homaeian, 2006), with sequence similarity lower than 25%, 40% and 25%, respectively. The D640 dataset contains 640 protein sequences, consisting of 138 all-α proteins, 154 all-β proteins, 177 α/β proteins and 171 α+β proteins. The 1189 dataset includes

Results and discussion

To evaluate the proposed method, we first discuss the selection of the parameter G and of the top K features, then calculate the prediction accuracies of our method on three low-similarity datasets by the jackknife test, and finally conduct an extensive performance comparison between the current method and major existing methods.
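The jackknife test mentioned above is leave-one-out evaluation: each protein is held out in turn, the classifier is trained on all remaining proteins, and the held-out protein is predicted. The sketch below shows only this evaluation loop. To stay dependency-free it uses a nearest-centroid rule as a stand-in classifier, which is an assumption made for illustration and not the SVM used in the paper; the names `jackknife_accuracy` and `nearest_centroid_predict` are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def nearest_centroid_predict(train_x: List[Vector], train_y: List[str], query: Vector) -> str:
    """Stand-in classifier (not an SVM): pick the class with the closest mean vector."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1

    def squared_distance(label: str) -> float:
        return sum((s / counts[label] - q) ** 2 for s, q in zip(sums[label], query))

    # sorted() makes tie-breaking deterministic.
    return min(sorted(sums), key=squared_distance)


def jackknife_accuracy(
    features: List[Vector],
    labels: List[str],
    predict: Callable[[List[Vector], List[str], Vector], str] = nearest_centroid_predict,
) -> float:
    """Overall accuracy under the jackknife (leave-one-out) test."""
    correct = 0
    for i in range(len(features)):
        # Train on everything except sample i, then predict sample i.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Any classifier with the same call shape can be passed as `predict`. For results comparable to the paper, feature selection would also have to be repeated inside each fold so that the held-out protein never influences the choice of features; the excerpt does not say how this was handled.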

Conclusions

Though many efforts have been made so far, prediction of protein structural class for low-similarity sequences still remains a challenging problem in bioinformatics. In this regard, we apply ACC and SVM-RFE to further improve prediction accuracy. In this paper, ACC transformation is used to convert the PSSM profile into a fixed-length feature vector. Then, these features are ranked by SVM-RFE based on their importance and the optimal top K features are input to an SVM classifier to perform the

Acknowledgements

This work was supported by the National Natural Science Foundation of China (41376135), the Doctoral Fund of Ministry of Education of China (20133104110006), the Innovation Program of Shanghai Municipal Education Commission (No. 13YZ098), the Foundation for University Youth Teachers of Shanghai (No. ZZhy12028) and the Doctoral Fund of Shanghai Ocean University.

References (56)

A. Anand et al. Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. J. Theor. Biol. (2008)
Y.D. Cai et al. Prediction of protein structural classes by neural network. Biochimie (2000)
Y.D. Cai et al. Using LogitBoost classifier to predict protein structural classes. J. Theor. Biol. (2006)
C. Chen et al. Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network. Anal. Biochem. (2006)
K.C. Chou. A key driving force in determination of protein structural classes. Biochem. Biophys. Res. Commun. (1999)
K.C. Chou. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. (2011)
K.C. Chou et al. Predicting protein structural class by functional domain composition. Biochem. Biophys. Res. Commun. (2004)
K.C. Chou et al. Recent progress in protein subcellular location prediction. Anal. Biochem. (2007)
S. Ding et al. A protein structural classes prediction method based on PSI-BLAST profile. J. Theor. Biol. (2014)
M. Hayat et al. Prediction of protein structure classes using hybrid space of multi-profile Bayes and bi-gram probability feature spaces. J. Theor. Biol. (2014)
S. Jahandideh et al. Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. Biophys. Chem. (2007)
D.T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. (1999)
L. Kong et al. Novel structure-driven features for accurate prediction of protein structural class. Genomics (2014)
L. Kong et al. Accurate prediction of protein structural classes by incorporating predicted secondary structure information into the general form of Chou's pseudo amino acid composition. J. Theor. Biol. (2014)
L. Kurgan et al. Prediction of protein structural class for the twilight zone sequences. Biochem. Biophys. Res. Commun. (2007)
L.A. Kurgan et al. Prediction of structural classes for protein sequences and domains: impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognit. (2006)
T. Liu et al. Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile. Biochimie (2010)
X. Liu et al. Protein remote homology detection based on auto-cross covariance transformation. Comput. Biol. Med. (2011)
A.G. Murzin et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995)
L. Nanni et al. Prediction of protein structure classes by incorporating different protein descriptors into general Chou's pseudo amino acid composition. J. Theor. Biol. (2014)
H.B. Shen et al. Using supervised fuzzy clustering to predict protein structural classes. Biochem. Biophys. Res. Commun. (2005)
J. Wang et al. High-accuracy prediction of protein structural classes using PseAA structural properties and secondary structural patterns. Biochimie (2014)
J.R. Wang et al. Prediction of protein structural classes for low-similarity sequences using reduced PSSM and position-based secondary structural features. Gene (2015)
S. Wold et al. DNA and peptide sequences and chemical processes multivariately modeled by principal component analysis and partial least-squares projections to latent structures. Anal. Chim. Acta (1993)
J.Y. Yang et al. Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. J. Theor. Biol. (2009)
T.L. Zhang et al. Prediction protein structural classes with pseudo-amino acid composition: Approximate entropy and hydrophobicity pattern. J. Theor. Biol. (2008)
S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. (1997)
Y.D. Cai et al. Support vector machines for predicting protein structural class. BMC Bioinform. (2001)
