Enzyme classification using multiclass support vector machine and feature subset selection
Introduction
Proteins are important macromolecules responsible for almost all biological processes in a cell, such as growth, function, metabolism and maintenance. With the large number of biological sequences now available from different sequencing projects (Koonin et al., 1998a, Fetrow and Skolnick, 1998), the challenge for scientists is to determine the functions of newly generated protein sequences in order to understand the underlying biological processes (Siomi and Dreyfuss, 1997, Draper, 1999, Koonin et al., 1998b).
There are many methods available for functional annotation of newly sequenced proteins. Wet-lab functional characterization of proteins is time consuming and expensive, whereas computational approaches are fast and cost effective. The classical computational approaches for function prediction use programs like FASTA (Pearson and Lipman, 1988) and PSI-BLAST (Altschul et al., 1990), which are based on homology between annotated sequences and the unannotated sequence, i.e., the new sequence. Methods from comparative genomics are also used for the prediction of protein function (Pellegrini et al., 1999). They consider proteins to be functionally linked if they have similar phylogenetic profiles (Marcotte et al., 2000, Zheng et al., 2002). Some authors, such as Lockhart et al. and Schena et al. (Lockhart et al., 1996, Schena et al., 1995), designed clustering algorithms to be used on DNA-microarray data to predict protein function, based on the assumption that genes with correlated expression profiles are functionally related (Eisen et al., 1998, Zhou et al., 2002). Protein-protein interaction networks are also used for prediction of protein function with a nearest-neighbor approach (Lin et al., 2006), based on the fact that proteins may interact for a common purpose. However, because protein-protein interaction data are noisy, the prediction accuracy is low. Some methods (Tatusov et al., 2001, Jones et al., 2014) predict the function of a protein by classifying it into a specific functional class based on sequence similarity. These methods work well if the similarity between sequences is significant. However, the prediction becomes random if the similarity between two sequences falls below a threshold.
The support vector machine (SVM) method (Vapnik, 2013) is used for protein fold recognition (Ding and Dubchak, 2001, Cai et al., 2002a), protein structure prediction (Yuan et al., 2002, Hua and Sun, 2001, Cai et al., 2002b), protein–protein interaction prediction, and protein function classification (Cai et al., 2003). In these problems, the physico-chemical properties of proteins, computed from their sequences, are used as input to the method. Cai et al. (2003) used a binary SVM classifier to predict the functional class of a protein. They considered functional classes such as RNA-binding proteins, protein homodimers, drug absorption proteins, drug delivery proteins, drug excretion proteins, Class-I drug metabolizing enzymes, and Class-II drug metabolizing enzymes. They used 1808 physico-chemical properties such as hydrophobicity, polarity, polarizability, charge, surface tension, and secondary structure to represent a protein sequence, and obtained accuracies in the range 88%–99% for different classes. However, because the dimension of the feature vector is very high, the computation is time consuming.
In our model, the first step is a binary classifier that classifies a protein sequence as enzyme or non-enzyme. In the second step, a multi-class classifier predicts the functional class of the protein out of the six enzyme classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. To implement the model, 32 physico-chemical properties are initially considered, including number of amino acids, theoretical pI, amino acid compositions (20), number of negatively charged residues, number of positively charged residues, atomic compositions (5), aliphatic index, and hydrophobicity. Since many of the features may carry redundant information, the Sequential Forward Floating Selection (SFFS) algorithm (Pudil et al., 1994), the Orthogonal Forward Selection (OFS) algorithm (Mao, 2004), and SVM Recursive Feature Elimination (SVM-RFE) (Guyon et al., 2002, Rakotomamonjy, 2003) are applied to identify the most significant features for classifying the proteins. SFFS selects the amino acid compositions Arg(A), Asn(N), Cys(C), Gln(Q), Glu(E), Ile(I), Leu(L), Lys(K), Met(M), Phe(F), Pro(P), Ser(S), Thr(T), Trp(W), Tyr(Y), Val(V) and the atomic compositions Hydrogen(H), Nitrogen(N), Oxygen(O), Sulfur(S) as the more significant features, whereas OFS selects aliphatic index, number of amino acids, the atomic compositions Carbon(C), Oxygen(O), the amino acid compositions Cys(C), Asp(D), Arg(R), Phe(F), Gly(G), Pro(P), His(H), Ile(I), Thr(T), Trp(W), Leu(L), Gln(Q), Lys(K), Tyr(Y), number of positively charged residues, and number of negatively charged residues. When SVM-RFE is applied, it drops seven features (number of amino acids, theoretical pI, Cys(C), Gly(G), Ile(I), Carbon(C), Sulfur(S)) to yield 25 significant features, and with these features an accuracy range of 90.6149%–93.5275% is obtained.
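As an illustration of how a few of the sequence-derived properties listed above can be computed, the following Python sketch derives the sequence length, the 20 amino acid compositions, the charged-residue counts, and the aliphatic index from a one-letter sequence. It is not the authors' feature pipeline. The function name `sequence_features` and the feature keys are hypothetical, the charged-residue convention (Asp and Glu as negative, Arg and Lys as positive) is an assumption, and the aliphatic index uses the standard formula AI = X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)) with X in mole percent. Theoretical pI, atomic compositions, and hydrophobicity are left out.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def sequence_features(seq):
    """Return a dict of simple sequence-derived features (illustrative subset)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)

    # Amino acid composition as mole percent of each standard residue.
    comp = {aa: 100.0 * counts[aa] / n for aa in AMINO_ACIDS}

    features = {"length": n}
    features.update({"pct_" + aa: comp[aa] for aa in AMINO_ACIDS})

    # Assumed convention: Asp + Glu are negative, Arg + Lys are positive.
    features["n_negative"] = counts["D"] + counts["E"]
    features["n_positive"] = counts["R"] + counts["K"]

    # Aliphatic index from the mole percent of Ala, Val, Ile and Leu.
    features["aliphatic_index"] = (
        comp["A"] + 2.9 * comp["V"] + 3.9 * (comp["I"] + comp["L"])
    )
    return features


if __name__ == "__main__":
    example = sequence_features("AVILDEKR")  # toy sequence, not a real protein
    print(example["length"], example["n_negative"], example["n_positive"])
    print(example["aliphatic_index"])
```

Each protein would then be represented by the values of this dictionary in a fixed key order, giving one numeric row per sequence.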
Results of these three algorithms show that Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O) play a major role in the functional classification of proteins. Using all 32 features, i.e., without feature selection (WFS), an accuracy ranging from 90.9699% to 93.6455% is obtained, whereas the Sequential Forward Floating Selection (SFFS) algorithm with 20 significant features gives an accuracy from 90.3010% to 92.3077%, and the Orthogonal Forward Selection (OFS) algorithm with 20 significant features gives an accuracy from 89.6321% to 94.3144%. Our model found that 20 (atomic and amino acid compositions) out of 32 physico-chemical properties are sufficient to predict the functional class of a protein with high accuracy. The performance of our model is compared with the Random Forest classification algorithm (Liaw and Wiener, 2002). The average accuracy obtained by the Random Forest model is 86.7314%. All three models discussed above have better average accuracy than the Random Forest model.
The rest of the paper is organized as follows. Section 2 presents the multiclass support vector machine, the Sequential Forward Floating Selection algorithm, and the Orthogonal Forward Selection algorithm. Section 3 describes the proposed model. Section 4 discusses the results and performance of our model, and Section 5 concludes the work.
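The floating search idea behind SFFS can be sketched in a few lines of Python. This is a simplified reading of the floating forward search of Pudil et al. (1994), not the authors' implementation: after each forward inclusion, features are dropped again as long as the reduced subset strictly beats the best subset already seen at that smaller size. The function name `sffs`, the toy weights, and the `toy_score` criterion are hypothetical. In practice the criterion would more plausibly be something like the cross-validated accuracy of an SVM trained on the candidate subset, which is an assumption here.

```python
def sffs(n_features, target_size, score):
    """Sequential forward floating selection over indices 0..n_features-1.

    score(subset) takes a list of feature indices and returns a number,
    where higher is better. Returns the best subset of size target_size.
    """
    if not 0 < target_size <= n_features:
        raise ValueError("target_size must be in 1..n_features")
    selected = []
    best = {}  # subset size -> (score, subset) for the best subset seen so far

    def record(subset):
        s = score(subset)
        if len(subset) not in best or s > best[len(subset)][0]:
            best[len(subset)] = (s, list(subset))

    while len(selected) < target_size:
        # Inclusion: add the single feature that gives the highest score.
        remaining = [f for f in range(n_features) if f not in selected]
        selected.append(max(remaining, key=lambda f: score(selected + [f])))
        record(selected)

        # Conditional exclusion: drop a feature while doing so strictly
        # beats the best subset already known for the smaller size.
        while len(selected) > 2:
            drop = max(
                selected,
                key=lambda f: score([g for g in selected if g != f]),
            )
            reduced = [g for g in selected if g != drop]
            if score(reduced) > best[len(reduced)][0]:
                selected = reduced
                record(selected)
            else:
                break
    return sorted(best[target_size][1])


# Toy criterion (hypothetical): each feature has a weight, and features
# 0 and 1 are redundant, so selecting both incurs a penalty.
weights = [90, 80, 10, 70, 5]


def toy_score(subset):
    total = sum(weights[f] for f in subset)
    if 0 in subset and 1 in subset:
        total -= 75
    return total


if __name__ == "__main__":
    print(sffs(len(weights), 3, toy_score))
```

With the toy criterion, the redundant feature 1 is passed over in favour of weaker but non-redundant features, which is the behaviour feature subset selection is meant to provide.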
Section snippets
Multiclass support vector machine
The support vector machine described in Appendix A is a binary classifier, i.e., it classifies objects belonging to two distinct classes. However, real-world problems often require classifying objects into more than two classes. There are several approaches to using SVM for multiclass classification. The following are the frequently used approaches.
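Two widely used ways of building a multiclass classifier from binary ones are one-vs-one voting and one-vs-rest scoring. Whether these are exactly the approaches the paper goes on to list is an assumption, since the snippet is truncated. The Python sketch below shows only the combination logic. The binary classifiers are replaced by simple distance-based stand-ins on a one-dimensional input, and all names (`one_vs_one_predict`, `one_vs_rest_predict`, `centers`, `nearer`) are hypothetical. In a real system each stand-in would be a trained binary SVM.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over k(k-1)/2 pairwise classifiers.

    pairwise[(a, b)](x) must return either a or b.
    Ties are broken by the order of `classes`.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])


def one_vs_rest_predict(x, classes, scorers):
    """Pick the class whose 'class vs rest' scorer gives the highest score."""
    return max(classes, key=lambda c: scorers[c](x))


# Stand-in binary classifiers on a 1-D input (illustrative only).
centers = {"A": 0.0, "B": 5.0, "C": 10.0}
classes = list(centers)


def nearer(a, b):
    # Pairwise stand-in: choose the class whose center is closer to x.
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


pairwise = {(a, b): nearer(a, b) for a, b in combinations(classes, 2)}
scorers = {c: (lambda x, c=c: -abs(x - centers[c])) for c in classes}

if __name__ == "__main__":
    print(one_vs_one_predict(4.2, classes, pairwise))
    print(one_vs_rest_predict(4.2, classes, scorers))
```

For six enzyme classes, one-vs-one needs 15 pairwise classifiers while one-vs-rest needs 6, so the choice trades the number of models against the size and balance of each training problem.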
Proposed method
We are given a data set of protein sequences and their class labels. Each sequence is represented by 32 physico-chemical properties, so the dataset consists of pairs of a 32-dimensional feature vector and a class label, where the label takes one of two values for classifying a protein as enzyme or non-enzyme, or one of six values for classifying the enzyme class. The objective of our work is to develop a model
Results and discussions
To train the binary support vector machine classifier, 200 distinct enzyme proteins are taken as positive samples and 200 hemoglobin proteins are taken as negative samples. The classifier is then tested with 62 proteins, a mixture of enzymes and hemoglobin proteins. The accuracy of the model on this test set is 98.3871%. Table 5 summarizes the performance measures of the classifier F1 with and without feature selection.
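The reported accuracy is consistent with 61 of the 62 test proteins being classified correctly, since 61/62 ≈ 0.983871. The short Python sketch below computes accuracy and the usual companion measures from a binary confusion matrix. The split into true positives, true negatives, false positives, and false negatives is hypothetical, because the snippet gives only the overall accuracy, and the function name `binary_metrics` is an assumption. The F-score here is the harmonic mean of precision and recall and is unrelated to the classifier label F1 used in the text.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Hypothetical split of 62 test proteins with a single error.
    m = binary_metrics(tp=31, tn=30, fp=1, fn=0)
    print(round(100 * m["accuracy"], 4))
```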
After training and testing of the binary classifier, six
Conclusion and future work
In this paper, a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 classes are considered. To identify the features that contribute significantly to the functional classification of proteins, the SFFS, OFS and SVM-RFE algorithms are applied. The results of all the algorithms show that
Acknowledgement
The authors thank the reviewers for their valuable comments.
References (31)
- et al. Basic local alignment search tool. J. Mol. Biol. (1990)
- et al. Prediction of protein structural classes by support vector machines. Comput. Chem. (2002)
- et al. Protein function classification via support vector machine approach. Math. Biosci. (2003)
- Themes in RNA-protein recognition. J. Mol. Biol. (1999)
- et al. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T 1 ribonucleases. J. Mol. Biol. (1998)
- et al. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J. Mol. Biol. (2001)
- et al. Beyond complete genomes: from sequence to structure and function. Curr. Opin. Struct. Biol. (1998)
- et al. Beyond complete genomes: from sequence to structure and function. Curr. Opin. Struct. Biol. (1998)
- et al. Floating search methods in feature selection. Pattern Recognit. Lett. (1994)
- et al. RNA-binding proteins as regulators of gene expression. Curr. Opin. Genet. Dev. (1997)
- A systematic analysis of performance measures for classification tasks. Inf. Process. Manage.
- Support vector machines for the classification and prediction of β-turn types. J. Pept. Sci.
- Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics
- Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci.
- Gene selection for cancer classification using support vector machines. Mach. Learn.