A novel algorithm combining support vector machine with the discrete wavelet transform for the prediction of protein subcellular localization
Introduction
The sharp rise in the number of newly sequenced genomes in recent years has once again brought the problem of protein function prediction to the forefront. Subcellular localization is a key functional characteristic of potential gene products such as proteins [1]. The prediction of protein subcellular location is currently a very active topic in molecular biology because it bears on three essential features of a protein: its biological objective, its biochemical activity, and the place in the cell where the gene product is active. Comprehensive knowledge of the subcellular localization of proteins is therefore essential for understanding their roles and interacting partners in cellular metabolism. However, the traditional way to annotate protein subcellular localization is by biochemical experiments, which are both time-consuming and expensive and so cannot keep up with the large number of sequences that continue to emerge from genome sequencing projects. To bridge this gap, it is necessary to develop fast, accurate, genome-scale computational methods for predicting the subcellular localization of proteins.
Several theoretical and computational methods have been developed over the past decade for predicting the subcellular localization of proteins. Most of the existing prediction methods can be broadly classified into four categories according to their input data [2]: (1) methods based on sorting signals, which rely on the presence of protein targeting or signal peptides [3]; (2) methods based on lexical analysis of keywords (LOCkey) from the functional annotation of proteins [4]; (3) methods based on the use of phylogenetic profiles [5], domain projection [6] or a combination of evolutionary and structural information; (4) methods based on differences in the amino acid composition or amino acid properties of proteins [7], [8], [9], [10]. In this paper, we focus on the last category.
Much progress has already been made in the computational prediction of protein subcellular localization from sequence-based information. Nakashima and Nishikawa first proposed a method based on amino acid composition and residue-pair frequencies to discriminate between intracellular and extracellular proteins [7]. Subsequently, Chou and Elrod also used the amino acid composition, but employed the covariant discriminant algorithm in their method [13]. Other studies using different algorithms, such as the neural network model [14], the Markov chain model [12] and the support vector machine [11], showed that amino acid composition is closely related to protein subcellular localization. To further improve the predictive quality, Chou proposed the pseudo-amino acid composition to take the effect of amino acid order into account [8]. Furthermore, Cai and Chou suggested a hybrid approach integrating the pseudo-amino acid composition, the functional domain composition [15], [16], and gene ontology information [17]. These studies indicated that incorporating amino acid order as well as amino acid composition makes it possible to improve prediction performance. Recently, a sequence representation method using multi-scale energy was established to predict the subcellular location based on the concept of Chou's pseudo-amino acid composition [18]. Developing a more effective method to predict subcellular location attributes from sequence information would not only save time, but could also aid the design of drugs for treating diseases related to defects in subcellular location attributes. Hence, it has become a crucial issue to complement the existing methods and enhance the quality of protein subcellular localization prediction by selecting more informative features.
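To make the earliest of these sequence-based descriptors concrete, the following minimal sketch computes an amino acid composition vector and adjacent residue-pair frequencies for a toy sequence. The function names and the example sequence are illustrative assumptions, not taken from the cited works, and the exact definitions used in those works may differ.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid (20 values)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


def residue_pair_frequencies(sequence):
    """Return relative frequencies of adjacent residue pairs as a dict.

    Non-standard symbols are dropped before pairing (a simplification).
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    pairs = [a + b for a, b in zip(residues, residues[1:])]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


# Hypothetical toy sequence, for illustration only.
example = "MKKLLA"
composition = amino_acid_composition(example)
pair_freqs = residue_pair_frequencies(example)
```

The composition vector discards residue order entirely, which is the limitation that pseudo-amino acid composition and the wavelet-based features discussed in this paper aim to address.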
In this paper, a novel model (DWT–SVM) is proposed that combines the discrete wavelet transform (DWT) with the support vector machine (SVM), based on amino acid polarity, to predict the subcellular localization of proteins. First, the amino acid sequence of each protein was transformed into a sequence of polarity energies per residue. Then, the polarity profile was decomposed into wavelet coefficients using DWT. Subsequently, statistical feature vectors were computed from the wavelet coefficients to represent the protein sequences. Finally, SVM was applied to handle the multi-class classification problem. The predictive results of the jackknife test show a significant improvement over previous algorithms, and hence the methodology presented in the current study could effectively complement the existing subcellular localization prediction methods and assist in the development of automated genome annotation tools.
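The first three steps of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the polarity scale (Grantham's values are used as a stand-in), the Haar wavelet, the three decomposition levels, and the choice of four statistics per sub-band are all assumptions, since this section does not specify them. The final SVM step is omitted to keep the sketch dependency-free.

```python
import math
import statistics

# Assumption: Grantham polarity values as a stand-in scale; the scale
# actually used in the paper is not given in this section.
POLARITY = {
    "A": 8.1, "R": 10.5, "N": 11.6, "D": 13.0, "C": 5.5,
    "Q": 10.5, "E": 12.3, "G": 9.0, "H": 10.4, "I": 5.2,
    "L": 4.9, "K": 11.3, "M": 5.7, "F": 5.2, "P": 8.0,
    "S": 9.2, "T": 8.6, "W": 5.4, "Y": 6.2, "V": 5.9,
}


def polarity_profile(sequence):
    """Step 1: map each standard residue to its polarity value."""
    return [POLARITY[r] for r in sequence.upper() if r in POLARITY]


def haar_step(signal):
    """One level of the Haar DWT; odd lengths are padded by repeating the last sample."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_decompose(signal, levels):
    """Step 2: return [detail_1, ..., detail_n, approx_n] for up to `levels` levels."""
    bands = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # signal too short for a further level
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def band_statistics(band):
    """Summarize one sub-band by its max, min, mean and standard deviation."""
    return [max(band), min(band), statistics.fmean(band), statistics.pstdev(band)]


def feature_vector(sequence, levels=3):
    """Step 3: concatenate the statistics of every sub-band."""
    profile = polarity_profile(sequence)
    if not profile:
        raise ValueError("sequence contains no standard residues")
    features = []
    for band in wavelet_decompose(profile, levels):
        features.extend(band_statistics(band))
    return features


# Hypothetical toy sequence, for illustration only.
features = feature_vector("MKKLLPTAAAGLLLLAAQPAMA", levels=3)
```

With three levels there are four sub-bands, so sequences long enough to support all three levels yield 16 features regardless of their length. Fixed-length vectors of this kind are what a classifier such as an SVM requires as input.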
Section snippets
Data sets
Fig. 1 shows the flowchart of the proposed approach, which combines the DWT with the SVM algorithm to predict protein subcellular localization. As presented in Fig. 1, two benchmark protein datasets were used in this study. The first, the NNPSL dataset, which served as the training dataset, was originally constructed by Reinhardt and Hubbard [14]. It included 997 prokaryotic sequences, which were classified into three location categories (688 cytoplasm, 107 extracellular and 202 periplasm) and 2427
Selection of wavelet function
The wavelet transform is based on the idea of mapping a signal onto a set of basis functions. Different basis functions give rise to different wavelet families, and each family has properties that suit different signals and yields different results [36], [52]. Because the characteristics of the analyzing wavelet influence the performance of the DWT, the better the analyzing wavelet matches the underlying structure of the signal, the better the feature values that can be extracted from the sequences.
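As a small illustration of how the choice of analyzing wavelet changes the coefficients, the sketch below applies a single-level DWT with periodic extension to a linear ramp using two orthogonal filters, Haar and Daubechies db2. The test signal, the two wavelets compared, and the function names are assumptions chosen for illustration; they are not the candidates or the selection criterion used in the paper.

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT3 = math.sqrt(3.0)

# Low-pass (scaling) filters of two orthogonal wavelets.
FILTERS = {
    "haar": [1 / SQRT2, 1 / SQRT2],
    "db2": [
        (1 + SQRT3) / (4 * SQRT2),
        (3 + SQRT3) / (4 * SQRT2),
        (3 - SQRT3) / (4 * SQRT2),
        (1 - SQRT3) / (4 * SQRT2),
    ],
}


def high_pass(low):
    """Quadrature mirror relation: g[k] = (-1)^k * h[L - 1 - k]."""
    n = len(low)
    return [(-1) ** k * low[n - 1 - k] for k in range(n)]


def dwt_periodic(signal, low):
    """Single-level DWT with periodic (wrap-around) extension; even length required."""
    n = len(signal)
    if n % 2:
        raise ValueError("signal length must be even")
    high = high_pass(low)
    approx, detail = [], []
    for i in range(0, n, 2):
        approx.append(sum(h * signal[(i + k) % n] for k, h in enumerate(low)))
        detail.append(sum(g * signal[(i + k) % n] for k, g in enumerate(high)))
    return approx, detail


def interior_detail(signal, low):
    """Detail coefficients whose filter window does not wrap around the end."""
    _, detail = dwt_periodic(signal, low)
    unaffected = (len(signal) - len(low)) // 2 + 1
    return detail[:unaffected]


# A linear ramp: db2 has two vanishing moments, so its interior detail
# coefficients vanish on this signal, while the Haar ones do not.
ramp = [float(i) for i in range(16)]
max_interior = {
    name: max(abs(d) for d in interior_detail(ramp, low))
    for name, low in FILTERS.items()
}
```

Here db2 represents the linear trend entirely in the approximation band away from the boundary, whereas Haar leaves a constant residue in every detail coefficient. A real polarity profile is not a ramp, so in practice candidate wavelets would be compared by their effect on prediction performance rather than by this toy criterion.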
Conclusion
In this paper, a new method that integrates DWT with SVM for protein subcellular localization prediction is presented. DWT, used here as a novel feature extraction method based on amino acid polarity, can reduce the dimension of the input vector, improve computational efficiency and more effectively reflect the overall sequence-order features of a protein. Furthermore, the SVM method can easily deal with high-dimensional data and incorporate other useful features. The overall accuracies for prokaryotic and
Conflict of interest statement
The author has no conflict of interests concerning this work.
Acknowledgment
This work was supported by Grants from the National Natural Science Foundation of China (20605010, 21065006 and 21175064).
References (58)
- et al., Wanted: subcellular localization of proteins based on sequence, Trends Cell Biol. (1998)
- et al., Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies, J. Mol. Biol. (1994)
- et al., Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization, Biochem. Biophys. Res. Commun. (2006)
- Prediction of protein subcellular locations using Markov chain models, FEBS Lett. (1999)
- et al., Using discriminant function for prediction of subcellular location of prokaryotic proteins, Biochem. Biophys. Res. Commun. (1998)
- et al., A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology, Biochem. Biophys. Res. Commun. (2003)
- et al., Wavelet transformation of protein hydrophobicity sequences suggests their memberships in structural families, Physica A (1997)
- et al., Prediction of membrane protein types by means of wavelet analysis and cascaded neural networks, J. Theor. Biol. (2008)
- et al., Prediction of protein secondary structure based on continuous wavelet transform, Talanta (2003)
- Low-frequency collective motion in biomacromolecules and its biological functions, Biophys. Chem. (1988)
- Low-frequency motions in protein molecules: beta-sheet and beta-barrel, Biophys. J.
- Polarity and hydrophobicity interactions in protein synthesis process, J. Theor. Biol.
- Neural classification of lung sounds using wavelet coefficients, Comput. Biol. Med.
- Support vector machines for predicting the specificity of GalNAc-transferase, Peptides
- Comparison of predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta
- An introduction to ROC analysis, Pattern Recogn. Lett.
- Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform, Anal. Biochem.
- Construction of a class of Daubechies type wavelet bases, Chaos Soliton Fractals
- Recent progress in protein subcellular location prediction, Anal. Biochem.
- pTARGET: a new method for predicting protein subcellular localization in eukaryotes, Bioinformatics
- PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization, Trends Biochem. Sci.
- Inferring sub-cellular localization through automated lexical analysis, Bioinformatics
- Localizing proteins in the cell from their phylogenetic profiles, Proc. Natl. Acad. Sci. USA
- Predicting protein cellular localization using a domain projection method, Genome Res.
- Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins Struct. Funct. Genet.
- Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins, Mol. Divers.
- Support vector machine approach for protein subcellular localization prediction, Bioinformatics
- Using neural networks for prediction of the subcellular location of proteins, Nucleic Acids Res.
- Predicting subcellular localization of proteins in a hybridization space, Bioinformatics