A novel algorithm combining support vector machine with the discrete wavelet transform for the prediction of protein subcellular localization

https://doi.org/10.1016/j.compbiomed.2011.11.006

Abstract

Knowing the subcellular localization of a protein within the cell is an important step in elucidating its role in biological processes, its function and its potential as a drug target for disease diagnosis. As the number of complete genomes rapidly increases, accurate and efficient methods that automatically predict subcellular localization are becoming more urgently needed. In the current paper, we developed a novel method that couples the discrete wavelet transform with a support vector machine, based on amino acid polarity, to predict the subcellular localizations of prokaryotic and eukaryotic proteins. The results obtained by the jackknife test were promising and indicated that the proposed method remarkably improved the prediction accuracy of subcellular locations and could serve as an effective high-throughput method in subcellular localization research.

Introduction

As the number of new genomes has risen sharply in recent years, the problem of protein function prediction has once again come to the forefront. Subcellular localization is a key functional characteristic of potential gene products such as proteins [1]. The prediction of protein subcellular location is currently a very active topic in molecular biology because it bears on three essential features of a protein: its biological objective, its biochemical activity, and the place in the cell where the gene product is active. Comprehensive knowledge of the subcellular localization of proteins is therefore essential for understanding their roles and interacting partners in cellular metabolism. However, the traditional way to annotate protein subcellular localization is by biochemical experiments, which are both time-consuming and expensive and cannot keep up with the large number of sequences that continue to emerge from genome sequencing projects. To bridge this gap, it is necessary to develop fast, accurate and genome-scale computational methods for predicting the subcellular localization of proteins.

Several theoretical and computational methods have been developed over the past decade for predicting the subcellular localization of proteins. Most of the existing prediction methods can be broadly classified into four categories according to their input data [2]: (1) methods based on sorting signals, which rely on the presence of protein targeting or signal peptides [3]; (2) methods based on lexical analysis of keywords (LOCkey) from the functional annotation of proteins [4]; (3) methods based on phylogenetic profiles [5], domain projection [6] or a combination of evolutionary and structural information; (4) methods based on differences in the amino acid composition or amino acid properties of proteins [7], [8], [9], [10]. This paper focuses on the last category.

Much progress has previously been made in the computational prediction of protein subcellular localization using sequence-based information. Nakashima and Nishikawa first proposed a method based on amino acid composition and residue-pair frequencies to discriminate between intracellular and extracellular proteins [7]. Subsequently, Chou and Elrod also used the amino acid composition, but employed the covariant discriminant algorithm in their method [13]. Other studies using different algorithms, such as a neural network model [14], a Markov chain model [12] and the support vector machine [11], showed that amino acid composition was closely related to protein subcellular localization. To further improve the predictive quality, Chou proposed the pseudo-amino acid composition to take the effect of amino acid order into account [8]. Furthermore, Cai and Chou suggested a hybrid approach integrating the pseudo-amino acid composition, the functional domain composition [15], [16], and gene ontology information [17]. These studies indicated that incorporating amino acid order as well as amino acid composition made it possible to improve prediction performance. Recently, a sequence representation method using multi-scale energy was established to predict the subcellular location based on the concept of Chou's pseudo-amino acid composition [18]. Developing a more effective method to predict subcellular location attributes from sequence information can not only save time but can also aid the design of drugs for treating diseases related to defects in subcellular location attributes. Hence, it has become a crucial issue to complement the existing methods and enhance the quality of protein subcellular localization prediction by selecting more informative features.

In this paper, a novel model (DWT–SVM) is proposed that combines the discrete wavelet transform (DWT) with the support vector machine (SVM), based on amino acid polarity, to predict the subcellular localization of proteins. First, the amino acid sequences of the proteins were transformed into sequences of polarity energies per residue. Then, the polarity profile was decomposed into wavelet coefficients using DWT. Subsequently, a series of statistical feature vectors were constructed from these coefficients to represent the protein sequences. Finally, SVM was applied to handle the multi-class classification problem. The predictive results of the jackknife test show significant improvement over previous algorithms, and hence the methodology presented in the current study could effectively complement the existing subcellular localization prediction methods and assist in the development of automated genome annotation tools.
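The first three steps of this pipeline (polarity encoding, wavelet decomposition, statistical summarization) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the `POLARITY` table holds illustrative values only, since this excerpt does not say which polarity scale was used; the Haar wavelet is swapped in because it is the simplest to write by hand, not because the paper is known to use it; and the four statistics per band (maximum, mean, minimum, standard deviation) and all function names are hypothetical choices. The SVM classification and jackknife evaluation steps are omitted.

```python
import math
import statistics

# Illustrative per-residue polarity values (assumption: the excerpt does not
# state which polarity scale the authors used).
POLARITY = {
    "A": 8.1, "R": 10.5, "N": 11.6, "D": 13.0, "C": 5.5,
    "Q": 10.5, "E": 12.3, "G": 9.0, "H": 10.4, "I": 5.2,
    "L": 4.9, "K": 11.3, "M": 5.7, "F": 5.2, "P": 8.0,
    "S": 9.2, "T": 8.6, "W": 5.4, "Y": 6.2, "V": 5.9,
}


def polarity_profile(sequence):
    """Map a protein sequence to per-residue polarity values.

    Residues missing from the table (e.g. 'X') are skipped.
    """
    return [POLARITY[aa] for aa in sequence.upper() if aa in POLARITY]


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        # Pad odd-length input by repeating the last value.
        signal = signal + [signal[-1]]
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level decomposition: [detail_1, ..., detail_L, approximation_L]."""
    bands = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # signal too short for further decomposition
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def band_statistics(band):
    """Summarize one coefficient band as [max, mean, min, std]."""
    return [max(band), statistics.fmean(band), min(band), statistics.pstdev(band)]


def dwt_features(sequence, levels=3):
    """Feature vector with four statistics per band.

    The length is 4 * (levels + 1) whenever the sequence is long enough
    to support the requested number of levels.
    """
    profile = polarity_profile(sequence)
    if not profile:
        raise ValueError("sequence contains no residues with a polarity value")
    features = []
    for band in haar_dwt(profile, levels):
        features.extend(band_statistics(band))
    return features


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to exercise the functions.
    print(dwt_features("MKKLLPTAAAGLLLLAAQPAMA", levels=3))
```

Summarizing each coefficient band with a few statistics is what gives sequences of different lengths a feature vector of the same size, which is the property a classifier such as an SVM needs from its input.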

Section snippets

Data sets

Fig. 1 shows the flowchart of the proposed approach, which combines the DWT with the SVM algorithm to predict protein subcellular localization. As presented in Fig. 1, two benchmark protein datasets were used in this study. The first, the NNPSL dataset, which served as the training dataset, was originally constructed by Reinhardt and Hubbard [14]. It included 997 prokaryotic sequences, which were classified into three location categories (688 cytoplasm, 107 extracellular and 202 periplasm) and 2427

Selection of wavelet function

The wavelet transform is based on the idea of mapping a signal onto a set of basis functions. Different basis functions give rise to different wavelet families, and each family has properties suited to different signals and yields different results [36], [52]. Because the characteristics of the analyzing wavelet influence the performance of the DWT, the better the analyzing wavelet matches the underlying structure in the signal, the better the feature values that can be extracted from the sequences.
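The effect of the basis function can be seen in a small, self-contained comparison. The sketch below does not reproduce the authors' selection procedure. It only illustrates the general point by applying two well-known filters, Haar and the four-tap Daubechies filter (often called db2), to the same linear ramp. The function names, the test signal, and the choice to keep only windows that fit fully inside the signal (no boundary extension) are assumptions made for the illustration, and filter ordering conventions differ between libraries.

```python
import math

SQRT2 = math.sqrt(2.0)
SQRT3 = math.sqrt(3.0)

# Low-pass (scaling) filters of two standard orthogonal wavelets.
HAAR_LOW = [1 / SQRT2, 1 / SQRT2]
DB2_LOW = [
    (1 + SQRT3) / (4 * SQRT2),
    (3 + SQRT3) / (4 * SQRT2),
    (3 - SQRT3) / (4 * SQRT2),
    (1 - SQRT3) / (4 * SQRT2),
]


def high_pass_from_low_pass(low_pass):
    """Quadrature mirror construction: g[k] = (-1)^k * h[L - 1 - k]."""
    n = len(low_pass)
    return [(-1) ** k * low_pass[n - 1 - k] for k in range(n)]


def detail_coefficients(signal, high_pass):
    """Detail band of a one-level DWT as a sliding inner product with stride 2.

    Only windows lying entirely inside the signal are kept, so no boundary
    extension is applied (a simplification made for this sketch).
    """
    width = len(high_pass)
    return [
        sum(g * signal[start + k] for k, g in enumerate(high_pass))
        for start in range(0, len(signal) - width + 1, 2)
    ]


def band_energy(band):
    """Sum of squared coefficients in a band."""
    return sum(c * c for c in band)


if __name__ == "__main__":
    ramp = [0.5 * i for i in range(16)]  # a purely linear test signal
    for name, low in (("haar", HAAR_LOW), ("db2", DB2_LOW)):
        detail = detail_coefficients(ramp, high_pass_from_low_pass(low))
        print(name, "detail energy:", band_energy(detail))
```

The db2 filter has two vanishing moments, so its detail coefficients vanish on a linear trend up to rounding error, while the Haar detail coefficients do not. A wavelet that matches the signal's structure in this way concentrates the information in fewer coefficients, which is one reason the choice of wavelet affects the extracted features.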

Conclusion

In this paper, a new method that integrates DWT with SVM for protein subcellular localization prediction is presented. DWT, the novel feature extraction method based on amino acid polarity, can reduce the dimension of the input vector, improve computational efficiency and more effectively reflect the overall sequence-order features of a protein. Furthermore, the SVM method can easily deal with high-dimensional data and incorporate other useful features. The overall accuracies for prokaryotic and

Conflict of interest statement

The author has no conflicts of interest concerning this work.

Acknowledgment

This work was supported by grants from the National Natural Science Foundation of China (20605010, 21065006 and 21175064).

References (58)

  • K.C. Chou, Low-frequency motions in protein molecules: beta-sheet and beta-barrel, Biophys. J. (1985)
  • X.Q. Li et al., Polarity and hydrophobicity interactions in protein synthesis process, J. Theor. Biol. (2006)
  • A. Kandaswamy et al., Neural classification of lung sounds using wavelet coefficients, Comput. Biol. Med. (2004)
  • Y.D. Cai et al., Support vector machines for predicting the specificity of GalNAc-transferase, Peptides (2002)
  • B.W. Matthews, Comparison of predicted and observed secondary structure of T4 phage lysozyme, Biochim. Biophys. Acta (1975)
  • T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. (2006)
  • J.D. Qiu et al., Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform, Anal. Biochem. (2009)
  • D.F. Li et al., Construction of a class of Daubechies type wavelet bases, Chaos Solitons Fractals (2009)
  • K.C. Chou et al., Recent progress in protein subcellular location prediction, Anal. Biochem. (2007)
  • C. Guda et al., pTARGET: a new method for predicting protein subcellular localization in eukaryotes, Bioinformatics (2005)
  • K. Nakai et al., PSORT: a program for detecting the sorting signals of proteins and predicting their subcellular localization, Trends Biochem. Sci. (1999)
  • R. Nair et al., Inferring sub-cellular localization through automated lexical analysis, Bioinformatics (2002)
  • E.M. Marcotte et al., Localizing proteins in the cell from their phylogenetic profiles, Proc. Natl. Acad. Sci. USA (2000)
  • R. Mott et al., Predicting protein cellular localization using a domain projection method, Genome Res. (2002)
  • K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins Struct. Funct. Genet. (2001)
  • N. Niu et al., Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins, Mol. Divers. (2008)
  • S.J. Hua et al., Support vector machine approach for protein subcellular localization prediction, Bioinformatics (2001)
  • A. Reinhardt et al., Using neural networks for prediction of the subcellular location of proteins, Nucleic Acids Res. (1998)
  • Y.D. Cai et al., Predicting subcellular localization of proteins in a hybridization space, Bioinformatics (2004)