Elsevier

Biosystems

Volume 113, Issue 1, July 2013, Pages 50-57
Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites

https://doi.org/10.1016/j.biosystems.2013.04.005

Abstract

Predicting protein subcellular location is a meaningful task that has attracted much attention in recent years. Many protein subcellular location predictors that can deal only with single-location proteins have been developed. However, some proteins belong to two or more subcellular locations. It is important to develop predictors that can handle such multiplex proteins, because these proteins have extremely useful implications for both basic biological research and drug discovery. Because the number of methods that deal with multiplex proteins is limited, it is worthwhile to explore new methods that can predict the subcellular locations of proteins with both single and multiple sites. Applying different feature extraction methods and different prediction algorithms to different benchmark datasets may yield results of general relevance. In this paper, two different feature extraction methods and two different neural network models were applied to three benchmark datasets covering different kinds of proteins, i.e. datasets constructed specifically for Gram-positive bacterial proteins, plant proteins and virus proteins. These benchmark datasets have different numbers of location sites. The results show that the RBF neural network is clearly superior to the BP neural network on these datasets, regardless of which type of feature extraction is chosen.

Introduction

Knowledge of protein subcellular locations is very important because the function of a protein and its role in a cell are closely correlated with its subcellular location (Ehrlich et al., 2002, Glory and Murphy, 2007). It is also crucial during drug development. For example, bacteria play an important role in both basic research and drug design, because they are the workhorses of molecular biology, genetics and biochemistry (Xiao et al., 2011b).

With the rapid development of large-scale genome sequencing, large numbers of protein sequences are continuously being generated. Relying on biochemical experiments alone to determine protein subcellular localization is impractical, because these experiments are both resource-intensive and time-consuming. Accordingly, a series of classifiers or predictors have been developed to identify protein subcellular localization (Cai et al., 2010, Chou and Shen, 2006a, Chou and Shen, 2006b, Emanuelsson et al., 2000, Hu et al., 2012a, Hu et al., 2012b, Jin et al., 2008, Lin et al., 2008, Lin et al., 2009, Luo, 2012, Matsuda et al., 2005, Niu et al., 2008, Pierleoni et al., 2006, Shen and Chou, 2007a, Shen and Chou, 2007b, Su et al., 2007, Tejedor-Estrada et al., 2012, Wang et al., 2012, Zhou and Doctor, 2003). All of them can deal only with single-location protein sequences. However, proteins that simultaneously exist at, or move between, different subcellular locations have been widely observed across various kinds of proteins, so it is both interesting and meaningful to focus on the development of multi-location protein classifiers.

There are several studies on multi-label prediction for protein sequences. Gpos-mPLoc (Shen and Chou, 2009) is a software tool for predicting the subcellular localization of Gram-positive bacterial proteins, and it can also handle multiple-location proteins. Plant-mPLoc (Chou and Shen, 2010) uses a top-down strategy to predict single or multiple subcellular localizations of plant proteins. Virus-mPLoc (Shen and Chou, 2010) is a fusion classifier that combines functional domain information, gene ontology information and sequential evolutionary information to predict single or multiple subcellular localizations of virus proteins. To improve prediction quality, three revised versions of these predictors were developed: iLoc-Gpos (Wu et al., 2012), iLoc-Plant (Wu et al., 2011), and iLoc-Virus (Xiao et al., 2011a).

These studies all used the Gene Ontology (GO) database approach, which is based on biological process, cellular component and molecular function (Ashburner et al., 2000, Chou and Shen, 2010). The GO approach is not an ab initio approach but a higher-level approach. In the current study, to keep the approach ab initio, we introduce new applications of algorithms based on features extracted directly from the amino acid sequence, or on evolutionary information derived from the original database, without any prior knowledge. It would also be interesting to study how the algorithms used in this paper perform when combined with the GO approach. Because some difficulties remain in the feature extraction method, we leave this as a topic for future study. For an up-to-date view of the GO approach, readers are referred to the in-depth discussion in Section VI of a recent comprehensive review (Chou, 2013) and Section 3 of a recent paper (Lin et al., 2013).
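To make "features extracted directly from the amino acid sequence" concrete, the sketch below computes plain amino acid composition, the 20-dimensional frequency vector that pseudo amino acid composition extends with sequence-order terms. It is only an illustration of the simplest sequence-derived descriptor, not the feature set used in the paper; the function name `amino_acid_composition` and the example sequence are assumptions made for this example.

```python
# Illustrative only: plain amino acid composition, not the paper's exact features.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the 20 residue frequencies of a protein sequence.

    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    """
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


# Hypothetical toy sequence, chosen only to show the output shape.
features = amino_acid_composition("MKKLLA")
print(len(features))  # 20
```

Because the vector depends only on the sequence itself, it needs no annotation database, which is the sense in which such descriptors are ab initio.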

Because training the BP neural network is very time-consuming, we chose only three feature extraction methods that have proved efficient and representative in recent studies. We are continuing to improve our prediction engines so that more feature extraction methods can be compared more easily.

Usually, to construct a highly credible and powerful classifier for protein prediction, the following elements are needed (Chou, 2011): (1) a rigorous benchmark dataset, (2) proper descriptors of the protein data, (3) a suitable algorithm or method, and (4) an evaluation criterion. In this paper, following these rules, we built two multi-label learning models and applied them to three multi-location protein benchmark datasets. The first model is a multi-label neural network derived from the common BP neural network by modifying it in two respects (Zhang, 2006). The second model is another multi-label neural network, derived from the common RBF algorithm (Zhang, 2009). It has two layers. In the first layer, K-means clustering is applied to the examples of each class, and the prototype vectors of the basis functions are set to the centroids of the resulting clusters (Zhang, 2009). The second-layer weights are then determined by minimizing a sum-of-squares error function, so that the information contained in the source vectors can be fully absorbed during optimization (Zhang, 2009). Application to three rigorous benchmark datasets shows that the RBF neural network is superior to the other, well-developed multi-label neural network regardless of which type of feature extraction is chosen.
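The two-layer construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian basis function, the fixed width `sigma`, the cluster fraction `alpha`, the bias column, the -1/+1 label coding, and all function names are choices made for this example.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns k centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # an empty cluster keeps its old centroid
                centers[j] = members.mean(axis=0)
    return centers


def rbf_design(X, centers, sigma):
    """Gaussian basis activations with a leading constant (bias) column."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), phi])


def fit_ml_rbf(X, Y, alpha=0.5, sigma=1.0):
    """X: (n, d) features; Y: (n, q) labels coded as -1/+1 (assumed coding)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    centers = []
    # First layer: cluster the positive examples of each class separately.
    for label in range(Y.shape[1]):
        positives = X[Y[:, label] == 1]
        if len(positives) == 0:
            continue
        k = max(1, int(round(alpha * len(positives))))
        centers.append(kmeans(positives, k))
    centers = np.vstack(centers)
    # Second layer: weights minimizing the sum-of-squares error.
    phi = rbf_design(X, centers, sigma)
    weights, *_ = np.linalg.lstsq(phi, Y, rcond=None)
    return centers, weights


def predict_ml_rbf(X, centers, weights, sigma=1.0):
    """Assign every label whose output score is positive."""
    scores = rbf_design(np.asarray(X, dtype=float), centers, sigma) @ weights
    return np.where(scores > 0, 1, -1)
```

Clustering each class separately means every location contributes its own prototypes, so a protein with two locations can activate basis functions from both classes and receive both labels.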

Section snippets

Gram-positive bacterial protein benchmark dataset

We use the same dataset S1 that was used to construct Gpos-mPLoc (Shen and Chou, 2009) as the benchmark dataset for this paper. This dataset includes both singleplex and multiplex proteins and was established specifically for Gram-positive bacterial proteins. It covers 4 subcellular location sites and contains 519 Gram-positive bacterial protein sequences; of these 519 proteins, 515 belong to only 1 location, 4 belong to 2 locations, and none belongs to three or more locations. The detailed
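As a quick check on the counts quoted above, the short calculation below tallies the proteins and the protein-location pairs, assuming each protein contributes one pair per location it belongs to. The variable names are illustrative only.

```python
# Counts quoted in the text: number of locations -> number of proteins.
proteins_per_location_count = {1: 515, 2: 4}

# Distinct proteins in the benchmark dataset.
n_proteins = sum(proteins_per_location_count.values())

# Protein-location pairs: a protein with two locations is counted twice.
n_pairs = sum(k * n for k, n in proteins_per_location_count.items())

# Average number of locations per protein.
label_cardinality = n_pairs / n_proteins

print(n_proteins)                   # 519
print(n_pairs)                      # 523
print(round(label_cardinality, 4))  # 1.0077
```

With an average of about 1.008 locations per protein, this dataset is almost entirely single-label.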

Results and discussion

In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical applications: the independent dataset test, the subsampling test, and the jackknife test (Chou et al., 2011, Chou et al., 2012, He et al., 2010, Hu et al., 2011, Wu et al., 2011, Wu et al., 2012, Xiao et al., 2011a, Xiao et al., 2011b). Since the subsampling test and the jackknife test can be performed with one benchmark dataset and that the independent dataset
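For reference, the jackknife test mentioned above leaves out each sample in turn, trains on the remaining samples, and predicts the one left out. The sketch below shows that loop with an exact-match score for multi-label outputs; the nearest-neighbour predictor is only a stand-in for whichever classifier is being examined, and the function names and scoring choice are assumptions for this example.

```python
import numpy as np


def nearest_neighbour_predict(X_train, Y_train, x_test):
    """Stand-in predictor: copy the label set of the closest training sample."""
    d2 = ((X_train - x_test) ** 2).sum(axis=1)
    return Y_train[d2.argmin()]


def jackknife_exact_match(X, Y, predict_one):
    """Leave-one-out loop; returns the fraction of samples whose
    predicted label set matches the true label set exactly."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    n = len(X)
    hits = 0
    for i in range(n):
        keep = np.arange(n) != i  # every sample except the i-th
        predicted = predict_one(X[keep], Y[keep], X[i])
        hits += int(np.array_equal(predicted, Y[i]))
    return hits / n
```

Because every sample is left out exactly once, the result does not depend on how the data are split, unlike a subsampling test.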

Conclusions

In this paper, six combinations of feature extraction method and prediction algorithm were evaluated on three multi-label benchmark datasets. The prediction algorithms are two neural networks, each revised from a traditional neural network to handle multi-label prediction tasks. Comparative studies on the three benchmark datasets show that the RBF neural network achieves rather competitive performance relative to the BP neural network regardless of which type of feature extraction is used, and the

Ethical standards

The experiments comply with the current laws of the country in which they were performed.

Acknowledgments

C. Huang would like to express his thanks to the reviewers for their suggestions. This study was supported by the Doctoral Program of Higher Education of China (Grant No. 20110073110018).

References (64)

  • S. Mei. Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. J. Theor. Biol. (2012)
  • H. Mohabatkar et al. Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. J. Theor. Biol. (2011)
  • H.B. Shen et al. PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. (2008)
  • X. Xiao et al. iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol. (2011)
  • M.L. Zhang et al. Feature selection for multi-label naive Bayes classification. Inform. Sci. (2009)
  • M.L. Zhang et al. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. (2007)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
  • M. Ashburner et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000)
  • Y.D. Cai et al. Predicting subcellular location of proteins using integrated-algorithm method. Mol. Divers. (2010)
  • D.S. Cao et al. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics (2013)
  • W. Chen et al. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. (2013)
  • W. Chen et al. iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE (2012)
  • K.C. Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins (2001)
  • K.C. Chou. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. (2013)
  • K.C. Chou et al. Large-scale predictions of Gram-negative bacterial protein subcellular locations. J. Proteome Res. (2006)
  • K.C. Chou et al. Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protocols (2008)
  • K.C. Chou et al. REVIEW: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. (2009)
  • K.C. Chou et al. Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE (2010)
  • K.C. Chou et al. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE (2011)
  • K.C. Chou et al. iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst. (2012)
  • M. Hayat et al. Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC. Protein Pept. Lett. (2012)
  • Z. He et al. Predicting drug-target interaction networks based on functional groups and biological features. PLoS ONE (2010)