Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites
Introduction
Knowledge of protein subcellular locations is important because the function of a protein and its role in a cell are closely correlated with its subcellular location (Ehrlich et al., 2002, Glory and Murphy, 2007). It is also valuable in the process of drug development. For example, bacteria play an important role in both basic research and drug design, owing to the fact that they are the workhorses for the fields of molecular biology, genetics and biochemistry (Xiao et al., 2011b).
With the rapid development of large-scale genome sequencing, large numbers of protein sequences are continuously being generated. Relying on biochemical experiments alone to determine protein subcellular localization is impractical, because these experiments are both resource-intensive and time-consuming. Accordingly, a series of classifiers or predictors have been developed to identify protein subcellular localization (Cai et al., 2010, Chou and Shen, 2006a, Chou and Shen, 2006b, Emanuelsson et al., 2000, Hu et al., 2012a, Hu et al., 2012b, Jin et al., 2008, Lin et al., 2008, Lin et al., 2009, Luo, 2012, Matsuda et al., 2005, Niu et al., 2008, Pierleoni et al., 2006, Shen and Chou, 2007a, Shen and Chou, 2007b, Su et al., 2007, Tejedor-Estrada et al., 2012, Wang et al., 2012, Zhou and Doctor, 2003). All of them can deal only with single-location protein sequences. However, proteins that simultaneously exist at, or move between, different subcellular locations have been widely observed in various kinds of proteins, so it is interesting and meaningful to focus on the development of multi-location protein classifiers.
There are several studies on multi-label prediction dealing with protein sequences. Gpos-mPloc (Shen and Chou, 2009) is a software tool for predicting the subcellular localization of Gram-positive bacterial proteins; it can also deal with multiple-location proteins. Plant-mPloc (Chou and Shen, 2010) uses a top-down strategy to predict single or multiple subcellular localizations of plant proteins. Virus-mPloc (Shen and Chou, 2010) is a fusion classifier developed by combining functional domain information, gene ontology information and sequential evolutionary information to predict single or multiple subcellular localizations of viral proteins. To improve the prediction quality, three revised editions based on these predictors were developed: iLoc-Gpos (Wu et al., 2012), iLoc-Plant (Wu et al., 2011), and iLoc-Virus (Xiao et al., 2011a).
These studies all used the GO (Gene Ontology) database method, which is based on biological process, cellular component and molecular function (Ashburner et al., 2000, Chou and Shen, 2010). The GO approach is not an ab initio approach but a higher-level approach. In the current study, for the ab initio purpose, we introduce new applications of algorithms based on features extracted directly from the amino acid sequence, or on evolutionary information derived from its primitive database, without any prior knowledge. It would also be interesting to study the performance of the algorithms used in this paper when adapted to the GO approach. Because some difficulties remain in the feature extraction method, this will be a topic for our future research. To provide readers with an updated view of the GO approach, an in-depth discussion is given in Section VI of a recent comprehensive review (Chou, 2013) and Section 3 of a recent paper (Lin et al., 2013).
Because using a BP neural network is very time-consuming, we chose only three feature extraction methods that have proved efficient and representative in recent studies. We are still working to improve our prediction engines so that they can be compared more easily across more feature extraction methods.
Usually, to construct a highly credible and powerful classifier for protein prediction, the following rules need to be followed (Chou, 2011): (1) a rigorous benchmark dataset, (2) proper descriptors of protein data, (3) the model of the algorithm or method, and (4) the evaluation criterion. In this paper, we built two multi-label learning models to predict three multi-location protein benchmark datasets according to the rules mentioned above. The first model is a multi-label neural network derived from the common BP neural network by modifying it in two respects (Zhang, 2006). The second model is another multi-label neural network derived from the common RBF algorithm (Zhang, 2009). It has two layers. In the first layer, K-means clustering is applied to the examples of every single class, and the prototype vectors of the basis functions are set to the centroids of the clustered groups (Zhang, 2009). The second-layer weights are then determined by minimizing a sum-of-squares error function, so that the information included in the source vectors can be fully absorbed during the optimization (Zhang, 2009). Application to three rigorous benchmark datasets shows that the RBF neural network is superior to the other well-developed multi-label neural network no matter which type of feature extraction is chosen.
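The two-layer construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cluster fraction `alpha`, the width scaling `mu`, the +1/-1 target coding, and the decision threshold of 0 are all assumptions made for illustration. The first layer runs K-means separately on the examples of each class and uses the centroids as Gaussian basis centres; the second layer is fitted by linear least squares, which minimizes the sum-of-squares error.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns k centroids of the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def design_matrix(X, centers, sigma):
    """Gaussian basis activations with a constant bias column prepended."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), phi])

def fit_ml_rbf(X, Y, alpha=0.1, mu=1.0):
    """X: (n, d) features; Y: (n, q) binary matrix, 1 = protein found at site.

    alpha (clusters per class as a fraction of class size) and mu (width
    scaling) are illustrative hyperparameters, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    centers = []
    for label in range(Y.shape[1]):
        X_l = X[Y[:, label] == 1]  # examples carrying this label
        if len(X_l) == 0:
            continue
        k = max(1, int(round(alpha * len(X_l))))
        centers.append(kmeans(X_l, k, seed=label))
    centers = np.vstack(centers)
    # Shared width: mu times the mean distance between distinct prototypes.
    m = len(centers)
    pair = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    sigma = mu * pair.sum() / (m * (m - 1)) if m > 1 else 1.0
    if sigma <= 0:
        sigma = 1.0
    # Second layer: least squares minimizes the sum-of-squares error
    # between network outputs and +1/-1 targets (assumed coding).
    T = 2.0 * Y - 1.0
    W, *_ = np.linalg.lstsq(design_matrix(X, centers, sigma), T, rcond=None)
    return centers, sigma, W

def predict_ml_rbf(X, model):
    """Assign a label wherever the corresponding output is positive."""
    centers, sigma, W = model
    X = np.asarray(X, dtype=float)
    return (design_matrix(X, centers, sigma) @ W > 0).astype(int)
```

Calling `fit_ml_rbf(X, Y)` on a feature matrix and a binary label matrix returns a model tuple that `predict_ml_rbf` turns into a 0/1 label matrix of the same shape as `Y`, so a protein can receive one label or several. Solving the second layer in closed form avoids iterative gradient training, which is one reason an RBF network can be much cheaper to fit than a BP network.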
Gram-positive bacterial protein benchmark dataset
We use the same dataset S1 that was used in constructing Gpos-mPloc (Shen and Chou, 2009) as the benchmark dataset for this paper. This dataset includes both singleplex and multiplex proteins and was established specifically for Gram-positive bacterial proteins. It contains 519 Gram-positive bacterial protein sequences covering 4 subcellular location sites; of the 519 different proteins, 515 belong to only 1 location, 4 to 2 locations, and none to 3 or more locations. The detailed
Results and discussion
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical applications: independent dataset test, sub-sampling test, and jackknife test (Chou et al., 2011, Chou et al., 2012, He et al., 2010, Hu et al., 2011, Wu et al., 2011, Wu et al., 2012, Xiao et al., 2011a, Xiao et al., 2011b). Since the subsampling test and the jackknife test can be performed with one benchmark dataset and that the independent dataset
Conclusions
In this paper, six combinations of feature extraction method and prediction algorithm were evaluated on three multi-label benchmark datasets. The prediction algorithms are two neural networks, revised from traditional neural networks so that they can handle multi-label prediction tasks. Comparative studies on the three benchmark datasets show that the RBF neural network achieves rather competitive performance compared with the BP neural network no matter which type of feature extraction is used, and the
Ethical standards
The experiments comply with the current laws of the country in which they were performed.
Acknowledgments
C. Huang would like to express his thanks to the reviewers for their suggestions. This study was supported by the Doctoral Program of Higher Education of China (Grant No. 20110073110018).
References (64)
Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. (2011)
Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. Biochem. Biophys. Res. Commun. (2006)
MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. (2007)
Recent progress in protein subcellular location prediction. Anal. Biochem. (2007)
PseAAC-builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem. (2012)
Spatio-temporal regulation of Rac1 localization and lamellipodia dynamics during epithelial cell–cell adhesion. Dev. Cell (2002)
Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. (2000)
Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. J. Theor. Biol. (2010)
Automated subcellular location determination and high-throughput microscopy. Dev. Cell (2007)
Predicting Gram-positive bacterial protein subcellular localization based on localization motifs. J. Theor. Biol. (2012)
Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning. J. Theor. Biol.
Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. J. Theor. Biol.
PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem.
iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J. Theor. Biol.
Feature selection for multi-label naive Bayes classification. Inform. Sci.
ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res.
Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet.
Predicting subcellular location of proteins using integrated-algorithm method. Mol. Divers.
Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics
iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res.
iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS ONE
Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins
Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst.
Large-scale predictions of Gram-negative bacterial protein subcellular locations. J. Proteome Res.
Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protocols
REVIEW: recent advances in developing web-servers for predicting protein attributes. Nat. Sci.
Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE
iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE
iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst.
Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC. Protein Pept. Lett.
Predicting drug-target interaction networks based on functional groups and biological features. PLoS ONE