Predicting protein subnuclear localization using GO-amino-acid composition features

doi:10.1016/j.biosystems.2009.06.007

Biosystems

Volume 98, Issue 2, November 2009, Pages 73-79

Abstract

The nucleus guides the life processes of cells. Many of the nuclear proteins participating in these processes tend to concentrate in subnuclear compartments. The subnuclear localization of nuclear proteins is hence important for a deeper understanding of the construction and functions of the nucleus. Recently, Gene Ontology (GO) annotation has been used for prediction of subnuclear localization. However, the effective use of GO terms in solving sequence-based prediction problems remains challenging, especially when query protein sequences have no accession number or annotated GO term. This study uses BLAST to obtain homologs of query proteins that have known accession numbers, and retrieves their GO terms for sequence-based subnuclear localization prediction. A prediction method, PGAC, which mines informative GO terms associated with amino acid composition features, is proposed to design a support vector machine-based classifier. PGAC yields 55 informative GO terms with training and test accuracies of 85.7% and 76.3%, respectively, using the data set SNL_35 (561 proteins in 9 localizations) with 35% sequence identity. In a comparison with Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix, PGAC yields a leave-one-out cross-validation accuracy of 81.1% on the data set SNL_80, which is better than the 67.4% of Nuc-PLoc. Experimental results show that the set of informative GO terms is an effective feature set for protein subnuclear localization. The prediction server based on PGAC has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.

Introduction

The cell nucleus is a highly complex organelle that organizes the comprehensive assembly of genes and their corresponding regulatory factors. The nucleus guides the life processes of cells by directing their reproduction, controlling their differentiation and regulating their metabolic activities. Many of the nuclear proteins participating in these processes tend to concentrate in subnuclear compartments (Heidi et al., 2001). Knowledge of a protein's subnuclear localization can provide valuable clues about its molecular function, as well as the biological pathway in which it participates (Cocco et al., 2004).

A large body of computational methods for predicting protein subcellular localization exists in the literature, and many of them achieve high accuracy (Bhasin and Raghava, 2004, Cai and Chou, 2004, Chou and Shen, 2006a, Chou and Shen, 2006b, Huang et al., 2008, Nair and Rost, 2005, Nanni and Lumini, 2006, Pierleoni et al., 2006, Sarda et al., 2005). These methods are systematically introduced in a recent review (Chou and Shen, 2007b), a step-by-step protocol paper (Chou and Shen, 2008) and a book chapter (Chou, 2009). However, the prediction of protein localization at the subnuclear level is far more challenging (Lei and Dai, 2006). We previously developed the first ProLoc system, which uses an SVM with automatic selection from physicochemical properties and achieves considerable prediction accuracy on this task (Huang et al., 2007b). In this work, we attempt to improve the performance of the system by incorporating information obtained from Gene Ontology (GO).

Gene Ontology, which is a controlled vocabulary of terms split into three related ontologies covering molecular function, biological process and cellular component (Ashburner et al., 2000), has been utilized to improve prediction of subcellular (Chou and Shen, 2006a, Chou and Shen, 2006b, Huang et al., 2008) and subnuclear localization (Lei and Dai, 2006, Shen and Chou, 2007b). Additionally, GO annotation has been used for various other prediction tasks, such as grouping GO terms to improve the assessment of gene set enrichment (Lewin and Grieve, 2006); using GO with probabilistic chain graphs for protein classification (Carroll and Pavlovic, 2006, Wolstencroft et al., 2006); using GO for analyzing the mouse basic/Helix-Loop-Helix transcription factor family (Li et al., 2006); using GO for identifying membrane proteins and their types (Cai et al., 2005); predicting the enzymatic attribute of proteins by hybridizing the gene product composition and pseudo amino acid composition (Cai et al., 2005); and predicting the transcription factor DNA binding preference (Qian et al., 2006). Querying a GO library to obtain GO terms requires the accession numbers of proteins. Therefore, the use of GO terms for solving sequence-based prediction problems is still worthy of study, especially when query protein sequences have no accession number or annotated GO term. Two ensemble classifiers, Hum-PLoc (Chou and Shen, 2006a) and Euk-OET-PLoc (Chou and Shen, 2006b), directly use the accession numbers of known proteins to obtain GO terms, so they cannot predict novel proteins without known accession numbers. GO-AA (Lei and Dai, 2006) instead uses the GO terms of homologs retrieved by BLAST (Altschul et al., 1990) to predict the subnuclear localization of novel proteins.

Most, but not all, eukaryotic protein sequences in the UniProtKB/Swiss-Prot database (Apweiler et al., 2004) have annotated GO terms. For example, among 2423 training proteins, 3.96% have homologs that are not annotated with GO terms (Huang et al., 2008). To predict the proteins that do not have annotated GO terms, existing GO-based prediction methods such as GO-AA (Lei and Dai, 2006), Euk-OET-PLoc (Chou and Shen, 2006b) and Hum-PLoc (Chou and Shen, 2006a) use two separate modules: one that uses GO terms as input features (called the GO-based classifier) and another that uses sequence-based features (called the sequence-based classifier). The GO-based classifier is used for proteins with annotated GO terms. These proteins are represented as high-dimensional vectors of n binary features, where n is the total number of GO terms in the complete annotation set (a component of 1 indicates that the protein is annotated with the corresponding term; otherwise, the component is 0). The sequence-based classifier is applied to proteins that have no corresponding GO terms.
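The binary GO representation described above can be illustrated with a minimal Python sketch. The function name encode_go_terms, the four-term vocabulary and the example annotations are assumptions made only for illustration; real annotation sets contain far more terms, and this is not the code of any of the cited predictors.

```python
from typing import Iterable, List, Sequence


def encode_go_terms(annotated: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return an n-dimensional 0/1 vector over the GO vocabulary.

    Component i is 1 if the protein is annotated with vocabulary[i], else 0.
    """
    hits = set(annotated)
    return [1 if term in hits else 0 for term in vocabulary]


# Hypothetical vocabulary of n = 4 GO identifiers (illustrative only).
vocabulary = ["GO:0005634", "GO:0005730", "GO:0016607", "GO:0005635"]

# Hypothetical protein annotated with two of the four terms.
vector = encode_go_terms({"GO:0005730", "GO:0005634"}, vocabulary)
print(vector)

# A protein without any annotated term maps to the all-zero vector,
# which is the case the two-module methods hand to a sequence-based classifier.
unannotated = encode_go_terms([], vocabulary)
```

An all-zero vector carries no information for a GO-based classifier, which is one way to see why the two-module methods fall back on sequence-based features.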

This study proposes a prediction method, PGAC, for developing a single SVM-based classifier for sequence-based subnuclear localization prediction. First, BLAST is used to find homologs of the query protein that have known accession numbers, from which GO terms are retrieved. Each protein sequence is then represented by η = n + 20 GO-amino-acid composition (GAC) features, comprising the 20 features of the conventional amino acid composition (AAC) and n GO terms. Subsequently, a feature mining algorithm, GACmining, which is an extension of GOmining (Huang et al., 2008), uses an intelligent genetic algorithm (Ho et al., 2004a, Ho et al., 2004b) with an SVM classifier to identify simultaneously a small number m of the η GAC features and the parameter settings of the SVM, where m ≪ η.
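As a rough illustration of the GAC representation, the following Python sketch builds an (n + 20)-dimensional vector from a sequence and a set of GO terms assumed to have already been retrieved from BLAST homologs. The function names, the ordering (AAC first, GO terms second), the toy sequence and the three-term vocabulary are all assumptions for illustration; the BLAST search and GO lookup are not shown.

```python
from typing import List, Sequence, Set

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each of the 20 standard residues in the sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


def gac_features(sequence: str, homolog_go_terms: Set[str],
                 vocabulary: Sequence[str]) -> List[float]:
    """Concatenate 20 AAC features with n binary GO features (assumed order)."""
    go_bits = [1.0 if term in homolog_go_terms else 0.0 for term in vocabulary]
    return amino_acid_composition(sequence) + go_bits


# Hypothetical inputs: a toy sequence and GO terms taken from its homologs.
vocabulary = ["GO:0005634", "GO:0005730", "GO:0016607"]
features = gac_features("ACCA", {"GO:0005730"}, vocabulary)
print(len(features))  # n + 20 features in total
```

Because the AAC part is always available, a single vector of this form can be passed to one classifier even when no GO term is retrieved, which is consistent with the single-classifier design described above.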

A data set SNL_35 of 561 subnuclear proteins with 35% sequence identity was established to evaluate the proposed prediction method. SNL_35 was divided into two subsets, one for training (SNL_35L) and the other for independent testing (SNL_35T), to avoid homology bias and overestimation of the method's performance. PGAC, when applied to the training data set SNL_35L, extracted m = 75 informative GAC features and yielded training and test accuracies of 85.7% and 76.3%, respectively. The Matthews correlation coefficients (MCC) (Hua and Sun, 2001, Huang et al., 2007b, Lei and Dai, 2006) were 0.749 and 0.668 for training and independent testing, respectively. In a comparison with the existing method Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix (Shen and Chou, 2007b), PGAC yields a leave-one-out cross-validation accuracy of 81.1% (MCC = 0.691) on SNL_80, which is better than the 67.4% (MCC = 0.50) of Nuc-PLoc. The prediction server based on PGAC for protein subnuclear localization has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.
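For readers unfamiliar with the MCC, the following Python sketch computes the standard binary Matthews correlation coefficient for one localization class treated one-vs-rest. How the per-class values are aggregated over the nine compartments is not specified in this excerpt, so no averaging is shown; the function names and the toy labels are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple


def one_vs_rest_counts(y_true: Sequence[str], y_pred: Sequence[str],
                       label: str) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for `label` against all other classes."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == label and pred == label:
            tp += 1
        elif truth != label and pred != label:
            tn += 1
        elif truth != label and pred == label:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC; returns 0.0 when the denominator vanishes (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example with hypothetical labels.
y_true = ["nucleolus", "nucleolus", "speckle", "speckle"]
y_pred = ["nucleolus", "speckle", "speckle", "speckle"]
counts = one_vs_rest_counts(y_true, y_pred, "nucleolus")
mcc = matthews_corrcoef(*counts)
print(counts, round(mcc, 3))
```

Unlike plain accuracy, the MCC accounts for all four cells of the confusion table, which makes it more informative when the classes are of unequal size.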

Section snippets

Data Sets

A data set SNL_80 with 80% sequence identity, obtained from another work (Shen and Chou, 2007b), has 714 protein sequences in nine subnuclear compartments. The proteins in the data set were screened strictly using the following rules: (1) sequences with the same subnuclear location (SUBCELLULAR LOCATION) in the CC field might be annotated with different terms, so several keywords were used for the same subcellular location, e.g. when searching for nuclear envelope proteins, the keywords ‘nuclear

Proposed Mining Algorithm GACmining

The proposed PGAC was implemented based on a mining algorithm, GACmining, which is an extension of GOmining for feature selection (Huang et al., 2008). An analysis of the selected informative GO terms in the GO graph reveals that GOmining can consider the internal correlation within relevant features, rather than individual features, by using an efficient global optimization method (Huang et al., 2008). GACmining uses an intelligent genetic algorithm (IGA, Ho et al., 2004b) associated with an

Effectiveness of Feature Selection

The prediction accuracy of 10-fold cross-validation (10-CV) serves as the fitness function of IGA for evaluating a candidate set of r informative GAC features together with the SVM parameters. Fig. 3 shows that the training accuracies of PGAC for r = 40, 41, …, 80 were higher than those of SVM-RBS on SNL_35L and SNL_80L, where SVM-RBS denotes an SVM that uses r informative GAC features chosen by the rank-based selection (RBS) method (Li et al., 2004, Tung and Ho, 2007). One previous work in ProLoc-GO (Huang et
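The idea of using cross-validation accuracy as the fitness of a candidate feature subset can be sketched in Python as follows. This is only an illustration under strong assumptions: a simple nearest-centroid classifier stands in for the SVM, the SVM parameters and the IGA search itself are omitted, folds are assigned by index modulo k, and all names and data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def nearest_centroid_trainer(X: List[List[float]], y: List[str]) -> Callable[[Vector], str]:
    """Stand-in classifier (NOT the SVM used in the paper)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(vec: Vector) -> str:
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])))

    return predict


def cv_fitness(X: Sequence[Vector], y: Sequence[str], mask: Sequence[int],
               train=nearest_centroid_trainer, k: int = 10) -> float:
    """k-fold cross-validation accuracy using only the features where mask[i] == 1."""
    if not X:
        return 0.0
    reduced = [[v for v, keep in zip(row, mask) if keep] for row in X]
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(len(reduced)) if i % k == fold]
        train_idx = [i for i in range(len(reduced)) if i % k != fold]
        if not test_idx or not train_idx:
            continue
        predict = train([reduced[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(reduced[i]) == y[i] for i in test_idx)
    return correct / len(reduced)


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [0.9, -5.0]]
y = ["a", "a", "b", "b"]
informative = cv_fitness(X, y, mask=[1, 0], k=2)
uninformative = cv_fitness(X, y, mask=[0, 1], k=2)
print(informative, uninformative)
```

A search procedure such as a genetic algorithm would call a fitness function of this kind for each candidate mask and favor masks with higher cross-validation accuracy.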

Conclusions

This study not only investigated the prediction of protein subnuclear localization by studying the features of GO annotation, but also developed a generalized method for deriving a GO-based feature set to be used with a specified classifier such as SVM to predict the functions or properties of protein sequences. A single-classifier prediction method PGAC was proposed to predict protein subnuclear localization. The SVM classifier used informative GAC features, consisting of 20 AAC features, nine

Acknowledgements

The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers NSC 97-2218-E-243-002, NSC 96-2628-E-009-141-MY3 and NSC 97-2627-B-009-005.

Contributions: W.L. Huang designed the system, implemented programs, participated in manuscript preparation and carried out the detailed study. C.W. Tung, W.L. Huang and H.L. Huang designed the system and implemented programs. S.Y. Ho and W.L. Huang conceived the idea of this work.

References (45)

S.F. Altschul et al. Basic local alignment search tool. J. Mol. Biol. (1990)
Y.D. Cai et al. Predicting 22 protein localizations in budding yeast. Biochem. Biophys. Res. Commun. (2004)
Y.D. Cai et al. Predicting enzyme family classes by hybridizing gene product composition and pseudo-amino acid composition. J. Theor. Biol. (2005)
J. Cedano et al. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. (1997)
K.C. Chou et al. Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. Biochem. Biophys. Res. Commun. (2006)
K.C. Chou et al. Recent progress in protein subcellular location prediction. Anal. Biochem. (2007)
L. Cocco et al. Significance of subnuclear localization of key players of inositol lipid cycle. Adv. Enzyme Regul. (2004)
S.Y. Ho et al. Interpretable gene expression classifier with an accurate and compact fuzzy rule base for microarray data analysis. BioSystems (2006)
W.L. Huang et al. ProLoc: Prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features. BioSystems (2007)
J. Li et al. Identification and analysis of the mouse basic/Helix-Loop-Helix transcription factor family. Biochem. Biophys. Res. Commun. (2006)
B.W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975)
R. Nair et al. Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol. (2005)
Z. Qian et al. A novel computational method to predict transcription factor DNA binding preference. Biochem. Biophys. Res. Commun. (2006)
H.B. Shen et al. Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochem. Biophys. Res. Commun. (2005)
H.B. Shen et al. Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem. Biophys. Res. Commun. (2007)
S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
R. Apweiler et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. (2004)
M. Ashburner et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000)
M. Bhasin et al. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. (2004)
S. Carroll et al. Protein classification using probabilistic chain graphs and the Gene Ontology structure. Bioinformatics (2006)
Chang, C.C., Lin, C.J., 2001. LIBSVM: A library for support vector machines. Software available at...
K.C. Chou et al. Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. J. Proteome Res. (2007)
