Elsevier

Biosystems

Volume 98, Issue 2, November 2009, Pages 73-79

Predicting protein subnuclear localization using GO-amino-acid composition features

https://doi.org/10.1016/j.biosystems.2009.06.007

Abstract

The nucleus guides the life processes of cells. Many of the nuclear proteins participating in these processes tend to concentrate in subnuclear compartments. The subnuclear localization of nuclear proteins is hence important for a deeper understanding of the construction and functions of the nucleus. Recently, Gene Ontology (GO) annotation has been used for prediction of subnuclear localization. However, the effective use of GO terms in solving sequence-based prediction problems remains challenging, especially when query protein sequences have no accession number or annotated GO term. This study uses BLAST to obtain, for each query protein, homologs with known accession numbers and retrieves their GO terms for sequence-based subnuclear localization prediction. A prediction method, PGAC, which mines informative GO terms associated with amino acid composition features, is proposed to design a support vector machine-based classifier. PGAC yields 55 informative GO terms with training and test accuracies of 85.7% and 76.3%, respectively, using a data set SNL_35 (561 proteins in 9 localizations) with 35% sequence identity. Compared with Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix, PGAC using the data set SNL_80 yields a leave-one-out cross-validation accuracy of 81.1%, which is better than the 67.4% of Nuc-PLoc. Experimental results show that the set of informative GO terms is an effective feature set for protein subnuclear localization. The prediction server based on PGAC has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.

Introduction

The cell nucleus is a highly complex organelle that organizes the comprehensive assembly of genes and their corresponding regulatory factors. The nucleus guides the life processes of cells by directing their reproduction, controlling their differentiation and regulating their metabolic activities. Many of the nuclear proteins participating in these processes tend to concentrate in subnuclear compartments (Heidi et al., 2001). Knowledge of a protein's subnuclear localization can provide valuable clues about its molecular function, as well as the biological pathway in which it participates (Cocco et al., 2004).

A large number of computational methods for predicting protein subcellular localization exist in the literature and have achieved high accuracy (Bhasin and Raghava, 2004, Cai and Chou, 2004, Chou and Shen, 2006a, Chou and Shen, 2006b, Huang et al., 2008, Nair and Rost, 2005, Nanni and Lumini, 2006, Pierleoni et al., 2006, Sarda et al., 2005); they are systematically introduced in a recent review (Chou and Shen, 2007b), a step-by-step protocol paper (Chou and Shen, 2008) and a book chapter (Chou, 2009). However, the prediction of protein localization at the subnuclear level is far more challenging (Lei and Dai, 2006). We developed the first ProLoc system for this task, which uses an SVM with automatic selection of physicochemical properties and achieves considerable prediction accuracy (Huang et al., 2007b). In this work, we attempt to improve the performance of the system by incorporating information obtained from Gene Ontology (GO).

Gene Ontology, a controlled vocabulary of terms split into three related ontologies covering molecular function, biological process and cellular component (Ashburner et al., 2000), has been utilized to improve prediction of subcellular (Chou and Shen, 2006a, Chou and Shen, 2006b, Huang et al., 2008) and subnuclear localization (Lei and Dai, 2006, Shen and Chou, 2007b). Additionally, GO annotation has been used for various sequence-based prediction tasks, such as grouping GO terms to improve the assessment of gene set enrichment (Lewin and Grieve, 2006); using GO with probabilistic chain graphs for protein classification (Carroll and Pavlovic, 2006, Wolstencroft et al., 2006); using GO for analyzing the mouse basic/Helix-Loop-Helix transcription factor family (Li et al., 2006); using GO for identifying membrane proteins and their types (Cai et al., 2005); predicting the enzymatic attribute of proteins by hybridizing the gene product composition and pseudo amino acid composition (Cai et al., 2005); and predicting the transcription factor DNA binding preference (Qian et al., 2006). Querying a GO library to obtain GO terms requires the accession numbers of proteins. Therefore, the use of GO terms for solving sequence-based prediction problems is still worthy of study, especially when query protein sequences have no accession number or annotated GO term. Two ensemble classifiers, Hum-PLoc (Chou and Shen, 2006a) and Euk-OET-PLoc (Chou and Shen, 2006b), directly use the accession numbers of known proteins to obtain GO terms, so they do not work for novel proteins without known accession numbers. GO-AA (Lei and Dai, 2006) predicts the subnuclear localization of novel proteins using the GO terms of their homologs, which are retrieved by BLAST (Altschul et al., 1990).

Most, but not all, eukaryotic protein sequences in the UniProtKB/Swiss-Prot database (Apweiler et al., 2004) have annotated GO terms. For example, 3.96% of 2423 training proteins have homologs that are not annotated with GO terms (Huang et al., 2008). To handle proteins that do not have annotated GO terms, existing GO-based prediction methods such as GO-AA (Lei and Dai, 2006), Euk-OET-PLoc (Chou and Shen, 2006b) and Hum-PLoc (Chou and Shen, 2006a) use two separate modules: one that uses GO terms as input features (called the GO-based classifier) and another that uses sequence-based features (called the sequence-based classifier). The GO-based classifier is used for proteins with annotated GO terms. These proteins are represented as high-dimensional vectors of n binary features, where n is the total number of GO terms in the complete annotation set (a component of 1 indicates that the annotation is hit; otherwise, the component is 0). The sequence-based classifier is applied to proteins that have no corresponding GO terms.
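The binary representation and the two-module dispatch described above can be sketched as follows. This is a minimal illustration, not the code of any of the cited methods: the function names, the module labels and the four-term vocabulary are assumptions made for the example, and a real annotation set contains far more GO terms.

```python
from typing import Iterable, List, Sequence


def encode_go_terms(annotated: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Return an n-dimensional binary vector: 1 where a vocabulary term is hit, else 0."""
    hits = set(annotated)
    return [1 if term in hits else 0 for term in vocabulary]


def choose_module(annotated: Iterable[str]) -> str:
    """Pick the module of a two-module scheme (labels are hypothetical)."""
    return "go_based" if len(list(annotated)) > 0 else "sequence_based"


# Illustrative vocabulary of n = 4 GO identifiers (assumption for the example).
GO_VOCABULARY = ["GO:0005634", "GO:0005730", "GO:0016607", "GO:0005635"]

vector = encode_go_terms(["GO:0005730", "GO:0005634"], GO_VOCABULARY)
print(vector)                       # [1, 1, 0, 0]
print(choose_module([]))            # sequence_based
```

A protein whose homologs carry no GO annotation encodes to an all-zero vector, which is why the two-module methods fall back to a separate sequence-based classifier for such cases.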

This study proposes a prediction method, PGAC, for developing a single SVM-based classifier for sequence-based subnuclear localization prediction. First, BLAST is used to obtain homologs of the query protein with known accession numbers, from which GO terms are retrieved. Each protein sequence is represented by η = n + 20 GO-amino-acid composition (GAC) features, comprising the 20 features of the conventional amino acid composition (AAC) and n GO terms. Subsequently, a feature mining algorithm, GACmining, which is an extension of GOmining (Huang et al., 2008), uses an intelligent genetic algorithm (Ho et al., 2004a, Ho et al., 2004b) with an SVM classifier to identify simultaneously a small number m of the η GAC features and the parameter settings of the SVM, where m ≪ η.
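As a rough sketch of the GAC representation, the snippet below concatenates a 20-dimensional AAC vector with n binary GO features, giving η = n + 20 components, and then keeps only a chosen subset of m indices. The ordering (AAC first, GO terms second), the function names and the toy inputs are assumptions for illustration; the selection itself is performed by GACmining in the paper and is not reproduced here.

```python
from typing import Iterable, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each standard residue; non-standard letters are ignored."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def gac_features(sequence: str, go_terms: Iterable[str],
                 vocabulary: Sequence[str]) -> List[float]:
    """Concatenate 20 AAC features with n binary GO features (eta = n + 20)."""
    hits = set(go_terms)
    go_part = [1.0 if term in hits else 0.0 for term in vocabulary]
    return amino_acid_composition(sequence) + go_part


def select_features(features: Sequence[float], selected: Sequence[int]) -> List[float]:
    """Keep only the m selected components of an eta-dimensional vector."""
    return [features[i] for i in selected]


# Toy example: n = 3 GO terms, so eta = 23 (identifiers are illustrative).
vocab = ["GO:0005634", "GO:0005730", "GO:0016607"]
x = gac_features("MKKAAL", ["GO:0005730"], vocab)
print(len(x))                        # 23
print(select_features(x, [21]))      # [1.0]
```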

A data set SNL_35 of 561 subnuclear proteins with 35% sequence identity was established to evaluate the proposed prediction method. The data set SNL_35 was divided into two subsets, one for training (SNL_35L) and the other for independent testing (SNL_35T), to avoid homolog bias and overestimation of the method's performance. PGAC, when applied to the training data set SNL_35L, extracted m = 75 informative GAC features and yielded training and test accuracies of 85.7% and 76.3%, respectively. The Matthews correlation coefficient (MCC) (Hua and Sun, 2001, Huang et al., 2007b, Lei and Dai, 2006) values were 0.749 and 0.668 for training and independent testing, respectively. Compared with the existing method Nuc-PLoc, which combines the amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix (Shen and Chou, 2007b), PGAC yields a leave-one-out cross-validation accuracy of 81.1% (MCC = 0.691) on SNL_80, which is better than the 67.4% (MCC = 0.50) of Nuc-PLoc. The prediction server that is based on PGAC for protein subnuclear localization has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.
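For readers unfamiliar with the two measures, the following sketch computes accuracy and the MCC from the counts of a binary (one compartment versus the rest) confusion matrix, using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). How the per-compartment values are aggregated over the nine compartments is not stated in this excerpt, so the example stops at a single class; the counts are made up for illustration.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one compartment treated as the positive class.
print(accuracy(tp=40, tn=45, fp=5, fn=10))   # 0.85
print(round(mcc(tp=40, tn=45, fp=5, fn=10), 3))
```

Unlike accuracy, the MCC stays near zero for a classifier that always predicts the majority class, which makes it a useful companion measure on data sets with unevenly sized compartments.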

Section snippets

Data Sets

A data set SNL_80 with 80% sequence identity obtained from another work (Shen and Chou, 2007b) has 714 protein sequences in nine subnuclear compartments. The proteins in the data set were screened strictly using the following rules: (1) sequences with the same subnuclear location (SUBCELLULAR LOCATION) in the CC field might be annotated with different terms, so several keywords were used for the same subcellular location, e.g. in searching for nuclear envelope proteins, the keywords ‘nuclear

Proposed Mining Algorithm GACmining

The proposed PGAC was implemented based on a mining algorithm, GACmining, which is an extension of GOmining for feature selection (Huang et al., 2008). An analysis of the selected informative GO terms in the GO graph reveals that GOmining can consider the internal correlation within relevant features rather than individual features using an efficient global optimization method (Huang et al., 2008). GACmining uses an intelligent genetic algorithm (IGA, Ho et al., 2004b) associated with an

Effectiveness of Feature Selection

To evaluate a candidate set of r informative GAC features together with the SVM parameters, the prediction accuracy of 10-fold cross-validation (10-CV) serves as the fitness function of IGA. Fig. 3 shows that the training accuracies of PGAC for r = 40, 41, …, 80 were higher than those of SVM-RBS for SNL_35L and SNL_80L, where SVM-RBS uses an SVM with r informative GAC features selected by the rank-based selection (RBS) method (Li et al., 2004, Tung and Ho, 2007). One previous work in ProLoc-GO (Huang et
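The idea of using cross-validation accuracy as a fitness value for a candidate feature subset can be sketched as below. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the SVM, the folds are interleaved rather than randomized, the SVM parameters that the paper tunes jointly are omitted, and neither the IGA nor the RBS method is reproduced. All names and the toy data are hypothetical.

```python
from typing import List, Sequence


def nearest_centroid(train_x, train_y, test_x) -> List[int]:
    """Stand-in classifier (NOT an SVM): assign each test vector to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(sorted(centroids), key=lambda c: squared_distance(x, centroids[c]))
            for x in test_x]


def cv_fitness(data_x: Sequence[Sequence[float]], data_y: Sequence[int],
               selected: Sequence[int], classify=nearest_centroid, folds: int = 10) -> float:
    """k-fold cross-validation accuracy using only the selected feature indices."""
    reduced = [[row[i] for i in selected] for row in data_x]
    n, correct = len(reduced), 0
    for k in range(folds):
        test_idx = list(range(k, n, folds))        # interleaved fold k
        if not test_idx:
            continue
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        predictions = classify([reduced[i] for i in train_idx],
                               [data_y[i] for i in train_idx],
                               [reduced[i] for i in test_idx])
        correct += sum(p == data_y[i] for p, i in zip(predictions, test_idx))
    return correct / n                              # every sample is tested exactly once


# Toy data: feature 0 equals the class label, feature 1 is uninformative.
labels = [i % 2 for i in range(20)]
samples = [[float(labels[i]), (i * 7 % 5) / 5.0] for i in range(20)]
print(cv_fitness(samples, labels, selected=[0]))   # 1.0
```

A search procedure would call such a function once per candidate subset and prefer subsets with higher fitness, which is the role the 10-CV accuracy plays for IGA in the text above.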

Conclusions

This study not only investigated the prediction of protein subnuclear localization by studying the features of GO annotation, but also developed a generalized method for deriving a GO-based feature set to be used with a specified classifier such as SVM to predict the functions or properties of protein sequences. A single-classifier prediction method PGAC was proposed to predict protein subnuclear localization. The SVM classifier used informative GAC features, consisting of 20 AAC features, nine

Acknowledgements

The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers NSC 97-2218-E-243-002, NSC 96-2628-E-009-141-MY3 and NSC 97-2627-B-009-005.

Contributions: W.L. Huang designed the system, implemented programs, participated in manuscript preparation and carried out the detailed study. C.W. Tung, W.L. Huang and H.L. Huang designed the system and implemented programs. S.Y. Ho and W.L. Huang conceived the idea of this work.

References (45)

  • B.W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975)
  • R. Nair et al. Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol. (2005)
  • Z. Qian et al. A novel computational method to predict transcription factor DNA binding preference. Biochem. Biophys. Res. Commun. (2006)
  • H.B. Shen et al. Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition. Biochem. Biophys. Res. Commun. (2005)
  • H.B. Shen et al. Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochem. Biophys. Res. Commun. (2007)
  • S.F. Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997)
  • R. Apweiler et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. (2004)
  • M. Ashburner et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000)
  • M. Bhasin et al. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. (2004)
  • S. Carroll et al. Protein classification using probabilistic chain graphs and the Gene Ontology structure. Bioinformatics (2006)
  • Chang, C.C., Lin, C.J., 2001. LIBSVM: A library for support vector machines. Software available at...
  • K.C. Chou et al. Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. J. Proteome Res. (2007)