Abstract
From the perspective of machine learning, predicting the subcellular localization of multi-location proteins is a multi-label classification problem. Conventional multi-label classifiers typically compare pattern-matching scores against a fixed decision threshold to determine the number of subcellular locations in which a protein will reside. This simple strategy, however, can easily lead to over-prediction because of a large number of false positives. To address this problem, this paper proposes a more powerful multi-label predictor, namely AD–SVM, which incorporates an adaptive-decision (AD) scheme into multi-label support vector machine (SVM) classifiers. Specifically, given a query protein, a term-frequency based gene ontology vector is constructed by successively searching the gene ontology annotation database. Subsequently, the feature vector is classified by AD–SVM, which extends the binary relevance method with an adaptive decision scheme that essentially converts the linear SVMs to piecewise linear SVMs. Experimental results suggest that AD–SVM outperforms existing state-of-the-art multi-location predictors by at least 4% (absolute) on a stringent virus dataset and by at least 1% (absolute) on a stringent plant dataset. Results also show that the adaptive-decision scheme can effectively reduce over-prediction while having an insignificant effect on the correctly predicted ones.
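As a rough illustration of the two steps described in the abstract, the sketch below builds a term-frequency vector from gene ontology terms and then applies an adaptive decision to per-location SVM scores. The abstract does not give the exact decision rule, so the rule used here is an assumption: the threshold scales with the highest score and is capped at 1. The function names, the parameter `theta`, and the example GO terms and scores are hypothetical and chosen only for illustration.

```python
from collections import Counter


def go_term_frequency_vector(go_hits, vocabulary):
    """Count how often each GO term of `vocabulary` occurs in `go_hits`.

    go_hits:    GO terms retrieved for one query protein (repeats allowed).
    vocabulary: ordered list of GO terms that defines the vector dimensions.
    """
    counts = Counter(go_hits)
    return [counts[term] for term in vocabulary]


def adaptive_decision(scores, theta=0.5):
    """Return the indices of the predicted locations for one protein.

    scores: one SVM score per location (binary relevance: one SVM per label).
    theta:  hypothetical parameter in [0, 1]; theta = 0 reduces to the
            conventional fixed threshold at zero.
    """
    s_max = max(scores)
    if s_max <= 0.0:
        # No SVM gives a positive score: keep only the best-scoring location.
        return [scores.index(s_max)]
    # Assumed rule: the threshold grows with the top score, capped at 1.
    threshold = min(1.0, theta * s_max)
    return [m for m, s in enumerate(scores) if s > 0.0 and s >= threshold]


# Illustrative inputs only; not data from the paper.
vocabulary = ["GO:0005634", "GO:0005737", "GO:0016020"]
hits = ["GO:0005634", "GO:0005737", "GO:0005634"]
features = go_term_frequency_vector(hits, vocabulary)

scores = [1.6, 0.2, -0.4, 0.9]
fixed = adaptive_decision(scores, theta=0.0)
adaptive = adaptive_decision(scores, theta=0.5)
```

With these made-up scores, the fixed threshold accepts every location with a positive score, whereas the adaptive threshold drops the weakly supported location with score 0.2. Because the threshold depends on the highest score, the resulting decision boundary is no longer a single hyperplane per label, which is one way to read the abstract's remark about piecewise linear SVMs.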
Notes
An SVM score larger than one means that the test protein falls beyond the margin of separation; therefore, the confidence is fairly high.
Here, \(N=207\) for the virus dataset and \(N=978\) for the plant dataset.
Acknowledgment
This work was in part supported by the RGC of Hong Kong SAR (Grant No. PolyU 152117/14E).
Cite this article
Wan, S., Mak, MW. Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme. Int. J. Mach. Learn. & Cyber. 9, 399–411 (2018). https://doi.org/10.1007/s13042-015-0460-4
DOI: https://doi.org/10.1007/s13042-015-0460-4