
Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

From the perspective of machine learning, predicting the subcellular localization of multi-location proteins is a multi-label classification problem. Conventional multi-label classifiers typically compare pattern-matching scores with a fixed decision threshold to determine the number of subcellular locations in which a protein resides. This simple strategy, however, can easily lead to over-prediction because it produces a large number of false positives. To address this problem, this paper proposes a more powerful multi-label predictor, AD–SVM, which incorporates an adaptive-decision (AD) scheme into multi-label support vector machine (SVM) classifiers. Specifically, given a query protein, a term-frequency-based gene ontology vector is constructed by successively searching the gene ontology annotation database. The feature vector is then classified by AD–SVM, which extends the binary relevance method with an adaptive-decision scheme that essentially converts the linear SVMs into piecewise-linear SVMs. Experimental results suggest that AD–SVM outperforms existing state-of-the-art multi-location predictors by at least 4% (absolute) on a stringent virus dataset and 1% (absolute) on a stringent plant dataset. The results also show that the adaptive-decision scheme effectively reduces over-prediction while having an insignificant effect on correct predictions.
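The contrast between a fixed threshold and an adaptive decision can be illustrated with a minimal Python sketch. The abstract does not state the exact decision rule, so the rule below is an assumption made for illustration only: a class is kept when its score is positive and at least min(1.0, theta * s_max), where s_max is the largest score for the query protein and theta is a hypothetical parameter in [0, 1]. The function names and the example scores are also hypothetical.

```python
from typing import List, Set


def fixed_decision(scores: List[float]) -> Set[int]:
    """Binary-relevance baseline: keep every class with a positive score."""
    return {m for m, s in enumerate(scores) if s > 0.0}


def adaptive_decision(scores: List[float], theta: float = 0.5) -> Set[int]:
    """Adaptive decision over per-class SVM scores (illustrative rule only).

    Assumed rule, not taken from the abstract:
      * if no score is positive, return the single top-scoring class;
      * otherwise keep each positive class whose score is at least
        min(1.0, theta * s_max), where s_max is the largest score.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")

    s_max = max(scores)
    if s_max <= 0.0:
        # No class is confidently positive: fall back to the best guess.
        return {scores.index(s_max)}

    # The threshold rises with the top score but is capped at 1.0, so any
    # class scoring beyond the SVM margin is always retained.
    threshold = min(1.0, theta * s_max)
    return {m for m, s in enumerate(scores) if s > 0.0 and s >= threshold}


# Hypothetical scores for one query protein over four locations.
example_scores = [1.4, 0.3, -0.2, 1.1]
baseline_labels = fixed_decision(example_scores)
adaptive_labels = adaptive_decision(example_scores, theta=0.5)
```

Under this assumed rule, theta = 0 reduces to the fixed-threshold baseline, while larger values of theta discard weakly positive classes and so reduce over-prediction. Because the threshold depends on the largest score of each query, the decision boundary is no longer a single linear one, which is one way to read the abstract's "piecewise-linear" description.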


Notes

  1. http://www.geneontology.org.

  2. http://www.ebi.ac.uk/GOA.

  3. An SVM score larger than one means that the test protein falls beyond the margin of separation; therefore, the confidence in the prediction is fairly high.

  4. Here, \(N=207\) for the virus dataset and \(N=978\) for the plant dataset.


Acknowledgment

This work was supported in part by the RGC of Hong Kong SAR (Grant No. PolyU 152117/14E).

Author information

Corresponding authors

Correspondence to Shibiao Wan or Man-Wai Mak.


Cite this article

Wan, S., Mak, MW. Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme. Int. J. Mach. Learn. & Cyber. 9, 399–411 (2018). https://doi.org/10.1007/s13042-015-0460-4


  • DOI: https://doi.org/10.1007/s13042-015-0460-4
