Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features

Applied Intelligence

Abstract

As an important attribute of proteins, subcellular location can provide valuable information about protein function. Determining protein subcellular locations experimentally is usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict protein subcellular locations from the locations of known proteins. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations, and further study is still needed. In this paper, we propose an ensemble learning framework that uses a modified Weighted K-Nearest Neighbors (WKNN) algorithm as the base learner. Two types of features are extracted from the training data: one based on protein amino acid composition (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and one based on protein sequence similarity (Protein Similarity Measure, or PSM). Two base classifiers are trained separately on these two feature types, and each assigns a query protein a probability distribution over locations. Based on the outputs of the two base classifiers, a novel ensemble strategy named Maximized Probability on Label (MPoL) is proposed. The strategy produces a final set of locations for each protein by integrating the predictions of the base classifiers through an optimization procedure. To measure prediction quality, two types of evaluation metrics are used: example-based and label-based. To evaluate the approach objectively, we compare its results with those of another popular method, iLoc-Animal, on a benchmark dataset through cross-validation. Results show that, in terms of absolute true success rate on multi-location prediction, MPoL achieves much better results than iLoc-Animal. This suggests that the proposed method has potential for solving a diverse set of multi-label learning problems.
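The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify how WKNN is modified or how the MPoL optimization works, so the code below substitutes a plain inverse-distance weighted KNN and a simple average-and-threshold fusion. The function names, the toy feature vectors, `k`, and `threshold` are all hypothetical. The last function computes the absolute true rate as it is commonly defined for multi-label prediction: the fraction of proteins whose predicted location set exactly matches the true set.

```python
import math
from collections import defaultdict


def wknn_label_scores(query, train_X, train_Y, k=3, eps=1e-9):
    """Inverse-distance weighted KNN for multi-label data.

    Returns a dict mapping each location label seen among the k nearest
    neighbours to a normalized score (scores sum to 1).
    Assumption: plain Euclidean distance, not the paper's modified WKNN.
    """
    neighbours = sorted(
        ((math.dist(query, x), labels) for x, labels in zip(train_X, train_Y)),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for distance, labels in neighbours:
        weight = 1.0 / (distance + eps)  # closer neighbours vote more strongly
        for label in labels:
            scores[label] += weight
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def fuse_by_average(p1, p2, threshold=0.3):
    """Combine two label distributions by averaging, then thresholding.

    Assumption: this stands in for MPoL, whose optimization procedure
    is not described in the abstract.
    """
    labels = set(p1) | set(p2)
    avg = {lab: 0.5 * (p1.get(lab, 0.0) + p2.get(lab, 0.0)) for lab in labels}
    chosen = {lab for lab, p in avg.items() if p >= threshold}
    # Always predict at least one location: fall back to the top label.
    return chosen or {max(avg, key=avg.get)}


def absolute_true_rate(predicted, actual):
    """Fraction of examples whose predicted label set matches exactly."""
    hits = sum(1 for p, a in zip(predicted, actual) if set(p) == set(a))
    return hits / len(actual)


# Toy data: two hypothetical feature views of the same four proteins.
train_Y = [{"nucleus"}, {"nucleus", "cytoplasm"}, {"membrane"}, {"cytoplasm"}]
view_a = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (1.0, 0.0)]  # composition-like
view_b = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (4.0, 4.0)]  # similarity-like

p_a = wknn_label_scores((0.0, 0.5), view_a, train_Y, k=2)
p_b = wknn_label_scores((1.0, 1.5), view_b, train_Y, k=2)
predicted_locations = fuse_by_average(p_a, p_b)
```

Exact-match scoring is strict: a prediction that recovers one of two true locations counts as a miss, which is why the absolute true rate is the hardest of the example-based metrics to improve.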

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61302128) and the Science and Technology Foundation of University of Jinan (Grant No. XKY1402). JL was supported in part by the National Science Foundation (grant III1162374) and the National Institutes of Health (HG008632).

Author information

Corresponding author

Correspondence to Shanping Qiao.

About this article

Cite this article

Qiao, S., Yan, B. & Li, J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features. Appl Intell 48, 1813–1824 (2018). https://doi.org/10.1007/s10489-017-1029-6

  • DOI: https://doi.org/10.1007/s10489-017-1029-6
