Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences

  • Original Article
Medical & Biological Engineering & Computing

Abstract

Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is important for investigating disease symptoms and for drug repositioning, and protein subcellular localization is essential to that characterization. Diverse techniques are used to extract features from protein sequences, but a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The prediction results show that C4.5 reports the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
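The idea of feature augmentation followed by fivefold cross-validation can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not name the exact descriptors, so the amino acid composition, the three residue groups, and the function names `aac`, `group_fractions`, `augment`, and `five_fold_indices` are all assumptions chosen for illustration, and the evolutionary features are omitted entirely.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative residue groups (an assumption, not the paper's descriptors).
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGPH"),
    "charged": set("DEKR"),
}


def aac(seq):
    """Sequence-induced part: 20-dimensional amino acid composition."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def group_fractions(seq):
    """Physicochemical part: fraction of residues in each illustrative group."""
    if not seq:
        raise ValueError("empty sequence")
    return [sum(r in members for r in seq) / len(seq)
            for members in GROUPS.values()]


def augment(seq):
    """Augmented feature vector: concatenation of the two descriptor sets."""
    return aac(seq) + group_fractions(seq)


def five_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds round-robin; shuffling or stratification
    by class label would be added in a real experiment.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

Each sequence would be mapped through `augment`, and the resulting vectors passed to any of the listed classifiers once per `(train, test)` split, with the overall accuracy averaged over the five folds.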

Availability of data and material

The Gram-Positive benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/ and the Gram-Negative benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/ (as of 20 August 2020).

Author information

Corresponding author

Correspondence to Saurabh Agrawal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Agrawal, S., Sisodia, D.S. & Nagwani, N.K. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 59, 2297–2310 (2021). https://doi.org/10.1007/s11517-021-02436-5

  • DOI: https://doi.org/10.1007/s11517-021-02436-5
