Abstract
Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and for drug repositioning, and protein subcellular localization is imperative for that characterization. Diverse techniques are used to extract features from protein sequences; however, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues. The augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The prediction results show that C4.5 reports the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. Similarly, for the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
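To make the idea of feature augmentation concrete, the following is a minimal sketch, not the authors' implementation. It assumes two simple sequence-induced descriptors, amino acid composition and dipeptide composition, as stand-ins for the augmented features named in the abstract; the physicochemical and evolutionary descriptors are not reproduced here. All function names (`amino_acid_composition`, `dipeptide_composition`, `augmented_features`, `five_fold_indices`) are hypothetical, and the fold splitter is a plain unstratified illustration of fivefold cross-validation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def _standard_residues(seq):
    """Upper-case the sequence and drop any non-standard symbols."""
    return [r for r in seq.upper() if r in AMINO_ACIDS]


def amino_acid_composition(seq):
    """Fraction of each standard residue in the sequence (20 values)."""
    residues = _standard_residues(seq)
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values).

    Pairs are formed after non-standard symbols are dropped, so this
    descriptor keeps local sequence-order information only.
    """
    residues = _standard_residues(seq)
    pairs = ["".join(p) for p in zip(residues, residues[1:])]
    total = len(pairs)
    counts = {d: 0 for d in DIPEPTIDES}
    for pair in pairs:
        counts[pair] += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def augmented_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def five_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    vector = augmented_features("MKKLLA")  # toy sequence, not from the datasets
    print(len(vector))
    for train_idx, test_idx in five_fold_indices(12):
        print(len(train_idx), len(test_idx))
```

In practice each classifier would be trained on the training indices of a fold and scored on the held-out indices, with the five scores averaged; shuffling and class stratification, which this sketch omits, are usually advisable for imbalanced localization classes.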




Availability of data and material
The Gram-Positive benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/ and the Gram-Negative benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/ (both as of 20 August 2020).
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Agrawal, S., Sisodia, D.S. & Nagwani, N.K. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 59, 2297–2310 (2021). https://doi.org/10.1007/s11517-021-02436-5
DOI: https://doi.org/10.1007/s11517-021-02436-5