Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences

Agrawal, Saurabh; Sisodia, Dilip Singh; Nagwani, Naresh Kumar

doi:10.1007/s11517-021-02436-5

Original Article
Published: 20 September 2021

Volume 59, pages 2297–2310 (2021)

Medical & Biological Engineering & Computing

Saurabh Agrawal¹,
Dilip Singh Sisodia¹ &
Naresh Kumar Nagwani¹


Abstract

Advances in high-throughput techniques have produced a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and for drug repositioning, and protein subcellular localization is imperative for that characterization. Diverse techniques are used to extract features from protein sequences; however, a single feature extraction technique often leads to poor prediction performance. In this paper, two feature augmentations are described that combine sequence-induced, physicochemical, and evolutionary information of the amino acid residues, so that the augmented features preserve both sequence-order information and protein-residue properties. Two bacterial protein datasets, Gram-Positive (G+) and Gram-Negative (G-), are used for the experimental work. After essential preprocessing of the protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train individual and ensemble classifiers, namely decision tree (C4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF), with fivefold cross-validation. The prediction results show that C4.5 reports the highest overall accuracy on known protein sequences: 99.57% on the G+ dataset and 97.47% on the G- dataset. For the UPS, the overall accuracy is 85.17% on the G+ dataset with SVM and 82.45% on the G- dataset with MLP.
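The idea of feature augmentation with fivefold cross-validation can be illustrated with a minimal sketch. The abstract does not name the exact descriptors, so amino acid composition and dipeptide composition are used here purely as stand-ins for an order-free and an order-preserving feature; the function names (`augmented_features`, `five_fold_indices`) are hypothetical and this is not the authors' implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def _clean(seq):
    """Upper-case the sequence and drop non-standard symbols (an assumption)."""
    return [r for r in seq.upper() if r in AMINO_ACIDS]


def amino_acid_composition(seq):
    """Fraction of each standard residue: 20 values, no order information."""
    residues = _clean(seq)
    n = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair: 400 values, local order kept."""
    residues = _clean(seq)
    pairs = [a + b for a, b in zip(residues, residues[1:])]
    n = len(pairs) or 1
    return [pairs.count(a + b) / n for a, b in product(AMINO_ACIDS, repeat=2)]


def augmented_features(seq):
    """Concatenate both descriptors into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)


def five_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for plain, unstratified k-fold CV."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test
```

Each vector returned by `augmented_features` could then be passed to any of the classifiers listed above, with `five_fold_indices` deciding which samples are held out in each round. A stratified split would likely be preferable when the location classes are imbalanced.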

Availability of data and material

The Gram-Positive benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/, and the Gram-Negative benchmark dataset is available at http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/ (both as of 20 August 2020).

Author information

Authors and Affiliations

Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
Saurabh Agrawal, Dilip Singh Sisodia & Naresh Kumar Nagwani

Corresponding author

Correspondence to Saurabh Agrawal.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (CSV 83 KB)

Supplementary file2 (CSV 1531 KB)

Supplementary file3 (CSV 27 KB)

Supplementary file4 (CSV 310 KB)

About this article

Cite this article

Agrawal, S., Sisodia, D.S. & Nagwani, N.K. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 59, 2297–2310 (2021). https://doi.org/10.1007/s11517-021-02436-5

Received: 08 November 2020
Accepted: 29 August 2021
Published: 20 September 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11517-021-02436-5
