Abstract
Accurate identification of protein-nucleotide binding residues is crucial for the study of drug structure and for protein functional annotation. Identifying these residues is a typical class-imbalance problem: the minority class (binding residues) is far smaller than the majority class (non-binding residues). Traditional machine learning algorithms do not handle this setting well, because their predictions become heavily biased toward the majority class. To deal with this severe imbalance, we propose a new computational method that identifies protein-nucleotide binding residues via a Graph Regularized k-local Hyperplane Distance Nearest Neighbor (GHKNN) model. On the training set, we compare the performance of the basic classifier, the ensemble classifier and the single classifier. On the independent test sets, we compare its performance with that of other existing models. The experimental results show that the proposed method achieves higher accuracy in identifying protein-nucleotide binding residues and outperforms other existing models. The data and material are freely available at https://github.com/guofei-tju/GHKNN.
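For orientation, the sketch below illustrates the plain k-local hyperplane distance nearest neighbor (HKNN) idea that GHKNN builds on. It is not the authors' graph-regularized model, and the graph regularization term is omitted entirely. The function name `hknn_predict`, the parameters `k` and `lam`, and the toy data are assumptions made for illustration. Because neighbors are selected separately within each class, a small minority class still gets its own local hyperplane, which is one plausible reason this family of methods suits imbalanced data.

```python
import numpy as np

def hknn_predict(X_train, y_train, x, k=3, lam=1.0):
    """Assign x to the class whose local hyperplane is closest (plain HKNN sketch)."""
    best_label, best_dist = None, np.inf
    for label in np.unique(y_train):
        Xc = X_train[y_train == label]
        # k nearest neighbors of x taken within this class only
        order = np.argsort(np.linalg.norm(Xc - x, axis=1))[:k]
        nbrs = Xc[order]
        centroid = nbrs.mean(axis=0)
        V = (nbrs - centroid).T  # columns span the local hyperplane
        r = x - centroid
        # ridge-penalized least squares for the hyperplane coefficients
        alpha = np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ r)
        dist = np.linalg.norm(r - V @ alpha)  # residual distance to the hyperplane
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy imbalanced data (hypothetical): six majority points, two minority points.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8],
              [5, 5], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
print(hknn_predict(X, y, np.array([5.4, 5.1])))
```

The penalty `lam` keeps the linear system well conditioned and stops the hyperplane from extending arbitrarily far from the neighbors that define it; in practice both `k` and `lam` would need tuning on the training set.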
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC 61902271, 61772362 and 61972280), the Natural Science Research of Jiangsu Higher Education Institutions of China (19KJB520014) and the National Key R&D Program of China (2020YFA0908400).
The authors would like to thank Professor Dong-jun Yu for providing the dataset, which helped improve the quality of this paper.
Ethics declarations
Conflict of Interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yijie Ding and Chao Yang contributed equally to this work and are joint first authors.
About this article
Cite this article
Ding, Y., Yang, C., Tang, J. et al. Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. Appl Intell 52, 6598–6612 (2022). https://doi.org/10.1007/s10489-021-02737-0
DOI: https://doi.org/10.1007/s10489-021-02737-0