Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model

Applied Intelligence

Abstract

Accurate identification of protein-nucleotide binding residues is crucial for the study of drug structure and for protein functional annotation. Identifying these residues is a typical class-imbalance problem: the minority class (binding residues) is far smaller than the majority class (non-binding residues). Traditional machine learning algorithms do not handle this setting well, and their predictions are strongly biased toward the majority class. To deal with this severe imbalance, we propose a new computational method that identifies protein-nucleotide binding residues via a Graph Regularized k-local Hyperplane Distance Nearest Neighbor model (GHKNN). On the training set, we compare the performance of the basic classifier, the ensemble classifier and the single classifier. On the independent test sets, we compare GHKNN with other existing models. The experimental results show that the proposed method identifies protein-nucleotide binding residues more accurately than other existing models. The data and material are freely available at https://github.com/guofei-tju/GHKNN.
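
The abstract does not give the GHKNN formulation, so the following is only a minimal sketch of the plain k-local hyperplane distance nearest neighbor classifier (HKNN) of Vincent and Bengio (2002), on which the proposed model builds. It omits the graph regularization term entirely and is not the authors' implementation. The function name `hknn_predict`, the parameters `k` and `lam`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def hknn_predict(X_train, y_train, x, k=3, lam=1.0):
    """Classify x by its penalized distance to each class's local hyperplane.

    Illustrative plain HKNN only (no graph regularization); k and lam are
    hypothetical hyperparameter names chosen for this sketch.
    """
    best_label, best_dist = None, np.inf
    for label in np.unique(y_train):
        Xc = X_train[y_train == label]
        kk = min(k, len(Xc))
        # k nearest neighbors of x taken from this class only
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:kk]
        N = Xc[idx]
        centroid = N.mean(axis=0)
        V = (N - centroid).T            # columns span the local hyperplane
        r = x - centroid
        # Ridge-penalized coordinates: (V^T V + lam I) alpha = V^T r
        alpha = np.linalg.solve(V.T @ V + lam * np.eye(kk), V.T @ r)
        resid = r - V @ alpha
        dist = resid @ resid + lam * (alpha @ alpha)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy data: two well-separated classes in 2D (made up for this example)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
pred_a = hknn_predict(X, y, np.array([0.2, 0.1]))
pred_b = hknn_predict(X, y, np.array([5.1, 5.2]))
```

Because the distance is computed separately for each class from that class's own neighbors, the majority class cannot outvote the minority class the way it can in plain k-NN voting, which is one reason this family of classifiers is considered for imbalanced data.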

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC 61902271, 61772362 and 61972280), the Natural Science Research of Jiangsu Higher Education Institutions of China (19KJB520014) and the National Key R&D Program of China (2020YFA0908400).

The authors would like to thank Professor Dong-jun Yu for providing the dataset, which helped improve the quality of this paper.

Author information

Corresponding authors

Correspondence to Yijie Ding or Fei Guo.

Ethics declarations

Conflict of Interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yijie Ding and Chao Yang contributed equally to this work and are joint first authors.

About this article

Cite this article

Ding, Y., Yang, C., Tang, J. et al. Identification of protein-nucleotide binding residues via graph regularized k-local hyperplane distance nearest neighbor model. Appl Intell 52, 6598–6612 (2022). https://doi.org/10.1007/s10489-021-02737-0

  • DOI: https://doi.org/10.1007/s10489-021-02737-0
