Skip to main content
Log in

Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information

  • Original Article
  • Published:
Neural Computing and Applications Aims and scope Submit manuscript

Abstract

RNA-binding proteins play an important role in the biological process. However, the traditional experiment technology to predict RNA-binding residues is time-consuming and expensive, so the development of an effective computational approach can provide a strategy to solve this issue. In recent years, most of the computational approaches are constructed on protein sequence information, but the protein structure has not been considered. In this paper, we use a novel computational model of RNA-binding residues prediction, using protein sequence and structure information. Our hybrid features are encoded by local sequence and structure feature extraction models. Our predictor is built by employing the Granular Multiple Kernel Support Vector Machine with Repetitive Under-sampling (GMKSVM-RU). In order to evaluate our method, we use fivefold cross-validation on the RBP129, our method achieves better experimental performance with MCC of 0.3367 and accuracy of 88.84%. In order to further evaluate our model, an independent data set (RBP60) is employed, and our method achieves MCC of 0.3921 and accuracy of 87.52%. Above results demonstrate that integrating sequence and structure information is beneficial to improve the prediction ability of RNA-binding residues.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

Availability of data and materials

The data sets, codes and corresponding results are available at https://github.com/guofei-tju/GMKSVM.

References

  1. Chen Y, Varani G (2005) Protein families and RNA recognition. FEBS J 272(9):2088–2097

    Google Scholar 

  2. Glisovic T, Bachorik JL, Yong J, Dreyfuss G (2008) RNA-binding proteins and post-transcriptional gene regulation. FEBS Lett 582(14):1977–1986

    Google Scholar 

  3. Ding Y, Tang J, Guo F (2020) Identification of drug-target interactions via dual Laplacian regularized least squares with multiple kernel fusion. Knowl Based Syst 204:106254

    Google Scholar 

  4. Ding Y, Tang J, Guo F (2020) Human protein subcellular localization identification via fuzzy model on kernelized neighborhood representation. Appl Soft Comput 96:106596

    Google Scholar 

  5. Zou Y, Wu H, Guo X et al (2020) K-FSVM-SVDD: a multiple kernel-based Fuzzy SVM model for predicting DNA-binding proteins via support vector data description. Curr Bioinform. https://doi.org/10.2174/1574893615999200607173829

    Article  Google Scholar 

  6. Ding Y, Tang J, Guo F (2019) Protein crystallization identification via fuzzy model on linear neighborhood representation. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2019.2954826

    Article  Google Scholar 

  7. Ding Y, Tang J, Guo F (2019) Identification of drug-side effect association via semisupervised model and multiple kernel learning. IEEE J Biomed Health Inform 23(6):2619–2632

    Google Scholar 

  8. Ding Y, Tang J, Guo F (2019) Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing 325:211–224

    Google Scholar 

  9. Ding Y, Tang J, Guo F (2017) Identification of drug-target interactions via multiple information integration. Inf Sci 418:546–560

    Google Scholar 

  10. Ding Y, Tang J, Guo F (2019) Identification of drug-target interactions via fuzzy bipartite local model. Neural Comput Appl 418:1–17

    Google Scholar 

  11. Wang H, Ding Y, Tang J, Guo F (2020) Identification of membrane protein types via multivariate information fusion with Hilbert–Schmidt independence criterion. Neurocomputing 383:257–269

    Google Scholar 

  12. Zhang J, Zhang Z, Pu L et al (2019) AIEpred: an ensemble predictive model of classifier chain to identify anti-inflammatory peptides. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2020.2968419

    Article  Google Scholar 

  13. Kurgan L, Razib AA, Aghakhani S (2009) Meta prediction of protein crystallization propensity. BMC Struct Biol 9(1):50

    Google Scholar 

  14. Mizianty MJ, Kurgan L (2009) CRYSTALP2: sequence-based protein crystallization propensity prediction. Biochem Biophys Res Commun 390:10

    Google Scholar 

  15. Yang J, Roy A, Zhang Y (2013) Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 29(20):2588–2595

    Google Scholar 

  16. Chen K, Mizianty MJ, Kurgan L (2012) Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics 28(3):331–341

    Google Scholar 

  17. Yu DJ, Hu J, Huang Y, Shen HB, Qi Y, Tang ZM, Yang JY (2013) TargetATPsite: a template-free method for ATP-binding sites prediction with residue evolution image sparse representation and classifier ensemble. J Comput Chem 34(11):974–985

    Google Scholar 

  18. Yu DJ, Hu J, Tang ZM, Shen HB, Yang J, Yang JY (2013) Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing 104:180–190

    Google Scholar 

  19. Zhu YH, Hu J, Song XN, Yu DJ (2019) DNAPred: accurate identification of dna-binding sites from protein sequence by ensembling hyperplane-distance-based support vector machines. J Chem Inf Model 59(6):3057–3071

    Google Scholar 

  20. Kumar M, Gromiha MM, Raghava GPS (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins 71(1):189–194

    Google Scholar 

  21. Spriggs RV, Murakami Y, Nakamura H, Jones S (2009) Protein function annotation from sequence: prediction of residues interacting with RNA. Bioinformatics 25(12):1492–1497

    Google Scholar 

  22. Wang C, Fang Y, Xiao J, Li M (2011) Identification of RNA-binding sites in proteins by integrating various sequence information. Amino Acids 40(1):239–248

    Google Scholar 

  23. Wang L, Huang C, Yang MQ, Yang JY (2010) BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst Biol 4(S1):S3

    Google Scholar 

  24. Xiong D, Zeng J, Gong H (2015) RBRIdent: an algorithm for improved identification of RNA-binding residues in proteins from primary sequences. Proteins 83(6):1068–1077

    Google Scholar 

  25. Tang Y, Liu D, Wang Z, Wen T, Deng L (2017) A boosting approach for prediction of protein-RNA binding residues. BMC Bioinform 18(13):465

    Google Scholar 

  26. Lewis BA, Walia RR, Terribilini M, Ferguson J, Zheng C, Honavar V, Dobbs D (2010) PRIDB: a protein-RNA interface database. Nucleic Acids Res 39(suppl-1):D277–D282

    Google Scholar 

  27. Walia RR, Xue LC, Wilkins K, El-Manzalawy Y, Dobbs D, Honavar V (2014) RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLoS ONE 9(5):e97725

    Google Scholar 

  28. Miao Z, Westhof E (2015) Prediction of nucleic acid binding probability in proteins: a neighboring residue network based score. Nucleic Acids Res 43(11):5340–5351

    Google Scholar 

  29. Miao Z, Westhof E (2015) A large-scale assessment of nucleic acids binding site prediction programs. PLoS Comput Biol 11(12):e1004639

    Google Scholar 

  30. Terribilini M, Lee J-H, Yan C, Jernigan RL, Honavar V, Dobbs D (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA 12(8):1450–1462

    Google Scholar 

  31. Cheng C-W, Su EC-Y, Hwang J-K, Sung T-Y, Hsu W-L (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinform 9(12):S6

    Google Scholar 

  32. Liu Z-P, Wu L-Y, Wang Y, Zhang X-S, Chen L (2010) Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics 26(13):1616–1622

    Google Scholar 

  33. Yang X, Wang J, Sun J, Liu R (2015) Snbrfinder: a sequence-based hybridalgorithm for enhanced prediction of nucleic acid-binding residues. PLoS ONE 10(7):0133260

    Google Scholar 

  34. Kim OT, Yura K, Go N (2006) Amino acid residue doublet propensity in the protein-RNA interface and its application to RNA interface prediction. Nucleic Acids Res 34(22):6450–6460

    Google Scholar 

  35. Chen YC, Lim C (2008) Predicting RNA-binding sites from the protein structure based on electrostatics, evolution and geometry. Nucleic Acids Res 34:e29

    Google Scholar 

  36. Towfic F, Caragea C, Gemperline DC, Dobbs D, Honavar V (2010) Struct-NB: predicting protein-RNA binding sites using structural features. Int J Data Min Bioinform 4:21–43

    Google Scholar 

  37. Yang XX, Deng ZL, Liu R (2014) RBRDetector: improved prediction of binding residues on RNA-binding protein structures using complementary feature- and template-based strategies. Proteins 82:2455–2471

    Google Scholar 

  38. Maetschke SR, Yuan Z (2009) Exploiting structural and topological information to improve prediction of RNA-protein binding sites. BMC Bioinform 10:341

    Google Scholar 

  39. Katchalski-Katzir E, Shariv I, Eisenstein M, Friesem AA, Aflalo C, Vakser IA (1992) Molecular surface recognition: determination of geometric fit between proteins and their ligands by correlation techniques. Proc Natl Acad Sci USA 89:2195–2199

    Google Scholar 

  40. Gabb HA, Jackson RM, Sternberg MJ (1997) Modelling protein docking using shape complementarity, electrostatics and biochemical information. J Mol Biol 272:106–120

    Google Scholar 

  41. Ritchie DW, Kemp GJ (2000) Protein docking using spherical polar Fourier correlations. Proteins 39:178–194

    Google Scholar 

  42. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28:235–242

    Google Scholar 

  43. Limin F, Beifang N, Zhengwei Z, Sitao W, Weizhong L (2012) CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics 28(23):3150–3152

    Google Scholar 

  44. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659

    Google Scholar 

  45. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410

    Google Scholar 

  46. Gish W, States DJ (1993) Identification of protein coding regions by database similarity search. Nat Genet 3(3):266–272

    Google Scholar 

  47. Allers J, Shamoo Y (2001) Structure-based analysis of protein-RNA interactions using the program ENTANGLE. J Mol Biol 311:75–86

    Google Scholar 

  48. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410

    Google Scholar 

  49. Joosten RP, Te B, Tim AH, Krieger E, Hekkelman ML, Hooft RWW, Schneider R, Sander C, Vriend G (2010) A series of PDB related databases for everyday needs. Nucleic Acids Res 39(suppl-1):D411–D419

    Google Scholar 

  50. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12):2577–2637

    Google Scholar 

  51. Guo F, Zou Q, Yang G, Wang D, Tang J, Xu J (2019) Identifying protein-protein interface via a novel multi-scale local sequence and structural representation. BMC Bioinform 20(15):1–11

    Google Scholar 

  52. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

    MATH  Google Scholar 

  53. Tang Y, Zhang YQ, Chawla NV, Krasser S (2009) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern Part B (Cybern) 39:281–288

    Google Scholar 

  54. Tang Y, Zhang YQ (2006) Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. In: IEEE international conference on granular computing, pp 457–460

  55. Ding Y, Tang J, Guo F (2017) Identification of protein-ligand binding sites by sequence information and ensemble classifier. J Chem Inf Model 57(12):3149–3161

    Google Scholar 

Download references

Acknowledgements

This work is supported by a grant from the National Natural Science Foundation of China (NSFC 61902271, 61772362 and 61972280), and National Key R&D Program of China (2020YFA0908401, 2020YFA0908400, 2018YFC0910405, 2017YFC0908400), and the Natural Science Research of Jiangsu Higher Education Institutions of China (19KJB520014).

Authors also thank professor Gong, Haipeng for kindly providing data sets on his website.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Yijie Ding or Fei Guo.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yang, C., Ding, Y., Meng, Q. et al. Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information. Neural Comput & Applic 33, 11387–11399 (2021). https://doi.org/10.1007/s00521-020-05573-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-020-05573-4

Keywords

Navigation