Abstract
RNA-binding proteins play important roles in many biological processes. However, traditional experimental techniques for identifying RNA-binding residues are time-consuming and expensive, so an effective computational approach offers a practical alternative. Most recent computational approaches are built on protein sequence information alone and do not consider protein structure. In this paper, we propose a novel computational model for predicting RNA-binding residues that uses both protein sequence and structure information. Hybrid features are encoded by local sequence and structure feature extraction models, and the predictor is built with the Granular Multiple Kernel Support Vector Machine with Repetitive Under-sampling (GMKSVM-RU). Under fivefold cross-validation on the RBP129 data set, our method achieves a Matthews correlation coefficient (MCC) of 0.3367 and an accuracy of 88.84%. On an independent data set (RBP60), it achieves an MCC of 0.3921 and an accuracy of 87.52%. These results demonstrate that integrating sequence and structure information improves the prediction of RNA-binding residues.
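Because binding residues are typically a small minority of all residues, accuracy alone can look high even for a predictor that finds no binding sites, which is why MCC is reported alongside it. The following is a minimal sketch of how the two metrics are computed from confusion-matrix counts. The function names and the residue counts are hypothetical illustrations and are not taken from the paper or its repository.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of residues whose label is predicted correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention for
    the case where a whole row or column of the matrix is empty).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced test set: 1000 residues, 100 of them binding.
# Predictor A labels every residue as non-binding.
always_negative = dict(tp=0, tn=900, fp=0, fn=100)
# Predictor B recovers 40 binding residues at the cost of 30 false alarms.
partial_recall = dict(tp=40, tn=870, fp=30, fn=60)

for name, counts in [("A", always_negative), ("B", partial_recall)]:
    print(name, round(accuracy(**counts), 4), round(mcc(**counts), 4))
```

In this toy case the two predictors have nearly the same accuracy, yet only the second has a positive MCC, so MCC is the more informative of the two figures on imbalanced residue-level data.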
Availability of data and materials
The data sets, code and corresponding results are available at https://github.com/guofei-tju/GMKSVM.
Acknowledgements
This work is supported by grants from the National Natural Science Foundation of China (NSFC 61902271, 61772362 and 61972280), the National Key R&D Program of China (2020YFA0908401, 2020YFA0908400, 2018YFC0910405, 2017YFC0908400), and the Natural Science Research of Jiangsu Higher Education Institutions of China (19KJB520014).
The authors also thank Professor Haipeng Gong for kindly providing data sets on his website.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Cite this article
Yang, C., Ding, Y., Meng, Q. et al. Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information. Neural Comput & Applic 33, 11387–11399 (2021). https://doi.org/10.1007/s00521-020-05573-4
DOI: https://doi.org/10.1007/s00521-020-05573-4