DOI: 10.1145/3405758.3405764
research-article

Study of Data Imbalanced Problem in Protein-peptide Binding Prediction

Published: 10 July 2020

ABSTRACT

Peptide-binding proteins are abundant in living cells, and protein-peptide interactions mediate a wide range of cellular functions. Predicting protein-peptide binding residues has been an important and active research topic over the past decades, and machine learning methods have gained attention in recent years. However, the data imbalance problem has not been dealt with effectively. To address this, we study how sampling methods and the degree of class imbalance affect the construction of prediction models. We first developed the NearMiss under-sampling method (NMUS) to select a given number of quality samples from the majority class and thereby balance the data sets. The resulting sensitivity (SEN) of 0.818 shows the advantage of NMUS in handling the class imbalance problem. This research provides an analysis of the data imbalance problem and achieves better prediction of protein-peptide binding interactions.
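The abstract does not give the details of NMUS, so the following is only a minimal sketch of the general idea behind NearMiss-style under-sampling, assuming the NearMiss-1 heuristic (keep the majority samples whose mean distance to their k nearest minority samples is smallest). The function names, the toy 2-D feature vectors, and the choice of Euclidean distance are illustrative assumptions, not the authors' implementation. A small helper for sensitivity, TP / (TP + FN), is included because that is the metric the abstract reports.

```python
import math


def near_miss_1(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples closest to the minority class.

    Assumption: NearMiss-1 scoring, i.e. the mean Euclidean distance from a
    majority sample to its k nearest minority samples. The paper's NMUS
    variant may differ.
    """
    k = min(k, len(minority))
    scored = []
    for idx, point in enumerate(majority):
        dists = sorted(math.dist(point, m) for m in minority)
        scored.append((sum(dists[:k]) / k, idx))
    scored.sort()  # smallest mean distance first; index breaks ties
    return [majority[idx] for _, idx in scored[:n_keep]]


def sensitivity(y_true, y_pred):
    """Sensitivity (recall on the positive class): TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 2 binding (minority) vs 6 non-binding (majority) residues.
binding = [(0.0, 0.0), (1.0, 0.0)]
non_binding = [(0.5, 0.5), (5.0, 5.0), (0.0, 1.0), (9.0, 9.0), (2.0, 2.0), (7.0, 3.0)]

# Under-sample the majority class down to the size of the minority class.
balanced_non_binding = near_miss_1(non_binding, binding, n_keep=len(binding), k=2)
```

In this sketch the balanced training set would be `binding + balanced_non_binding`. Choosing `n_keep` larger than the minority size leaves some imbalance in place, which is one way to vary the degree of imbalance that the abstract mentions studying.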

Published in

ICBBT '20: Proceedings of the 2020 12th International Conference on Bioinformatics and Biomedical Technology
May 2020
163 pages
ISBN: 9781450375719
DOI: 10.1145/3405758

Copyright © 2020 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• Research
• Refereed limited