DOI: 10.1145/1854776.1854797
research-article

Random forest-based prediction of protein sumoylation sites from sequence features

Published: 02 August 2010

ABSTRACT

Protein sumoylation plays essential roles in the eukaryotic cell, and alterations in this process may cause various human diseases. This paper describes a new machine learning approach for predicting sumoylation sites from protein sequence information. Random Forests (RFs), which can handle a large number of input variables and help avoid model overfitting, were trained with data collected from the literature. To construct accurate classifiers, forty sequence features were selected for input vector encoding. The results suggested that RF classifier performance was affected by the sequence context of sumoylation sites, and that using eighteen residues with the core motif ψKXE in the middle gave the highest performance (ROC AUC = 0.9328). The RF classifiers were also found to outperform support vector machine (SVM) models on the same dataset. Thus, the RF algorithm appears to be the best choice for accurate prediction of protein sumoylation sites from sequence features.
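
As a rough illustration of the sequence context mentioned in the abstract, the sketch below extracts an eighteen-residue window around each lysine so that the four positions ψKXE sit in the middle (seven flanking residues on each side). This is only one reading of the abstract, not the authors' code: the function name `candidate_windows`, the padding character, and the residue set used for ψ are all assumptions made for the example, and the forty sequence features used for encoding are not reproduced here.

```python
# Illustrative sketch only; names and the psi residue set are assumptions,
# not taken from the paper.
HYDROPHOBIC = set("AVILMFP")  # assumed definition of psi (hydrophobic residue)
PAD = "-"                     # assumed filler for windows that run off the ends


def candidate_windows(seq, flank=7):
    """Yield (lysine_index, window, has_motif) for every lysine in seq.

    The window spans psi-K-X-E plus `flank` residues on each side,
    so its length is 2 * flank + 4 (18 when flank is 7).
    """
    # Pad so that every lysine, even at a terminus, gets a full-length window.
    padded = PAD * (flank + 1) + seq + PAD * (flank + 2)
    width = 2 * flank + 4
    for i, residue in enumerate(seq):
        if residue != "K":
            continue
        # In padded coordinates the window starts exactly at index i.
        window = padded[i:i + width]
        # Inside the window: psi at `flank`, K at `flank + 1`, E at `flank + 3`.
        has_motif = window[flank] in HYDROPHOBIC and window[flank + 3] == "E"
        yield i, window, has_motif


if __name__ == "__main__":
    for index, window, has_motif in candidate_windows("AAAAAAAALKSEAAAAAAAA"):
        print(index, window, has_motif)
```

Each window could then be encoded as a numeric feature vector and passed to a classifier; whether a lysine matches the consensus motif is reported separately here so that non-consensus sites are not discarded.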


          Published in

          BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology
          August 2010
          705 pages
          ISBN: 9781450304382
          DOI: 10.1145/1854776

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 254 of 885 submissions, 29%
