DOI: 10.1145/1854776.1854797

Random forest-based prediction of protein sumoylation sites from sequence features

Published: 02 August 2010

Abstract

Protein sumoylation plays essential roles in the eukaryotic cell, and alterations in this process may cause various human diseases. This paper describes a new machine learning approach for predicting sumoylation sites from protein sequence information. Random Forests (RFs), which can handle a large number of input variables and avoid model overfitting, were trained with data collected from the literature. To construct accurate classifiers, forty sequence features were selected for input vector encoding. The results suggested that RF classifier performance was affected by the sequence context of sumoylation sites, and that the use of eighteen residues with the core motif ψKXE in the middle gave the highest performance (ROC AUC = 0.9328). The RF classifiers were also found to outperform support vector machine (SVM) models on the same dataset. Thus, the RF algorithm appears to be the best choice for accurate prediction of protein sumoylation sites from sequence features.
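As a rough illustration of two ideas mentioned in the abstract, the sketch below extracts an eighteen-residue window with the ψKXE core in the middle and computes a ROC AUC from classifier scores. It is not the authors' implementation. The window layout (seven flanking residues on each side of the four-residue core), the padding symbol, the residue set used for ψ, and all function names are assumptions made for this example. The forty sequence features and the Random Forest training itself are not shown.

```python
# Illustrative sketch only. The window layout, padding character, and the
# residue set standing in for ψ are assumptions, not taken from the paper.
HYDROPHOBIC = set("ILVMF")   # assumed definition of ψ (large hydrophobic residue)
FLANK = 7                    # 7 + 4 (ψKXE) + 7 = 18 residues
PAD = "-"                    # hypothetical padding symbol for sequence ends


def has_core_motif(seq, k):
    """Return True if the lysine at index k sits in a ψKXE motif."""
    return (
        k >= 1
        and k + 2 < len(seq)
        and seq[k] == "K"
        and seq[k - 1] in HYDROPHOBIC
        and seq[k + 2] == "E"
    )


def window_around(seq, k, flank=FLANK):
    """Return the window with ψKXE in the middle, padded at sequence ends."""
    start, end = k - 1 - flank, k + 2 + flank  # inclusive bounds
    return "".join(
        seq[i] if 0 <= i < len(seq) else PAD for i in range(start, end + 1)
    )


def candidate_windows(seq):
    """Yield (lysine_index, window) for every lysine inside a ψKXE motif."""
    for k, residue in enumerate(seq):
        if residue == "K" and has_core_motif(seq, k):
            yield k, window_around(seq, k)


def roc_auc(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a pipeline of the kind the abstract describes, each extracted window would then be encoded as a numeric feature vector and passed to a classifier, and the classifier's scores on held-out sites would be summarized with a measure such as `roc_auc`.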

Published In

BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology
August 2010
705 pages
ISBN: 9781450304382
DOI: 10.1145/1854776

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. evolutionary information
2. protein sumoylation site prediction
3. random forest
4. sequence features
5. support vector machine

Acceptance Rates

Overall acceptance rate: 254 of 885 submissions (29%)

Cited By

  • (2022) Neural Network and Random Forest Models in Protein Function Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19:3 (1772-1781). DOI: 10.1109/TCBB.2020.3044230. Online publication date: 1-May-2022
  • (2020) Towards combining data prediction and internet of things to manage milk production on dairy cows. Computers and Electronics in Agriculture 169 (105156). DOI: 10.1016/j.compag.2019.105156. Online publication date: Feb-2020
  • (2014) Combining techniques for screening and evaluating interaction terms on high-dimensional time-to-event data. BMC Bioinformatics 15:1. DOI: 10.1186/1471-2105-15-58. Online publication date: 26-Feb-2014
  • (2011) Win percentage. Proceedings of the 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine (29-38). DOI: 10.1145/2147805.2147809. Online publication date: 1-Aug-2011
