
Random forest-based prediction of protein sumoylation sites from sequence features

Authors:

Liangjiang Wang

BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology

Pages 120 - 126

https://doi.org/10.1145/1854776.1854797

Published: 02 August 2010

Abstract

Protein sumoylation plays essential roles in the eukaryotic cell, and alterations in this process may cause various human diseases. This paper describes a new machine learning approach for predicting sumoylation sites from protein sequence information. Random Forests (RFs), which can handle a large number of input variables and avoid model overfitting, were trained with data collected from the literature. To construct accurate classifiers, forty sequence features were selected for input vector encoding. The results suggest that RF classifier performance is affected by the sequence context of sumoylation sites: using eighteen residues with the core motif ψKXE in the middle gave the highest performance (ROC AUC = 0.9328). The RF classifiers were also found to outperform support vector machine (SVM) models on the same dataset. Thus, the RF algorithm appears to be the best choice for accurate prediction of protein sumoylation sites from sequence features.
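
The windowing and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the eighteen-residue window consists of seven flanking residues on each side of the four-residue ψKXE core, that ψ can be approximated by the residue set `AILMFVP`, and that positions beyond the sequence termini are padded with `-`. The function names `lysine_windows` and `roc_auc` are hypothetical.

```python
# Illustrative sketch only; names and parameters are assumptions, not the paper's code.

HYDROPHOBIC = set("AILMFVP")  # assumed residue set for psi (large hydrophobic)
FLANK = 7                     # assumed layout: 7 + 4 (psi-K-X-E) + 7 = 18 residues
PAD = "-"                     # assumed filler for positions beyond the termini


def lysine_windows(seq, flank=FLANK, pad=PAD):
    """Yield (index, window, has_core_motif) for each lysine (0-based index)."""
    # Pad so that every lysine, even at a terminus, gets a full-length window.
    padded = pad * (flank + 1) + seq + pad * (flank + 2)
    size = 2 * flank + 4
    for i, residue in enumerate(seq):
        if residue != "K":
            continue
        # In padded coordinates the window starts at i; K sits at offset flank + 1.
        window = padded[i:i + size]
        has_core_motif = (window[flank] in HYDROPHOBIC
                          and window[flank + 3] == "E")
        yield i, window, has_core_motif


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    for index, window, motif in lysine_windows("AAAAAAAALKSEAAAAAAAA"):
        print(index, window, motif)
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a full pipeline, each window would be encoded with the selected sequence features and passed to a classifier, and the classifier's scores would be passed to a function like `roc_auc`. Those encoding and training steps are not shown here because the abstract does not specify them.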



Index Terms

1. Applied computing
  1. Life and medical sciences



Published In


BCB '10: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology

August 2010

705 pages

ISBN:9781450304382

DOI:10.1145/1854776

General Chairs: Aidong Zhang (SUNY at Buffalo), Mark Borodovsky (Georgia Tech)

Program Chairs: Gultekin Ozsoyoglu (Case Western Reserve University), Armin Mikler (University of North Texas)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article


Conference

BCB'10: ACM International Conference on Bioinformatics and Computational Biology

Sponsor: SIGBio

August 2-4, 2010

Niagara Falls, New York

Acceptance Rates

Overall Acceptance Rate 254 of 885 submissions, 29%
