ABSTRACT
Protein sumoylation plays essential roles in the eukaryotic cell, and alterations in this process may cause various human diseases. This paper describes a new machine learning approach for predicting sumoylation sites from protein sequence information. Random Forests (RFs), which can handle a large number of input variables and resist overfitting, were trained with data collected from the literature. To construct accurate classifiers, forty sequence features were selected for input vector encoding. The results suggest that RF classifier performance is affected by the sequence context of sumoylation sites, and that a window of eighteen residues with the core motif ψKXE in the middle gives the highest performance (ROC AUC = 0.9328). The RF classifiers were also found to outperform support vector machine (SVM) models on the same dataset. Thus, of the two algorithms compared, RF appears to be the better choice for accurate prediction of protein sumoylation sites from sequence features.
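The window encoding and the ROC AUC metric mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `lysine_windows`, `has_core_motif`, and `roc_auc` are hypothetical, the residue set used for ψ is an assumption, and so is the placement of the lysine at offset 8 of the 18-residue window (seven flanking residues on each side of the four-residue motif). The forty sequence features and the RF training step are not shown.

```python
# Assumption: psi is approximated by these large hydrophobic residues.
PSI = set("ILVMF")
# Assumption: 18-residue window = 8 residues before the lysine, the lysine,
# and 9 residues after it, which centers the 4-residue psi-K-X-E motif.
LEFT, RIGHT = 8, 9
PAD = "-"  # placeholder for positions beyond the sequence ends


def lysine_windows(seq, left=LEFT, right=RIGHT, pad=PAD):
    """Yield (0-based position, window) for every lysine in seq."""
    padded = pad * left + seq + pad * right
    for i, aa in enumerate(seq):
        if aa == "K":
            # seq[i] sits at padded[i + left], so the window starts at padded[i].
            yield i, padded[i : i + left + right + 1]


def has_core_motif(window, left=LEFT):
    """True when psi-K-X-E surrounds the lysine at offset `left`."""
    return (
        window[left - 1] in PSI
        and window[left] == "K"
        and window[left + 2] == "E"
    )


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (site) or 0 (non-site).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a pipeline of this kind, each window would be converted to a numeric feature vector and passed to a classifier, and the classifier's scores on held-out windows would be summarized with a function such as `roc_auc`.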
Index Terms
- Random forest-based prediction of protein sumoylation sites from sequence features