Abstract
An important goal of bioinformatics is to classify and predict the functions of proteins that have no sequence homolog of known function. This paper classifies protein function using multi-parametric features, without relying on sequence similarity. First, we propose a method for generating novel features that capture local information in a protein sequence based on its positively and negatively charged residues. Next, we introduce a process for building an optimal feature subset by combining traditional and novel features extracted from the protein sequence. Finally, we classify ligase enzymes with a support vector machine (SVM). In our experiment, feature selection retained 375 of the 483 features, and the classification accuracy for the fourth-level sub-classes of the Enzyme Commission (EC) number was 98.35%. These results indicate that most of the novel features are valuable for classifying specific enzyme functions.
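To make the idea of charge-based local features concrete, the following is a minimal sketch and not the authors' feature definition, which the abstract does not spell out. It assumes lysine and arginine (K, R) count as positively charged and aspartate and glutamate (D, E) as negatively charged, and it treats "local information" as residue fractions within equal-length sequence segments. The function name `charge_features` and the parameter `n_segments` are hypothetical.

```python
from typing import Dict

# Assumption: K/R are treated as positive and D/E as negative; histidine is
# left out because its charge depends on pH. The paper may define this differently.
POSITIVE = frozenset("KR")
NEGATIVE = frozenset("DE")


def _fraction(segment: str, residues: frozenset) -> float:
    """Fraction of residues in `segment` that belong to `residues`."""
    if not segment:
        return 0.0
    return sum(1 for aa in segment if aa in residues) / len(segment)


def charge_features(seq: str, n_segments: int = 3) -> Dict[str, float]:
    """Global and per-segment fractions of charged residues (illustrative only)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")

    features = {
        "pos_frac": _fraction(seq, POSITIVE),
        "neg_frac": _fraction(seq, NEGATIVE),
    }
    # Split the sequence into contiguous segments so that each feature
    # describes one local region rather than the whole chain.
    for i in range(n_segments):
        start = i * len(seq) // n_segments
        end = (i + 1) * len(seq) // n_segments
        segment = seq[start:end]
        features[f"seg{i}_pos_frac"] = _fraction(segment, POSITIVE)
        features[f"seg{i}_neg_frac"] = _fraction(segment, NEGATIVE)
    return features
```

Feature vectors of this kind could be concatenated with traditional sequence-derived features, reduced by a feature selection step, and passed to an SVM; those later stages are omitted here because the abstract gives no detail on how they were configured.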
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, B.J., Lee, H.G., Shin, M.S., Ryu, K.H. (2008). Classification of Ligase Function Based on Multi-parametric Feature Extracted from Protein Sequence. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2008. ICCSA 2008. Lecture Notes in Computer Science, vol 5073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69848-7_87
DOI: https://doi.org/10.1007/978-3-540-69848-7_87
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69840-1
Online ISBN: 978-3-540-69848-7
eBook Packages: Computer Science, Computer Science (R0)