Classification of Ligase Function Based on Multi-parametric Feature Extracted from Protein Sequence

  • Conference paper
Computational Science and Its Applications – ICCSA 2008 (ICCSA 2008)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5073)

Abstract

An important goal of bioinformatics is to classify and predict the functions of proteins that have no sequence homologs of known function. This paper classifies protein function using multi-parametric features, without relying on sequence similarity. First, we propose a method for generating novel features that represent various kinds of local information in a protein sequence, based on positively and negatively charged residues. Next, we introduce a process for building an optimal feature subset from a combination of traditional and novel features extracted from the protein sequence. Finally, we classify ligase enzymes with a support vector machine (SVM). In our experiment, feature selection retained only 375 of the 483 features, and the classification accuracy for the fourth-level sub-classes of the Enzyme Commission (EC) number was 98.35%. Our results demonstrate that most of the novel features are valuable for classifying specific enzyme functions.
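
As a rough illustration of what local features based on charged residues might look like, the sketch below splits a sequence into equal segments and reports the fraction of positively and negatively charged residues in each. The abstract does not define the paper's actual features, so the segmentation scheme, the residue groupings (K and R as positive, D and E as negative, histidine left out), and the function name `charge_features` are all assumptions made for this example. It is not the authors' method.

```python
# Hypothetical residue groupings; the paper's own definitions are not given here.
POSITIVE = set("KR")  # lysine, arginine (histidine omitted by assumption)
NEGATIVE = set("DE")  # aspartate, glutamate


def charge_features(sequence, n_segments=3):
    """Return [pos_frac, neg_frac] for each of n_segments equal slices."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    features = []
    for i in range(n_segments):
        # Integer arithmetic keeps the slices contiguous and non-overlapping.
        start = i * len(seq) // n_segments
        end = (i + 1) * len(seq) // n_segments
        segment = seq[start:end]
        length = max(len(segment), 1)  # avoid division by zero on short input
        pos = sum(1 for aa in segment if aa in POSITIVE)
        neg = sum(1 for aa in segment if aa in NEGATIVE)
        features.extend([pos / length, neg / length])
    return features


if __name__ == "__main__":
    print(charge_features("KKDDAA", n_segments=3))
```

In a pipeline like the one the abstract outlines, vectors of this kind would be combined with traditional sequence-derived features, reduced by feature selection, and then passed to an SVM classifier.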

Editor information

Osvaldo Gervasi, Beniamino Murgante, Antonio Laganà, David Taniar, Youngsong Mun, Marina L. Gavrilova

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, B.J., Lee, H.G., Shin, M.S., Ryu, K.H. (2008). Classification of Ligase Function Based on Multi-parametric Feature Extracted from Protein Sequence. In: Gervasi, O., Murgante, B., Laganà, A., Taniar, D., Mun, Y., Gavrilova, M.L. (eds) Computational Science and Its Applications – ICCSA 2008. ICCSA 2008. Lecture Notes in Computer Science, vol 5073. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69848-7_87

  • DOI: https://doi.org/10.1007/978-3-540-69848-7_87

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69840-1

  • Online ISBN: 978-3-540-69848-7

  • eBook Packages: Computer Science, Computer Science (R0)
