Abstract
String kernels are popular tools for analyzing protein sequence data and have been successfully applied to many computational biology problems. Traditional string kernels assume that different substrings are independent. However, substrings can be highly correlated because of their substructure relationships or shared physico-chemical properties. This paper proposes two kinds of weighted spectrum kernels: the correlation spectrum kernel and the AA spectrum kernel. Their performance is evaluated by predicting glycan-binding proteins for 12 glycans. The results show that the correlation spectrum kernel and the AA spectrum kernel perform significantly better than the spectrum kernel for nearly all of the 12 glycans. By comparing the predictive power of AA spectrum kernels constructed from different physico-chemical properties, the authors also identify the physico-chemical properties that contribute the most to glycan-protein binding. The results indicate that the physico-chemical properties of amino acids in proteins play an important role in the mechanism of glycan-protein binding.
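To make the independence assumption concrete, the following is a minimal Python sketch rather than the paper's implementation. The function `spectrum_kernel` is the standard k-spectrum kernel, an inner product of k-mer count vectors in which two different k-mers never contribute to each other. The function `weighted_spectrum_kernel` shows one generic way a weight between k-mers could be introduced. The names `weighted_spectrum_kernel`, `identity_weight`, and `hamming_similarity` are hypothetical, and the actual weights used by the correlation spectrum kernel and the AA spectrum kernel are not given in this abstract.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Standard k-spectrum kernel: inner product of the k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only identical k-mers contribute, i.e. different k-mers are independent.
    return sum(n * cy[kmer] for kmer, n in cx.items())


def weighted_spectrum_kernel(x, y, k, weight):
    """Generic weighted form (an assumption, not the paper's definition):
    sum over k-mer pairs (u, v) of count_x(u) * weight(u, v) * count_y(v)."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(nx * weight(u, v) * ny
               for u, nx in cx.items()
               for v, ny in cy.items())


def identity_weight(u, v):
    """Weight that recovers the plain spectrum kernel."""
    return 1.0 if u == v else 0.0


def hamming_similarity(u, v):
    """Hypothetical illustrative weight: fraction of matching positions."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


if __name__ == "__main__":
    print(spectrum_kernel("AAB", "AAC", 2))
    print(weighted_spectrum_kernel("AAB", "AAC", 2, hamming_similarity))
```

With `identity_weight` the weighted form reduces to the plain spectrum kernel. For the result to remain a valid kernel, the weight function itself has to be positive semidefinite over the set of k-mers, which holds for both illustrative weights here.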
Additional information
This research was supported in part by the Research Grants Council of Hong Kong under Grant No. 17301214, HKU CERG Grants, the Hung Hing Ying Physical Research Grant, the Research Funds of Renmin University of China, and the National Natural Science Foundation of China under Grant Nos. 11271144, 11101382, 11471256, and S201201009985.
This paper was recommended for publication by Editor ZOU Guohua.
Cite this article
Li, L., Aoki-Kinoshita, K.F., Ching, WK. et al. On using physico-chemical properties of amino acids in string kernels for protein classification via support vector machines. J Syst Sci Complex 28, 504–516 (2015). https://doi.org/10.1007/s11424-015-2156-y
DOI: https://doi.org/10.1007/s11424-015-2156-y