Abstract
We consider the problem of classifying string data faster and more accurately. This problem arises naturally in fields that analyze huge numbers of strings, such as computational biology. Our solution, a new string kernel that we call the gapped spectrum kernel, yields a sequence of kernels that interpolates between faster but less accurate string kernels, such as the spectrum kernel, and slower but more accurate ones, such as the wildcard kernel. As a result, we obtain an algorithm for computing the wildcard kernel that is provably faster than the state-of-the-art method. The recently introduced b-suffix array data structure plays an important role here. Another result is a better trade-off between the speed and accuracy of classification, which we demonstrate in a protein classification experiment.
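To make the two endpoints of this interpolation concrete, the following is a minimal sketch of naive reference implementations of the spectrum kernel and the wildcard kernel. It is not the algorithm of this paper: it does not use the b-suffix array, and it does not implement the gapped spectrum kernel, whose definition the abstract does not give. The function names, the wildcard symbol `*`, and the weight parameter `lam` are assumptions chosen for illustration; the wildcard feature map follows the commonly used formulation in which a pattern with j wildcards contributes weight `lam ** j`.

```python
from collections import Counter
from itertools import combinations


def spectrum_features(s: str, k: int) -> Counter:
    """Count every contiguous length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def wildcard_features(s: str, k: int, m: int, lam: float = 1.0) -> Counter:
    """Map s to weighted counts of k-mer patterns with at most m wildcards.

    Assumes the symbol '*' does not occur in s. Each k-mer contributes
    lam ** j to every pattern obtained by replacing j <= m of its
    positions with '*'.
    """
    feats = Counter()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        for j in range(m + 1):
            for positions in combinations(range(k), j):
                pattern = list(kmer)
                for p in positions:
                    pattern[p] = "*"
                feats["".join(pattern)] += lam ** j
    return feats


def inner_product(fx: Counter, fy: Counter) -> float:
    """Inner product of two sparse feature vectors stored as Counters."""
    return sum(weight * fy[key] for key, weight in fx.items())


def spectrum_kernel(x: str, y: str, k: int) -> float:
    return inner_product(spectrum_features(x, k), spectrum_features(y, k))


def wildcard_kernel(x: str, y: str, k: int, m: int, lam: float = 1.0) -> float:
    return inner_product(wildcard_features(x, k, m, lam),
                         wildcard_features(y, k, m, lam))
```

With `m = 0` the wildcard kernel in this sketch reduces to the spectrum kernel. Increasing `m` lets strings that differ in a few positions still share features, at the cost of enumerating many more patterns per k-mer, which illustrates the speed and accuracy trade-off the abstract refers to.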
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Onodera, T., Shibuya, T. (2013). The Gapped Spectrum Kernel for Support Vector Machines. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2013. Lecture Notes in Computer Science, vol 7988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39712-7_1
DOI: https://doi.org/10.1007/978-3-642-39712-7_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39711-0
Online ISBN: 978-3-642-39712-7
eBook Packages: Computer Science (R0)