Abstract
The support vector machine (SVM) is a powerful learning algorithm, e.g., for classification and clustering tasks, that also works for complex data structures such as strings, trees, lists and general graphs. It relies on a kernel function that measures scalar products between data units. For analyzing string data, Lodhi et al. (J Mach Learn Res 2:419–444, 2002) introduced the String Subsequence kernel (SSK). In this paper we propose an approximation to SSK that drops higher-order terms (i.e., subsequences which are spread out more than a certain threshold) and thereby reduces the computational burden of SSK. As we are also concerned with the practical application of complex kernels with high computational complexity and memory consumption, we provide an empirical model that predicts runtime and memory of both the approximation and the original SSK from easily measurable properties of the input data. We provide extensive results on the properties of the proposed approximation, SSK-LP, with respect to prediction accuracy, runtime and memory consumption. Using real-life datasets from text mining tasks, we show that models based on SSK and SSK-LP perform similarly, and that the empirical runtime model is also useful for roughly estimating the total learning time of an SVM using either kernel.
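To make the pruning idea concrete, the following is a minimal brute-force sketch of a subsequence kernel in the spirit of Lodhi et al. (2002), in which each length-n subsequence is weighted by a decay factor raised to the span it covers in the string. The optional `max_span` threshold is a hypothetical parameter introduced here only to illustrate "dropping subsequences which are spread out more than a certain threshold"; it is an assumption and may differ from the exact definition of SSK-LP in the paper. The function names are likewise illustrative, and the enumeration is exponential in n, so this is suitable for tiny strings only.

```python
from collections import defaultdict
from itertools import combinations


def ssk_features(s, n, lam, max_span=None):
    """Map each length-n subsequence u of s to the sum of lam ** span
    over all of its occurrences (n >= 1).

    max_span is a hypothetical pruning threshold (an assumption, not
    taken from the paper): occurrences whose span exceeds it are skipped.
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1  # length of s covered by this occurrence
        if max_span is not None and span > max_span:
            continue  # pruned: occurrence is too spread out
        u = "".join(s[i] for i in idx)
        feats[u] += lam ** span
    return feats


def ssk(s, t, n, lam, max_span=None):
    """Kernel value: inner product of the two (optionally pruned) feature maps."""
    fs = ssk_features(s, n, lam, max_span)
    ft = ssk_features(t, n, lam, max_span)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


if __name__ == "__main__":
    print(ssk("cat", "car", n=2, lam=0.5))
    print(ssk("cat", "cat", n=2, lam=0.5))
    print(ssk("cat", "cat", n=2, lam=0.5, max_span=2))
```

With `max_span=2` the occurrence "c-t" in "cat" (span 3) no longer contributes, so the pruned self-similarity of "cat" is smaller than the unpruned one, while the value for "cat" versus "car" is unchanged because their only shared subsequence "ca" is contiguous.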
References
Collins M, Duffy N (2002) Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (2001), pp 625–632. MIT Press, Cambridge
Cortes C, Haffner P, Mohri M (2004) Rational kernels: theory and algorithms. J Mach Learn Res 5: 1035–1062
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Frank E, Witten IH (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann/Elsevier, San Francisco/Amsterdam. http://www.cs.waikato.ac.nz/~ml/weka/
Haussler D (1999) Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Baskin School of Engineering, University of California, Santa Cruz
Joachims T (2002) Learning to classify text using support vector machines. Kluwer, Dordrecht
Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing 2002, pp 564–575
Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2: 419–444
Minsky M, Papert SA (1969) Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, expanded edition, reprinted 1988
Pillet V, Zehnder M, Seewald AK, Veuthey A-L, Petrak J (2005) GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 21: 1743–1744
Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge
Rousu J, Shawe-Taylor J (2005) Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6: 1323–1344
Seewald AK (2003) Recognizing domain and species from MEDLINE proteomics publications. Workshop on data mining and text mining for bioinformatics. In: 14th European conference on machine learning (ECML-2003), Dubrovnik-Cavtat, Croatia
Seewald AK (2004) Ranking for medical annotation: investigating performance, local search and homonymy recognition. In: Proceedings of the symposium on knowledge exploration in life science informatics (KELSI 2004), Milano, Italy
Seewald AK (2007) An evaluation of Naive Bayes variants in content-based learning for spam filtering. Intelligent Data Analysis 11(5): 497–524
Cite this article
Seewald, A.K., Kleedorfer, F. Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering. ADAC 1, 221–239 (2007). https://doi.org/10.1007/s11634-007-0012-1
DOI: https://doi.org/10.1007/s11634-007-0012-1