Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering

  • Regular Article
Advances in Data Analysis and Classification

Abstract

The support vector machine (SVM) is a powerful learning algorithm, e.g., for classification and clustering tasks, that works even for complex data structures such as strings, trees, lists and general graphs. It relies on a kernel function that computes scalar products between data units. For analyzing string data, Lodhi et al. (J Mach Learn Res 2:419–444, 2002) introduced the String Subsequence kernel (SSK). In this paper we propose an approximation to SSK that drops higher-order terms (i.e., subsequences which are spread out more than a certain threshold) and thereby reduces the computational burden of SSK. As we are also concerned with the practical application of complex kernels with high computational complexity and memory consumption, we provide an empirical model to predict the runtime and memory of both the approximation and the original SSK, based on easily measurable properties of the input data. We provide extensive results on the properties of the proposed approximation, SSK-LP, with respect to prediction accuracy, runtime and memory consumption. Using real-life datasets from text mining tasks, we show that models based on SSK and SSK-LP perform similarly, and that the empirical runtime model is also useful for roughly determining the total learning time of an SVM using either kernel.
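To make the pruning idea concrete, the following is a minimal brute-force sketch, not the authors' algorithm. It builds the explicit SSK feature map (each length-n subsequence weighted by a decay factor raised to the span it covers in the string) and, optionally, discards occurrences whose span exceeds a threshold. The function names, the `max_span` parameter and the exact pruning criterion are assumptions made for illustration; the abstract only states that subsequences spread out beyond a certain threshold are dropped. The enumeration is exponential in n and is only meant for very short strings.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def feature_map(s, n, lam, max_span=None):
    """Map each length-n subsequence of s to the sum of lam**span over its occurrences.

    max_span is a hypothetical pruning threshold: occurrences that stretch over
    more than max_span characters of s are skipped (assumed criterion).
    """
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1  # characters covered, gaps included
        if max_span is not None and span > max_span:
            continue  # pruned: too spread out
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** span
    return phi


def ssk(s, t, n, lam, max_span=None):
    """Unnormalized kernel: dot product of the two (optionally pruned) feature maps."""
    ps = feature_map(s, n, lam, max_span)
    pt = feature_map(t, n, lam, max_span)
    return sum(w * pt[u] for u, w in ps.items() if u in pt)


def normalized_ssk(s, t, n, lam, max_span=None):
    """Kernel value scaled so that a non-degenerate string has similarity 1 with itself."""
    kss = ssk(s, s, n, lam, max_span)
    ktt = ssk(t, t, n, lam, max_span)
    if kss == 0.0 or ktt == 0.0:
        return 0.0
    return ssk(s, t, n, lam, max_span) / sqrt(kss * ktt)
```

For the strings "cat" and "car" with n = 2, the only shared subsequence is "ca", so the unpruned kernel equals the fourth power of the decay factor. Setting `max_span=2` in this sketch keeps only contiguous pairs, which removes the gapped "ct" and "cr" terms from the self-similarities and therefore changes the normalized value.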

References

  • Collins M and Duffy N (2002). Convolution kernels for natural language. In: Dietterich, TG, Becker, S and Ghahramani, Z (eds) Advances in neural information processing systems 14 (2001), pp 625–632. MIT Press, Cambridge

  • Cortes C, Haffner P and Mohri M (2004). Rational kernels: theory and algorithms. J Mach Learn Res 5: 1035–1062

  • Cristianini N and Shawe-Taylor J (2000). An introduction to support vector machines. Cambridge University Press, Cambridge

  • Frank E, Witten IH (2005) Data mining—practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann/Elsevier, San Francisco/Amsterdam 2005. http://www.cs.waikato.ac.nz/~ml/weka/

  • Haussler D (1999). Convolution kernels on discrete structures. Technical Report UCSCCRL-99-10, Baskin School of Engineering. University of California, Santa Cruz

  • Joachims T (2002) Learning to classify text using support vector machines. Kluwer, Dordrecht

  • Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing 2002, pp 564–575

  • Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N and Watkins C (2002). Text classification using string kernels. J Mach Learn Res 2: 419–444

  • Minsky M, Papert SA (1969) Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, expanded edition, reprinted 1988

  • Pillet V, Zehnder M, Seewald AK, Veuthey A-L and Petrak J (2005). GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 2005(21): 1743–1744

  • Platt J (1998) Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Schölkopf B, Burges C, Smola A (eds). Advances in kernel methods—support vector learning. MIT Press, Cambridge

  • Rousu J and Shawe-Taylor J (2005). Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6: 1323–1344

  • Seewald AK (2003) Recognizing domain and species from MEDLINE proteomics publications. Workshop on data mining and text mining for bioinformatics. In: 14th European conference on machine learning (ECML-2003), Dubrovnik-Cavtat, Croatia

  • Seewald AK (2004) Ranking for medical annotation: investigating performance, local search and homonymy recognition. In: Proceedings of the symposium on knowledge exploration in life science informatics (KELSI 2004), Milano, Italy

  • Seewald AK (2007). An evaluation of Naive Bayes Variants in content-based learning for spam filtering. Intelligent Data Analysis 11(5): 497–524


Author information

Corresponding author

Correspondence to Alexander K. Seewald.

About this article

Cite this article

Seewald, A.K., Kleedorfer, F. Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering. ADAC 1, 221–239 (2007). https://doi.org/10.1007/s11634-007-0012-1

  • DOI: https://doi.org/10.1007/s11634-007-0012-1