Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering

Seewald, Alexander K.; Kleedorfer, Florian

doi:10.1007/s11634-007-0012-1

Regular Article
Published: 31 August 2007

Volume 1, pages 221–239, (2007)

Advances in Data Analysis and Classification

Alexander K. Seewald¹ &
Florian Kleedorfer²

Abstract

The support vector machine (SVM) is a powerful learning algorithm, e.g., for classification and clustering tasks, that works even for complex data structures such as strings, trees, lists and general graphs. It relies on a kernel function that computes scalar products between data items. For analyzing string data, Lodhi et al. (J Mach Learn Res 2:419–444, 2002) introduced the String Subsequence Kernel (SSK). In this paper we propose an approximation to SSK that drops higher-order terms (i.e., subsequences that are spread out over more than a certain threshold) and thereby reduces the computational burden of SSK. Because we are also concerned with the practical application of complex kernels with high computational complexity and memory consumption, we provide an empirical model that predicts the runtime and memory requirements of both the approximation and the original SSK from easily measurable properties of the input data. We provide extensive results on the properties of the proposed approximation, SSK-LP, with respect to prediction accuracy, runtime and memory consumption. Using real-life text mining datasets, we show that models based on SSK and SSK-LP perform similarly, and that the empirical runtime model is also useful for roughly estimating the total learning time of an SVM using either kernel.
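
To make the idea of dropping spread-out subsequences concrete, the following is a minimal brute-force sketch, not the authors' algorithm. It assumes the usual SSK weighting, in which an occurrence of a length-n subsequence spanning `span` characters contributes `lam ** span`. The abstract does not state the exact pruning criterion, so the `max_span` parameter and the function names `ssk_features` and `ssk` are hypothetical: an occurrence is simply skipped when its span exceeds `max_span`. The enumeration is exponential in n and is meant for short strings only.

```python
from collections import defaultdict
from itertools import combinations


def ssk_features(s, n, lam, max_span=None):
    """Map s to weighted counts of its length-n subsequences (n >= 1).

    Each occurrence at indices i_1 < ... < i_n contributes lam ** span,
    where span = i_n - i_1 + 1. If max_span is given, occurrences with a
    larger span are skipped (an assumed, illustrative pruning rule).
    """
    feats = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1
        if max_span is not None and span > max_span:
            continue  # drop spread-out (higher-order) occurrences
        feats["".join(s[i] for i in idx)] += lam ** span
    return feats


def ssk(s, t, n, lam, max_span=None):
    """Unnormalised kernel value: dot product of the two feature maps."""
    fs = ssk_features(s, n, lam, max_span)
    ft = ssk_features(t, n, lam, max_span)
    return sum(w * ft[u] for u, w in fs.items() if u in ft)


if __name__ == "__main__":
    lam = 0.5
    print(ssk("cat", "car", 2, lam))              # shared subsequence "ca"
    print(ssk("cat", "cat", 2, lam))              # "ca", "at" and gapped "ct"
    print(ssk("cat", "cat", 2, lam, max_span=2))  # gapped "ct" is pruned
```

With n = 2, "cat" and "car" share only the subsequence "ca", so the unpruned value is `lam ** 4`. Setting `max_span=2` removes the gapped occurrence "ct" from the self-similarity of "cat", which shows how pruning trades a small change in kernel value for fewer terms to evaluate.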

References

Collins M, Duffy N (2002) Convolution kernels for natural language. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (2001), pp 625–632. MIT Press, Cambridge
Cortes C, Haffner P, Mohri M (2004) Rational kernels: theory and algorithms. J Mach Learn Res 5:1035–1062
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines. Cambridge University Press, Cambridge
Frank E, Witten IH (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann/Elsevier, San Francisco/Amsterdam. http://www.cs.waikato.ac.nz/~ml/weka/
Haussler D (1999) Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Baskin School of Engineering, University of California, Santa Cruz
Joachims T (2002) Learning to classify text using support vector machines. Kluwer, Dordrecht
Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing 2002, pp 564–575
Lodhi H, Saunders C, Shawe-Taylor J, Cristianini N, Watkins C (2002) Text classification using string kernels. J Mach Learn Res 2:419–444
Minsky M, Papert SA (1969) Perceptrons: an introduction to computational geometry. MIT Press, Cambridge, expanded edition, reprinted 1988
Pillet V, Zehnder M, Seewald AK, Veuthey A-L, Petrak J (2005) GPSDB: a new database for synonyms expansion of gene and protein names. Bioinformatics 21:1743–1744
Platt J (1998) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge
Rousu J, Shawe-Taylor J (2005) Efficient computation of gapped substring kernels on large alphabets. J Mach Learn Res 6:1323–1344
Seewald AK (2003) Recognizing domain and species from MEDLINE proteomics publications. In: Workshop on data mining and text mining for bioinformatics, 14th European conference on machine learning (ECML-2003), Dubrovnik-Cavtat, Croatia
Seewald AK (2004) Ranking for medical annotation: investigating performance, local search and homonymy recognition. In: Proceedings of the symposium on knowledge exploration in life science informatics (KELSI 2004), Milano, Italy
Seewald AK (2007) An evaluation of Naive Bayes variants in content-based learning for spam filtering. Intelligent Data Analysis 11(5):497–524

Author information

Authors and Affiliations

Seewald Solutions, Leitermayergasse 33/24, 1180, Vienna, Austria
Alexander K. Seewald
Research Studios Austria, Smart Agent Techn., Hasnerstraße 123, 1160, Vienna, Austria
Florian Kleedorfer

Corresponding author

Correspondence to Alexander K. Seewald.

About this article

Cite this article

Seewald, A.K., Kleedorfer, F. Lambda pruning: an approximation of the string subsequence kernel for practical SVM classification and redundancy clustering. ADAC 1, 221–239 (2007). https://doi.org/10.1007/s11634-007-0012-1

Received: 28 September 2006
Revised: 11 July 2007
Accepted: 20 July 2007
Published: 31 August 2007
Issue Date: December 2007
DOI: https://doi.org/10.1007/s11634-007-0012-1
