Using String Kernels for Classification of Slovenian Web Documents

  • Conference paper
From Data and Information Analysis to Knowledge Engineering

Abstract

In this paper we present an approach for classifying web pages obtained from the Slovenian Internet directory, where web sites covering different topics are organized into a topic ontology. We tested two methods for representing text documents, both in combination with the linear SVM classification algorithm. The first is the standard bag-of-words representation with TFIDF weights and cosine similarity as the similarity measure. We compared this to String kernels, where text documents are compared not by words but by substrings. This removes the need for stemming or lemmatisation, which can be an important issue when documents are in languages other than English and tools for stemming or lemmatisation are unavailable or expensive to build or learn. In highly inflected natural languages, such as Slovene, the same word can take many different forms, which gives String kernels an advantage over bag-of-words. We show that, for classification of documents written in a highly inflected natural language, String kernels significantly outperform the standard bag-of-words representation. Our experiments also show that the advantage of String kernels is more evident for domains with an unbalanced class distribution.
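The contrast between word-level and substring-level comparison can be illustrated with a minimal sketch. It is not the kernel used in the paper: it assumes a simple character n-gram (spectrum) representation as a stand-in for String kernels, uses raw counts rather than TFIDF weights, and the two example phrases and all function names are hypothetical. It only shows why inflected forms that share no whole word can still share many substrings.

```python
from collections import Counter
from math import sqrt


def word_counts(text):
    """Bag-of-words: count whitespace-separated tokens (no stemming)."""
    return Counter(text.split())


def ngram_counts(text, n=3):
    """Count all contiguous character n-grams (assumed stand-in for a string kernel)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two hypothetical Slovene phrases that differ only in inflection.
doc_a = "nova knjiga"
doc_b = "nove knjige"

bow_sim = cosine(word_counts(doc_a), word_counts(doc_b))
spectrum_sim = cosine(ngram_counts(doc_a), ngram_counts(doc_b))

print(f"bag-of-words similarity: {bow_sim:.3f}")
print(f"character 3-gram similarity: {spectrum_sim:.3f}")
```

Without lemmatisation the two phrases have no word in common, so the bag-of-words similarity is zero, while the substring-based similarity stays well above zero because the stems overlap. A similarity matrix computed this way could, in principle, be passed to an SVM implementation that accepts precomputed kernels.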

Copyright information

© 2006 Springer Berlin · Heidelberg

About this paper

Cite this paper

Fortuna, B., Mladenič, D. (2006). Using String Kernels for Classification of Slovenian Web Documents. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_43
