Comparing and Combining Two Approaches to Automated Subject Classification of Text

Golub, Koraljka; Ardö, Anders; Mladenić, Dunja; Grobelnik, Marko

doi:10.1007/11863878_45

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4172)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

A machine-learning approach and a string-matching approach to automated subject classification of text were compared with respect to their performance, advantages and downsides. The former was based on an SVM algorithm, while the latter comprised string-matching between a controlled vocabulary and words in the text to be classified. The data collection consisted of a subset of Compendex, classified into six different classes. SVM was shown to outperform the string-matching approach on average; our hypothesis that SVM yields better recall and string-matching better precision was confirmed for only one of the classes. Because the two approaches are complementary, we investigated different combinations of the two based on combining their vocabularies. The results showed that the original approaches, i.e. the machine-learning approach without background knowledge from the controlled vocabulary and the string-matching approach based on the controlled vocabulary, outperform approaches that use combinations of automatically and manually obtained terms. The reasons for these results need further investigation, including a larger data collection and combining the two approaches using predictions.
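
As a rough illustration of the string-matching idea and of the precision and recall measures mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. The vocabulary entries, class labels, and the function names `classify` and `precision_recall` are hypothetical and are not taken from the Ei thesaurus or from the paper; the matching rule (whole-word term lookup on lowercased text) is likewise an assumption.

```python
import re
from collections import Counter

# Hypothetical controlled vocabulary: term -> class label.
# These entries are illustrative only, not from the Ei thesaurus.
VOCABULARY = {
    "heat transfer": "thermodynamics",
    "entropy": "thermodynamics",
    "neural network": "computing",
    "compiler": "computing",
}


def classify(text, vocabulary=VOCABULARY):
    """Return the set of classes whose vocabulary terms occur in the text."""
    # Lowercase and collapse punctuation so multi-word terms match reliably.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for term, label in vocabulary.items():
        # Whole-word match of the term against the normalized text.
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", normalized))
        if hits:
            scores[label] += hits
    return set(scores)


def precision_recall(predicted, actual, label):
    """Precision and recall for one class over parallel lists of label sets."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if label in p and label in a)
    fp = sum(1 for p, a in pairs if label in p and label not in a)
    fn = sum(1 for p, a in pairs if label not in p and label in a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In this sketch a document may receive several classes or none. A document with no matching term gets no class, which lowers recall but not precision.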

Author information

Authors and Affiliations

KnowLib Research Group, Dept. of Information Technology, Lund University, Sweden
Koraljka Golub & Anders Ardö
J. Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Dunja Mladenić & Marko Grobelnik

Editor information

Editors and Affiliations

No Affiliations
Julio Gonzalo
Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Moruzzi, 1, 56124, Pisa, Italy
Costantino Thanos
Dpto. Lenguajes y Sistemas Informáticos, UNED
M. Felisa Verdejo
Dep. de Lenguajes y Sistemas Informáticos, Universidad de Alicante, E-03071, Alicante, Spain
Rafael C. Carrasco

About this paper

Cite this paper

Golub, K., Ardö, A., Mladenić, D., Grobelnik, M. (2006). Comparing and Combining Two Approaches to Automated Subject Classification of Text. In: Gonzalo, J., Thanos, C., Verdejo, M.F., Carrasco, R.C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2006. Lecture Notes in Computer Science, vol 4172. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11863878_45

DOI: https://doi.org/10.1007/11863878_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44636-1
Online ISBN: 978-3-540-44638-5
eBook Packages: Computer Science, Computer Science (R0)
