Semantically Enhanced Text Stemmer (SETS) for Cross-Domain Document Clustering

Stankov, Ivan; Todorov, Diman; Setchi, Rossitza

doi:10.1007/978-3-642-37343-5_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7828)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

This paper focuses on processing cross-domain document repositories, a task complicated by word ambiguity and by the fact that monosemic words are more domain-oriented than polysemic ones. It describes a semantically enhanced text normalization algorithm (SETS) aimed at improving document clustering, and investigates the performance of the sk-means clustering algorithm across domains by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 generic sub-domains, each consisting of a thousand documents randomly selected from the Reuters-21578 corpus. The experimental results demonstrate that clusters produced with SETS are more coherent than those produced with text normalized by the Porter stemmer. In addition, semantic-based text normalization is shown to be resistant to noise, which is often introduced in the index aggregation stage.
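The traditional baseline named in the abstract (TF-IDF document vectors clustered with sk-means) can be illustrated with a minimal sketch. It rests on several assumptions not stated above: that sk-means denotes spherical k-means (cosine similarity on unit-length vectors), that documents arrive already tokenized and normalized, and that initial centroids are chosen by hand. The function names, the toy documents, and the seed choice are all hypothetical, and the sketch does not implement SETS itself.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(normalise({t: c * math.log(n / df[t]) for t, c in tf.items()}))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, seeds, iterations=20):
    """Cluster unit vectors by cosine similarity.

    `seeds` lists the indices of the vectors used as initial centroids
    (a hypothetical, deterministic initialization chosen for this sketch).
    """
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        for k in range(len(centroids)):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == k:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                centroids[k] = normalise(total)
    return labels


# Toy, already-normalized documents from two made-up domains.
docs = [
    ["oil", "price", "barrel"],
    ["oil", "barrel", "export"],
    ["bank", "rate", "interest"],
    ["bank", "interest", "loan"],
]
vectors = tfidf_vectors(docs)
labels = spherical_kmeans(vectors, seeds=[0, 2])
print(labels)
```

In a comparison of the kind the abstract describes, only the tokens fed to `tfidf_vectors` would change (for example, stems from a Porter stemmer versus semantically normalized terms), while the clustering step stays fixed.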

Author information

Authors and Affiliations

Knowledge Engineering Systems Group, School of Engineering, Cardiff University, UK
Ivan Stankov, Diman Todorov & Rossitza Setchi

Editor information

Editors and Affiliations

Department of Computational Science and Artificial Intelligence, University of the Basque Country, Manuel Lardizabal 1, 20018, San Sebastian, Spain
Manuel Graña
Vicomtech-IK4, Paseo Mijeletegui, 20009, San Sebastian, Spain
Carlos Toro
KES International, P.O. Box 2115, BN43 9AF, Shoreham-by-sea, UK
Robert J. Howlett
School of Engineering, University of Canberra, Mawson Lakes Campus, ACT 2601, Mawson Lakes, SA, Australia
Lakhmi C. Jain

Cite this paper

Stankov, I., Todorov, D., Setchi, R. (2013). Semantically Enhanced Text Stemmer (SETS) for Cross-Domain Document Clustering. In: Graña, M., Toro, C., Howlett, R.J., Jain, L.C. (eds) Knowledge Engineering, Machine Learning and Lattice Computing with Applications. KES 2012. Lecture Notes in Computer Science, vol 7828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37343-5_12

DOI: https://doi.org/10.1007/978-3-642-37343-5_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37342-8
Online ISBN: 978-3-642-37343-5
eBook Packages: Computer Science, Computer Science (R0)
