On ontology-driven document clustering using core semantic features

  • Regular Paper
Knowledge and Information Systems

Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed for document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone improves clustering. We next show the importance of polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain from disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
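The feature-reduction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the sense inventory `TOY_SENSES`, the helper names `core_candidates` and `vectorize`, and the three example documents are all hypothetical stand-ins, and the sketch simply assumes that every word found in the inventory is a noun. It keeps only nouns that are polysemous (more than one sense) or synonymous (sharing a sense with another noun in the vocabulary) and reports how much the feature set shrinks. The information-gain measure mentioned in the abstract is not reproduced here.

```python
from collections import Counter

# Hypothetical toy sense inventory standing in for an ontology such as WordNet.
# Each noun maps to the set of sense identifiers it can express (assumption).
TOY_SENSES = {
    "bank": {"bank.financial", "bank.river"},
    "shore": {"bank.river"},
    "car": {"car.vehicle"},
    "automobile": {"car.vehicle"},
    "loan": {"loan.debt"},
    "river": {"river.stream"},
    "engine": {"engine.motor"},
}


def core_candidates(vocabulary, senses):
    """Return nouns that are polysemous or share a sense with another noun."""
    # Index which vocabulary nouns can express each sense.
    sense_to_nouns = {}
    for noun in vocabulary:
        for sense in senses.get(noun, ()):
            sense_to_nouns.setdefault(sense, set()).add(noun)

    core = set()
    for noun in vocabulary:
        noun_senses = senses.get(noun, set())
        polysemous = len(noun_senses) > 1
        synonymous = any(len(sense_to_nouns[s]) > 1 for s in noun_senses)
        if polysemous or synonymous:
            core.add(noun)
    return core


def vectorize(doc_tokens, features):
    """Count occurrences of each retained feature, in sorted feature order."""
    counts = Counter(t for t in doc_tokens if t in features)
    return [counts[f] for f in sorted(features)]


# Hypothetical three-document corpus, already tokenized.
docs = [
    "the bank approved the loan for the car".split(),
    "the river shore near the bank flooded".split(),
    "the automobile engine failed".split(),
]

vocabulary = {token for doc in docs for token in doc}
core = core_candidates(vocabulary, TOY_SENSES)
reduction = 1 - len(core) / len(vocabulary)

print(sorted(core))
print(f"features: {len(vocabulary)} -> {len(core)} ({reduction:.0%} fewer)")
print([vectorize(doc, core) for doc in docs])
```

In a real pipeline the inventory would come from an ontology and the noun filter from a part-of-speech tagger; the reduced vectors would then be passed to any standard clustering algorithm.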

Author information

Corresponding author

Correspondence to Samah Fodeh.

About this article

Cite this article

Fodeh, S., Punch, B. & Tan, PN. On ontology-driven document clustering using core semantic features. Knowl Inf Syst 28, 395–421 (2011). https://doi.org/10.1007/s10115-010-0370-4

  • DOI: https://doi.org/10.1007/s10115-010-0370-4
