Abstract
Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed for document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone improves clustering. We next show the importance of polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain from disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represents a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
Cite this article
Fodeh, S., Punch, B. & Tan, PN. On ontology-driven document clustering using core semantic features. Knowl Inf Syst 28, 395–421 (2011). https://doi.org/10.1007/s10115-010-0370-4
DOI: https://doi.org/10.1007/s10115-010-0370-4