On ontology-driven document clustering using core semantic features

Fodeh, Samah; Punch, Bill; Tan, Pang-Ning

doi:10.1007/s10115-010-0370-4

Regular Paper
Published: 29 January 2011

Knowledge and Information Systems, volume 28, pages 395–421 (2011)

Samah Fodeh¹,
Bill Punch² &
Pang-Ning Tan²

Abstract

Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
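
The selection idea in the abstract (keep only nouns that are polysemous or synonymous) can be illustrated with a minimal sketch. This is not the authors' implementation: the sense inventory `TOY_SENSES`, the function `candidate_semantic_features`, and the sample documents are all hypothetical stand-ins for a real ontology and corpus, and the information-gain step mentioned in the abstract is not modelled here.

```python
from collections import defaultdict

# Hypothetical sense inventory standing in for an ontology such as WordNet:
# each noun maps to the set of sense identifiers it can express.
TOY_SENSES = {
    "bank": {"bank.financial", "bank.river"},
    "car": {"automobile.n"},
    "automobile": {"automobile.n"},
    "loan": {"loan.n"},
    "river": {"river.n"},
    "engine": {"engine.n"},
}


def candidate_semantic_features(vocabulary, senses):
    """Return nouns that are polysemous or have a synonym in the vocabulary."""
    # Invert the inventory: sense id -> vocabulary words that can express it.
    words_by_sense = defaultdict(set)
    for word in vocabulary:
        for sense in senses.get(word, ()):
            words_by_sense[sense].add(word)

    selected = set()
    for word in vocabulary:
        word_senses = senses.get(word, set())
        # Polysemous: the noun has more than one sense.
        polysemous = len(word_senses) > 1
        # Synonymous: another vocabulary noun shares at least one sense.
        synonymous = any(len(words_by_sense[s]) > 1 for s in word_senses)
        if polysemous or synonymous:
            selected.add(word)
    return selected


# Assumed input: documents already reduced to their nouns.
docs = [
    ["bank", "loan", "car"],
    ["river", "bank"],
    ["automobile", "engine"],
]
vocabulary = {noun for doc in docs for noun in doc}
features = candidate_semantic_features(vocabulary, TOY_SENSES)
print(sorted(features))
```

On this toy input the six-noun vocabulary shrinks to three candidate features: "bank" is kept because it has two senses, and "car" and "automobile" are kept because they share one. With a real ontology the lookup table would be replaced by queries against that resource.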

Author information

Authors and Affiliations

Yale University, New Haven, CT, USA
Samah Fodeh
Michigan State University, East Lansing, MI, USA
Bill Punch & Pang-Ning Tan

Corresponding author

Correspondence to Samah Fodeh.

About this article

Cite this article

Fodeh, S., Punch, B. & Tan, PN. On ontology-driven document clustering using core semantic features. Knowl Inf Syst 28, 395–421 (2011). https://doi.org/10.1007/s10115-010-0370-4

Received: 10 December 2009
Revised: 06 September 2010
Accepted: 26 November 2010
Published: 29 January 2011
Issue Date: August 2011
DOI: https://doi.org/10.1007/s10115-010-0370-4