New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps

Gad, Walaa K.; Kamel, Mohamed S.

doi:10.1007/978-3-642-03070-3_50

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Most text clustering techniques are based on the weights of words and/or phrases in the text. Such a representation is often unsatisfactory because it ignores the relationships between terms and treats them as independent features.
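The limitation described above can be illustrated with a minimal sketch of a term-frequency vector space comparison. The three example sentences and the helper name `cosine` are illustrative assumptions, not material from the paper; the point is only that documents expressing the same idea with different words share no features.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy documents (not taken from the paper's datasets).
doc_a = Counter("the car needs fuel".split())
doc_b = Counter("the automobile needs petrol".split())
doc_c = Counter("an automobile requires petrol".split())

sim_ab = cosine(doc_a, doc_b)  # only "the" and "needs" are shared
sim_ac = cosine(doc_a, doc_c)  # no shared tokens, so the similarity is 0.0
```

Here `doc_a` and `doc_c` are close in meaning, yet the term-based representation scores them as completely unrelated because "car"/"automobile" and "fuel"/"petrol" are treated as independent features.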

In this paper, a new semantic similarity based model (SSBM) is proposed. The model computes semantic similarities by using WordNet as an ontology. It captures the semantic similarity between documents that contain semantically similar terms that are not necessarily syntactically identical.

The semantic similarity based model assigns a new weight to each document term, reflecting the semantic relationships between terms that literally co-occur in the document. In conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm, the model resolves ambiguity and synonymy problems that traditional term-frequency-based text mining techniques do not detect.
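As a rough illustration of these two ingredients, the sketch below scores gloss overlap in the commonly described way (sum of squared lengths of the longest shared word sequences, removing each match before searching again) and then applies a re-weighting step. Several things here are assumptions rather than the paper's definitions: the glosses are invented stand-ins for WordNet glosses, `extended_overlap` simply sums over all gloss pairs instead of a curated set of relation pairs, function-word filtering is omitted, and the `reweight` formula (original weight plus similarity-weighted weights of co-occurring terms) is one plausible form, since the abstract does not give the exact expression.

```python
def longest_common_phrase(a, b):
    """Return (length, start_a, start_b) of the longest shared contiguous word run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def gloss_overlap(gloss_a, gloss_b):
    """Sum of squared lengths of shared phrases between two glosses."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        n, ia, ib = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Replace the matched run with unique sentinels so it cannot match again.
        a[ia:ia + n] = [object()]
        b[ib:ib + n] = [object()]

def extended_overlap(glosses_a, glosses_b):
    """Simplified extension: sum overlaps over all pairs of related glosses."""
    return sum(gloss_overlap(ga, gb) for ga in glosses_a for gb in glosses_b)

def reweight(weights, sim):
    """Hypothetical re-weighting: add similarity-weighted weights of co-occurring terms."""
    return {
        t: w + sum(sim(t, u) * wu for u, wu in weights.items() if u != t)
        for t, w in weights.items()
    }

# Invented glosses standing in for WordNet entries.
car_gloss = "a motor vehicle with four wheels"
auto_gloss = "a motor vehicle usually with four wheels"
bank_gloss = "sloping land beside a body of water"

related_score = gloss_overlap(car_gloss, auto_gloss)    # two 3-word phrases: 9 + 9
unrelated_score = gloss_overlap(car_gloss, bank_gloss)  # only "a" is shared

# Toy normalized similarities (assumed values) and term weights for one document.
toy_sim = {frozenset(("car", "automobile")): 0.9}
sim = lambda t, u: toy_sim.get(frozenset((t, u)), 0.0)
new_weights = reweight({"car": 2.0, "automobile": 1.0, "bank": 1.0}, sim)
```

In this toy document, "car" and "automobile" reinforce each other's weights while "bank" is left unchanged, which is the kind of effect the abstract attributes to the model. With real data, the glosses and related synsets would come from WordNet, and a word sense disambiguation step such as adapted Lesk would first select which sense's gloss to use.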

The proposed model is evaluated on the Reuters-21578 and 20-Newsgroups text collections. Performance is assessed in terms of the F-measure, Purity, and Entropy quality measures. The results show promising performance improvements over the traditional term-based vector space model (VSM), as well as over other existing methods that incorporate semantic similarity measures in text clustering.
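For readers unfamiliar with these three measures, the following sketch computes them in their commonly used class-versus-cluster form. The abstract does not state the exact variants used, so the base-2 logarithm, the absence of entropy normalization, and the function name `clustering_quality` are assumptions.

```python
import math
from collections import Counter

def clustering_quality(labels, clusters):
    """Return (purity, entropy, f_measure) for a flat clustering."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))  # documents of class c in cluster k

    # Purity: fraction of documents that belong to their cluster's majority class.
    purity = sum(
        max(joint[(c, k)] for c in class_sizes) for k in cluster_sizes
    ) / n

    # Entropy: size-weighted average of each cluster's class-distribution entropy.
    entropy = 0.0
    for k, n_k in cluster_sizes.items():
        h = -sum(
            (joint[(c, k)] / n_k) * math.log2(joint[(c, k)] / n_k)
            for c in class_sizes if joint[(c, k)]
        )
        entropy += (n_k / n) * h

    # F-measure: for each class take the best-matching cluster, weight by class size.
    f_measure = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck:
                p, r = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * p * r / (p + r))
        f_measure += (n_c / n) * best
    return purity, entropy, f_measure

# Toy example: six documents, two true classes, two clusters.
purity, entropy, f_measure = clustering_quality(
    ["a", "a", "a", "b", "b", "b"], [0, 0, 1, 1, 1, 1]
)
```

Higher purity and F-measure and lower entropy indicate clusters that agree more closely with the reference classes.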

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
Walaa K. Gad & Mohamed S. Kamel

Editor information

Editors and Affiliations

Institut für Bildverarbeitung und angewandte Informatik, Körnerstr. 10, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Gad, W.K., Kamel, M.S. (2009). New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_50

DOI: https://doi.org/10.1007/978-3-642-03070-3_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03069-7
Online ISBN: 978-3-642-03070-3
eBook Packages: Computer Science, Computer Science (R0)
