New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5632)

Abstract

Most text clustering techniques are based on the weights of words and/or phrases in the text. Such a representation is often unsatisfactory because it ignores the relationships between terms and treats them as independent features.
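
To illustrate the limitation described above, the following minimal sketch compares two short texts with a plain term-frequency cosine similarity. The function name `cosine` and the two sample sentences are illustrative assumptions, not taken from the paper. Because the texts share no literal terms, a representation that treats terms as independent features scores them as unrelated even though they are close in meaning.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors stored as Counters."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts: similar meaning, no shared surface terms.
doc_a = Counter("the car needs fuel".split())
doc_b = Counter("an automobile uses petrol".split())

# No term appears in both documents, so the dot product is zero.
sim_ab = cosine(doc_a, doc_b)
sim_aa = cosine(doc_a, doc_a)
print(sim_ab, sim_aa)
```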

In this paper, a new semantic similarity based model (SSBM) is proposed. The model computes semantic similarities by using WordNet as an ontology. It captures the semantic similarity between documents that contain semantically similar terms that are not necessarily syntactically identical.

The semantic similarity based model assigns a new weight to document terms that reflects the semantic relationships between terms that co-occur literally in the document. In conjunction with the extended gloss overlaps measure and the adapted Lesk algorithm, our model resolves ambiguity and synonymy problems that are not detected by traditional term frequency based text mining techniques.
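
As a rough illustration of the gloss overlap idea, the sketch below scores two dictionary glosses by repeatedly finding their longest shared word sequence and adding the square of its length, which is the scoring rule of the extended gloss overlaps measure of Banerjee and Pedersen. Everything else is an assumption made for illustration: the glosses are hypothetical strings rather than WordNet entries, function words are not filtered out, and `extended_overlap` simply sums over all pairs of glosses instead of a chosen set of relation pairs. It is not the authors' implementation.

```python
def longest_common_phrase(a, b):
    """Return (length, start_in_a, start_in_b) of the longest shared word run."""
    best_len, best_a, best_b = 0, 0, 0
    prev = [0] * (len(b) + 1)  # prev[j]: run length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks words already consumed by an earlier overlap.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_a, best_b = cur[j], i - cur[j], j - cur[j]
        prev = cur
    return best_len, best_a, best_b


def gloss_overlap_score(gloss_a, gloss_b):
    """Sum of squared lengths of successive longest common phrases."""
    a = gloss_a.lower().split()
    b = gloss_b.lower().split()
    score = 0
    while True:
        n, ia, ib = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        # Replace the matched phrase with a marker so it is not counted again.
        a[ia:ia + n] = [None]
        b[ib:ib + n] = [None]


def extended_overlap(glosses_a, glosses_b):
    """Simplified extension: sum the score over all pairs of related glosses."""
    return sum(gloss_overlap_score(ga, gb) for ga in glosses_a for gb in glosses_b)


# Hypothetical glosses for two senses and one related concept each.
sense_a = ["a motor vehicle with four wheels", "a conveyance that transports people"]
sense_b = ["a motor vehicle propelled by an engine", "a conveyance that transports goods"]
relatedness = extended_overlap(sense_a, sense_b)
print(relatedness)
```

Squaring the phrase length rewards a single long overlap more than several one-word overlaps of the same total length, which is why multi-word matches dominate the score.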

The proposed model is evaluated on the Reuters-21578 and 20-Newsgroups text collections. Performance is assessed in terms of the F-measure, Purity and Entropy quality measures. The results show promising performance improvements compared to the traditional term based vector space model (VSM) as well as other existing methods that include semantic similarity measures in text clustering.
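
The three quality measures named above can be sketched as follows, using definitions that are common in the document clustering literature. The abstract does not give the formulas, so the exact variants used in the paper (for example, whether entropy is normalized) are an assumption here, as are the function names and the toy labels.

```python
import math
from collections import Counter, defaultdict


def contingency(labels, clusters):
    """Map each cluster id to a Counter of the true class labels it contains."""
    table = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        table[cluster][label] += 1
    return table


def purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    table = contingency(labels, clusters)
    return sum(max(counts.values()) for counts in table.values()) / len(labels)


def entropy(labels, clusters):
    """Size-weighted average of per-cluster class entropy (base 2, unnormalized)."""
    table = contingency(labels, clusters)
    n = len(labels)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


def f_measure(labels, clusters):
    """Class-size-weighted average of the best F1 score each class reaches."""
    table = contingency(labels, clusters)
    n = len(labels)
    total = 0.0
    for label, class_size in Counter(labels).items():
        best = 0.0
        for counts in table.values():
            hits = counts.get(label, 0)
            if hits == 0:
                continue
            precision = hits / sum(counts.values())
            recall = hits / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (class_size / n) * best
    return total


# Toy example: two classes that are evenly mixed across two clusters.
true_labels = ["a", "a", "b", "b"]
cluster_ids = [0, 1, 0, 1]
print(purity(true_labels, cluster_ids),
      entropy(true_labels, cluster_ids),
      f_measure(true_labels, cluster_ids))
```

Higher purity and F-measure indicate better agreement with the reference classes, while lower entropy indicates more homogeneous clusters.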

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gad, W.K., Kamel, M.S. (2009). New Semantic Similarity Based Model for Text Clustering Using Extended Gloss Overlaps. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2009. Lecture Notes in Computer Science, vol 5632. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03070-3_50

  • DOI: https://doi.org/10.1007/978-3-642-03070-3_50

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03069-7

  • Online ISBN: 978-3-642-03070-3

  • eBook Packages: Computer Science, Computer Science (R0)