An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification

  • Conference paper
Web Information Systems Engineering – WISE 2014 (WISE 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8786))

Abstract

The use of semantics in information retrieval tasks has become a vast field of research in recent years. In supervised text classification, the main interest of this work, semantics can be involved at different steps of text processing: during the indexing step, the training step, and the class prediction step. At the class prediction step, new text-to-text semantic similarity measures can replace the classical similarity measures that some classification methods traditionally use for decision-making. In this paper we propose a new TF/IDF-based measure for assessing semantic similarity between texts, with a new function that aggregates the pairwise semantic similarities between the concepts representing the compared documents. Experimental results demonstrate that our measure outperforms other semantic and classical measures with significant improvements.
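
The abstract does not give the aggregation function itself, so the following is only a minimal sketch of the general idea: two documents are represented as bags of concepts, each concept gets a TF/IDF weight, and the document-level score is a weighted average of the concept-to-concept similarities over all pairs. The weighting scheme, the normalization, the helper names (`tf_idf`, `aggregate_similarity`, `concept_sim`) and the toy similarity table are all assumptions made for illustration and should not be read as the measure proposed in the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {concept: weight} dict per document (raw tf * log(N / df))."""
    n = len(docs)
    df = Counter(c for doc in docs for c in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({c: tf[c] * math.log(n / df[c]) for c in tf})
    return weights


# Hypothetical concept-to-concept similarities; in practice these would come
# from an ontology-based measure, not from a hand-written table.
TOY_TABLE = {frozenset(("aspirin", "ibuprofen")): 0.8}


def concept_sim(c1, c2):
    """Toy similarity in [0, 1]: 1 for identical concepts, table lookup otherwise."""
    if c1 == c2:
        return 1.0
    return TOY_TABLE.get(frozenset((c1, c2)), 0.0)


def aggregate_similarity(w1, w2, sim):
    """Weighted average of pairwise concept similarities (illustrative only)."""
    num = 0.0
    den = 0.0
    for c1, a in w1.items():
        for c2, b in w2.items():
            num += a * b * sim(c1, c2)
            den += a * b
    return num / den if den else 0.0


docs = [
    ["aspirin", "headache"],
    ["ibuprofen", "headache", "headache"],
    ["fracture"],
]
weights = tf_idf(docs)
related = aggregate_similarity(weights[0], weights[1], concept_sim)
unrelated = aggregate_similarity(weights[0], weights[2], concept_sim)
print(related > unrelated)
```

In this toy setup the first two documents share one concept and contain two related ones, so they score higher than the first and third, which have nothing in common. A classifier that decides by similarity (for example a nearest-centroid method) could call such a function in place of a purely lexical measure like cosine similarity.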


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Albitar, S., Fournier, S., Espinasse, B. (2014). An Effective TF/IDF-Based Text-to-Text Semantic Similarity Measure for Text Classification. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8786. Springer, Cham. https://doi.org/10.1007/978-3-319-11749-2_8

  • DOI: https://doi.org/10.1007/978-3-319-11749-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11748-5

  • Online ISBN: 978-3-319-11749-2

  • eBook Packages: Computer Science, Computer Science (R0)