DOI: 10.1145/3508397.3564848
research-article

A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts

Published: 08 December 2022

ABSTRACT

An enormous number of microblog posts are created and published on the web each day. Many of them repeat the same content or cover the same topic. Detecting repetitive content can support applications such as question answering and trending-topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and identifies tweets belonging to the same topic. The proposed model is based on Latent Dirichlet Allocation (LDA) topic modeling combined with semantic text expansion that draws on external resources, namely BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision, 80.1%, was achieved using Wikipedia embedded content on the stemmed text mode with a large number of LDA topics.
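The expansion and pairing steps described in the abstract can be illustrated with a minimal sketch. The abstract does not say how topic distributions are compared or how expansion terms are retrieved, so everything specific here is an assumption made for illustration and not the authors' method: the Jensen-Shannon divergence as the comparison measure, the `threshold` value, and the `related_terms` dictionary, which is a toy stand-in for a BabelNet or Wikipedia lookup. The topic distributions are hand-written placeholders for what a trained LDA model would infer.

```python
import math


def expand_tokens(tokens, related_terms):
    """Append related terms to a tweet's tokens.

    `related_terms` is a hypothetical dict standing in for an external
    resource lookup (e.g. BabelNet or Wikipedia); it is not a real API.
    """
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(related_terms.get(tok, []))
    return expanded


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def same_topic(p, q, threshold=0.2):
    """Label a tweet pair as same-topic when the divergence is small.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    return js_divergence(p, q) <= threshold


# Toy usage with placeholder tokens and hand-written topic distributions.
tweet = ["election", "results"]
lookup = {"election": ["vote", "ballot"]}
print(expand_tokens(tweet, lookup))

dist_a = [0.7, 0.2, 0.1]  # assumed LDA topic distribution of tweet A
dist_b = [0.6, 0.3, 0.1]  # assumed LDA topic distribution of tweet B
print(same_topic(dist_a, dist_b))
```

In a fuller pipeline, the expanded token lists would be fed to a trained LDA model to obtain the distributions, and the threshold would be tuned on labeled tweet pairs.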

    • Published in

      MEDES '22: Proceedings of the 14th International Conference on Management of Digital EcoSystems
      October 2022
      172 pages
      ISBN: 9781450392198
      DOI: 10.1145/3508397

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 267 of 682 submissions, 39%