research-article

A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts

Authors:
Bashar Al-Shboul, The University of Jordan-Amman, Jordan
Duha Al-Darras, The University of Jordan-Amman, Jordan
Dana Al-Qudah, The University of Jordan-Amman, Jordan

MEDES '22: Proceedings of the 14th International Conference on Management of Digital EcoSystems, October 2022, Pages 129–135
https://doi.org/10.1145/3508397.3564848

Published: 08 December 2022

ABSTRACT

An enormous number of microblog posts are created and published on the web each day. Many of these posts repeat the same content or cover the same topic. Detecting repetitive content can support applications such as question answering and trending-topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and identifies tweets belonging to the same topic. The proposed model is based on Latent Dirichlet Allocation (LDA) topic modeling, together with semantic text expansion that draws on two external resources, BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision is 80.1%, achieved using Wikipedia embedded content on the stemmed text mode with a large number of LDA topics.
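
The pairing step described above, classifying a tweet pair from its topic distributions, can be illustrated with a minimal sketch. The abstract does not say which comparison measure or decision rule is used, so the Jensen-Shannon divergence, the `same_topic` helper, the 0.2 threshold, and the three example distributions below are all assumptions chosen for illustration, not the authors' method.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def same_topic(p, q, threshold=0.2):
    """Hypothetical decision rule: a pair is 'same topic' when the
    divergence falls at or below an assumed threshold."""
    return js_divergence(p, q) <= threshold


# Made-up topic distributions over 4 LDA topics (illustration only).
tweet_a = [0.70, 0.10, 0.10, 0.10]
tweet_b = [0.65, 0.15, 0.10, 0.10]
tweet_c = [0.05, 0.05, 0.10, 0.80]

print(same_topic(tweet_a, tweet_b))
print(same_topic(tweet_a, tweet_c))
```

In practice the distributions would come from a trained LDA model rather than hand-written lists, and a threshold would have to be tuned on labeled tweet pairs.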


Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification

Published in
MEDES '22: Proceedings of the 14th International Conference on Management of Digital EcoSystems
October 2022, 172 pages
ISBN: 9781450392198
DOI: 10.1145/3508397
General Chairs: Ernesto Damiani (Khalifa University, UAE), Claudio Silvestri (Università Ca' Foscari di Venezia, Italy), Mirjana Ivanovic (University of Novi Sad, Serbia), Richard Chbeir (University of Pau and the Adour Region, France), Yannis Manolopoulos (Open University of Cyprus, Cyprus)
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
arabic paraphrase identification
arabic semantic text expansion
arabic topic modelling
Acceptance Rates
Overall Acceptance Rate: 267 of 682 submissions, 39%