ABSTRACT
An enormous number of microblog posts are created and published on the web each day. Many of them repeat the same content or cover the same topic. Detecting such repetition can support applications such as question answering and trending-topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and also identifies tweets belonging to the same topic. The proposed model is based on Latent Dirichlet Allocation (LDA) topic modeling combined with semantic text expansion that draws on external resources, namely BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision is 80.1%, achieved using Wikipedia embedded content on stemmed text with a large number of LDA topics.
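To make the pair-classification step concrete, here is a minimal sketch of comparing the topic distributions of two tweets. It assumes the distributions have already been inferred by an LDA model and uses Jensen-Shannon divergence with a fixed cut-off; the abstract does not state which divergence measure or threshold the authors used, so the function names, the example vectors, and the `threshold` value are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions.

    With base-2 logarithms the value lies in [0, 1]:
    0 for identical distributions, 1 for disjoint ones.
    """
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def same_topic(p, q, threshold=0.2):
    """Label a tweet pair as same-topic when its divergence is small.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return js_divergence(p, q) <= threshold


# Hypothetical topic distributions over three LDA topics.
tweet_a = [0.70, 0.20, 0.10]
tweet_b = [0.65, 0.25, 0.10]
tweet_c = [0.05, 0.15, 0.80]

print(same_topic(tweet_a, tweet_b))  # similar topic mix
print(same_topic(tweet_a, tweet_c))  # different dominant topic
```

In practice the cut-off would be tuned on labeled pairs, and any symmetric measure over probability vectors could take the place of the divergence used here.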
Index Terms
- A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts