Abstract
Word embedding models have enabled major advances in natural language understanding and have achieved state-of-the-art performance on multiple natural language processing tasks. In this paper, we present an original method based on an “easy” meta-embedding to automatically detect and correct Arabic real-word errors that are semantically inconsistent with the context of the sentence. Due to the lexical proximity of words in Arabic, the risk of this type of error occurring in documents is relatively high compared to other languages. Our method uses three word embedding techniques, namely SkipGram, FastText and BERT, as well as their combination, for both detection and correction. It checks the semantic affinity of each word with its immediate context within a collocation and with the near context of the sentence. Experiments show that the proposed meta-embedding improves the overall performance of our system.
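The idea of combining several embeddings and scoring a word against its context can be illustrated with a minimal sketch. It assumes that the "easy" combination is a plain average of L2-normalized source vectors (shorter vectors are zero-padded) and that semantic affinity is measured as the mean cosine similarity with the context words; the abstract does not specify either choice. The function names and the toy vectors are hypothetical, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def average_meta_embedding(vectors):
    """Average the source vectors of one word (assumed combination rule).

    Each source vector is L2-normalized, then zero-padded to the largest
    dimensionality so that vectors of different sizes can be averaged.
    """
    dim = max(len(v) for v in vectors)
    padded = [l2_normalize(v) + [0.0] * (dim - len(v)) for v in vectors]
    return [sum(column) / len(padded) for column in zip(*padded)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_affinity(word_vec, context_vecs):
    """Mean cosine similarity between a word and its context words."""
    return sum(cosine(word_vec, c) for c in context_vecs) / len(context_vecs)


# Toy vectors standing in for three embedding sources of the same word.
sources = [[3.0, 4.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
meta = average_meta_embedding(sources)
print(meta)
```

In such a setup, a word whose affinity with its context falls below some threshold would be flagged as a suspected error, and candidate replacements would be ranked by the same score; the threshold itself would have to be tuned on data.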
Notes
One edit operation: addition of a character, deletion of a character, substitution of a character, or transposition of two adjacent characters.
The morpho-syntactic analyzer segments agglutinated words and provides syntactic information such as part-of-speech tags and lemmas.
Available for free download for non-commercial use at sourceforge.net/projects/kacst-acptool/files/
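The four single-edit operations listed in the first note (addition, deletion, substitution, transposition of adjacent characters) can be sketched as a candidate generator. This is only an illustration under stated assumptions: the function names, the lexicon, and the Latin toy alphabet are hypothetical, and an Arabic system would use the Arabic alphabet and a real lexicon instead.

```python
def single_edit_candidates(word, alphabet):
    """Return all strings that differ from `word` by exactly one edit."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    additions = {left + ch + right for left, right in splits for ch in alphabet}
    deletions = {left + right[1:] for left, right in splits if right}
    substitutions = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    # The original word itself is not a correction candidate.
    return (additions | deletions | substitutions | transpositions) - {word}


def real_word_candidates(word, lexicon, alphabet):
    """Keep only the single-edit neighbours that are valid lexicon words."""
    return sorted(single_edit_candidates(word, alphabet) & set(lexicon))


# Toy lexicon and alphabet, for illustration only.
toy_alphabet = "abcdefghijklmnopqrstuvwxyz"
toy_lexicon = {"form", "from", "farm", "for", "forms", "table"}
print(real_word_candidates("form", toy_lexicon, toy_alphabet))
```

Restricting the neighbours to lexicon entries matters because a real-word error is itself a valid word, so its plausible corrections are the other valid words one edit away.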
Ethics declarations
Conflict of interest
The author declares no conflicts of interest or competing interests.
About this article
Cite this article
Zribi, C.B.O. “Easy” meta-embedding for detecting and correcting semantic errors in Arabic documents. Multimed Tools Appl 82, 21161–21175 (2023). https://doi.org/10.1007/s11042-023-14553-4
DOI: https://doi.org/10.1007/s11042-023-14553-4