Ibn-Ginni: An Improved Morphological Analyzer for Arabic

Published: 08 February 2024

Abstract

Arabic is a morphologically rich language with a complex system of word formation and structure. Affixes (i.e., prefixes and suffixes) attach to root words to produce different meanings and grammatical functions, indicating features such as tense, gender, number, case, and person. In addition, the meaning and function of a word can be modified through internal structures known as morphological patterns. Computational morphological analyzers are vital to developing Arabic language processing toolkits. In this article, we introduce a new morphological analyzer, Ibn-Ginni, that inherits the speed and quality of the Buckwalter Arabic Morphological Analyzer (BAMA). BAMA has poor coverage of classical Arabic, so we improve that coverage using the Alkhalil analyzer. Although Alkhalil is slow, we used it to generate a huge number of solutions for 3 million unique Arabic words collected from different resources. These word-form-based solutions were converted to stem-based solutions, refined manually, and added to the BAMA database, resulting in substantial improvements in the quality of the analysis. Ibn-Ginni is thus a hybrid of the BAMA and Alkhalil analyzers and may be considered an efficient large-scale analyzer. Ibn-Ginni analyzed 0.6 million more words than BAMA, a significant improvement in coverage of the Arabic language. It is also fast: the average time to analyze a word is 0.3 ms. On a corpus designed for benchmarking Arabic morphological analyzers, Ibn-Ginni found all solutions for 72.72% of the words, but did not provide all possible morphological solutions for 24.24% of the words. The analyzer and its morphological database are publicly available on GitHub.
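
To make the notion of a stem-based solution concrete, the following is a minimal sketch of BAMA-style lookup, not the Ibn-Ginni implementation. It assumes the commonly described BAMA design: a word is split into prefix, stem, and suffix, each part is looked up in its own lexicon, and the three pairwise category combinations must appear in compatibility tables. The lexicon entries, the category names, the length limits, and the `analyze` function are toy assumptions chosen for illustration; words are written in Buckwalter transliteration.

```python
# Toy lexicons: surface form -> list of morphological categories.
# All entries and category names are illustrative, not taken from the real database.
PREFIXES = {"": ["Pref-0"], "w": ["Pref-Wa"], "Al": ["NPref-Al"], "wAl": ["NPref-Al"]}
STEMS = {"ktAb": ["N"], "ktb": ["PV"]}
SUFFIXES = {"": ["Suff-0"], "At": ["NSuff-At"], "t": ["PVSuff-t"]}

# Toy compatibility tables: which category pairs may co-occur.
AB = {("Pref-0", "N"), ("Pref-0", "PV"), ("Pref-Wa", "N"),
      ("Pref-Wa", "PV"), ("NPref-Al", "N")}                      # prefix-stem
AC = {("Pref-0", "Suff-0"), ("Pref-0", "NSuff-At"), ("Pref-0", "PVSuff-t"),
      ("Pref-Wa", "Suff-0"), ("Pref-Wa", "NSuff-At"), ("Pref-Wa", "PVSuff-t"),
      ("NPref-Al", "Suff-0"), ("NPref-Al", "NSuff-At")}          # prefix-suffix
BC = {("N", "Suff-0"), ("N", "NSuff-At"),
      ("PV", "Suff-0"), ("PV", "PVSuff-t")}                      # stem-suffix

MAX_PREFIX_LEN = 4  # assumed upper bounds on affix length
MAX_SUFFIX_LEN = 6


def analyze(word):
    """Return every (prefix, stem, suffix, stem_category) licensed by the tables."""
    solutions = []
    n = len(word)
    # Enumerate every split that leaves at least one character for the stem.
    for p_len in range(min(MAX_PREFIX_LEN, n - 1) + 1):
        for s_len in range(min(MAX_SUFFIX_LEN, n - p_len - 1) + 1):
            prefix = word[:p_len]
            stem = word[p_len:n - s_len]
            suffix = word[n - s_len:]
            for a in PREFIXES.get(prefix, []):
                for b in STEMS.get(stem, []):
                    for c in SUFFIXES.get(suffix, []):
                        # Keep the split only if all three pairs are compatible.
                        if (a, b) in AB and (a, c) in AC and (b, c) in BC:
                            solutions.append((prefix, stem, suffix, b))
    return solutions


print(analyze("wAlktAb"))  # conjunction + article + noun stem
print(analyze("wktbt"))    # conjunction + verb stem + verbal suffix
print(analyze("Alktbt"))   # article + verb stem: rejected by the prefix-stem table
```

In this scheme, extending coverage amounts to adding lexicon entries, which is consistent with the abstract's description of converting word-form-based solutions to stem-based ones and adding them to the BAMA database; every lookup is a dictionary access, so analysis time depends on word length rather than lexicon size.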


    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 23, Issue 2
    February 2024
    340 pages
    EISSN:2375-4702
    DOI:10.1145/3613556

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 February 2024
    Online AM: 28 December 2023
    Accepted: 24 December 2023
    Revised: 18 December 2023
    Received: 18 November 2023
    Published in TALLIP Volume 23, Issue 2

    Author Tags

    1. The Arabic morphological analyzers
    2. Arabic morphology
    3. Buckwalter morphological analyzer (BAMA)
    4. Alkhalil
    5. stem-based morphology
    6. wordform-based morphology
    7. Arabic stemming
    8. Arabic lemmatization
    9. affixation

    Qualifiers

    • Research-article

    Funding Sources

    • Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia
