An Approach for Arabic Diacritization

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11608)

Abstract

Modern Standard Arabic (MSA) uses optional diacritical marks (diacritics; in Arabic, harakat), which have become less common in Arabic books, newspapers and other written media. Diacritics are important for the readability and comprehension of texts, and their absence compounds lexical, morphological and semantic ambiguity. In this paper, we present an automatic diacritization system for Arabic that uses Hidden Markov Models decoded with the Viterbi algorithm, with probabilities learned from diacritized Arabic texts. The corpus used consisted mostly of religious texts. Our results were satisfactory, with a precision of up to 80% at the word level.
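The abstract does not say how the hidden states and observations are defined, so the following is only a minimal sketch under stated assumptions: a character-level bigram HMM in which each bare letter is an observation and the diacritic string attached to it is the hidden state, with add-one smoothing and Viterbi decoding in log space. The function names (`split_word`, `train`, `log_prob`, `diacritize`) and the three-word toy corpus are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

# Assumption: only the eight basic marks U+064B..U+0652 (tanwin, short vowels,
# shadda, sukun) are treated as diacritics.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def split_word(word):
    """Split a diacritized word into (letter, diacritic-string) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding letter
                letter, marks = pairs[-1]
                pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def train(words):
    """Count label-to-label transitions and label-to-letter emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for word in words:
        prev = "<s>"  # start-of-word pseudo-state
        for letter, marks in split_word(word):
            trans[prev][marks] += 1
            emit[marks][letter] += 1
            prev = marks
    return trans, emit

def log_prob(table, given, event, n_events):
    """Add-one smoothed log P(event | given)."""
    row = table.get(given, {})
    return math.log((row.get(event, 0) + 1) / (sum(row.values()) + n_events))

def diacritize(bare, trans, emit):
    """Viterbi decoding: most probable diacritic sequence for a bare word."""
    if not bare:
        return ""
    states = list(emit)
    letters = {l for row in emit.values() for l in row}
    n_s, n_l = len(states), len(letters) + 1  # +1 reserves mass for unseen letters
    # best[s] = (log-probability, label path) of the best path ending in state s
    best = {s: (log_prob(trans, "<s>", s, n_s)
                + log_prob(emit, s, bare[0], n_l), [s]) for s in states}
    for letter in bare[1:]:
        new = {}
        for s in states:
            score, path = max(
                ((best[p][0] + log_prob(trans, p, s, n_s), best[p][1])
                 for p in states),
                key=lambda item: item[0])
            new[s] = (score + log_prob(emit, s, letter, n_l), path + [s])
        best = new
    _, path = max(best.values(), key=lambda item: item[0])
    return "".join(letter + marks for letter, marks in zip(bare, path))

# Toy usage; a real system would train on a large diacritized corpus.
corpus = ["كَتَبَ", "كُتُبٌ", "كَاتِبٌ"]
trans, emit = train(corpus)
print(diacritize("كتب", trans, emit))
```

Word-level evaluation, as reported in the abstract, would then compare each predicted word with its reference form and count exact matches. A word-level HMM (whole diacritized words as states) is an equally plausible reading of the abstract; the decoding step would be the same, only the state and observation inventories would change.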


Notes

  1. https://tahadz.com/mishkal/


Author information


Correspondence to Ismail Hadjir, Mohamed Abbache or Fatma Zohra Belkredim.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Hadjir, I., Abbache, M., Belkredim, F.Z. (2019). An Approach for Arabic Diacritization. In: Métais, E., Meziane, F., Vadera, S., Sugumaran, V., Saraee, M. (eds) Natural Language Processing and Information Systems. NLDB 2019. Lecture Notes in Computer Science, vol 11608. Springer, Cham. https://doi.org/10.1007/978-3-030-23281-8_29


  • DOI: https://doi.org/10.1007/978-3-030-23281-8_29


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23280-1

  • Online ISBN: 978-3-030-23281-8

  • eBook Packages: Computer Science, Computer Science (R0)
