Automatic diacritization of Arabic text using recurrent neural networks

  • Original Paper
  • Published in: International Journal on Document Analysis and Recognition (IJDAR)

Abstract

This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactical analysis is performed on the data before it is processed by the network. Nonetheless, when the network's output is post-processed with our error correction techniques, it achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09 % and 5.82 %, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25 %, the word error rate by 20 %, and the last-letter diacritization error rate by 33 % relative to the best published results.
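The transcription framing and the two headline metrics can be illustrated with a short sketch. It is not the authors' code: the function names, the choice of the eight marks U+064B to U+0652 as the diacritic inventory, and the exact metric definitions (diacritic error rate as the fraction of letters whose marks are wrong, word error rate as the fraction of words containing at least one such letter) are assumptions made for illustration.

```python
# Assumption: "diacritics" means the eight Arabic marks U+064B..U+0652
# (fathatan through sukun); the paper's exact inventory may differ.
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_diacritics(text):
    """Build the undiacritized network input from a diacritized sentence."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_marks(word):
    """Split a word into [letter, marks] pairs; marks attach to the preceding letter."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # ignore a stray mark with no letter before it
                pairs[-1][1] += ch
        else:
            pairs.append([ch, ""])
    return pairs


def error_rates(reference, hypothesis):
    """Return (diacritic error rate, word error rate) as fractions.

    Hypothetical definitions: both sentences must contain the same letters,
    so only the diacritics are scored.
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("sentences must be non-empty with equal word counts")
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref_word), split_marks(hyp_word)
        if [p[0] for p in ref_pairs] != [p[0] for p in hyp_pairs]:
            raise ValueError("letter sequences differ; only diacritics may change")
        # Compare marks as sorted lists so the order of shadda and vowel is irrelevant.
        wrong = sum(sorted(r[1]) != sorted(h[1]) for r, h in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0
    if letters == 0:
        raise ValueError("no letters to score")
    return letter_errors / letters, word_errors / len(ref_words)


# Two diacritized words, written with escapes so every mark is explicit.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0639\u0650\u0644\u0652\u0645\u064C"
hypothesis = reference[:-1] + "\u064B"  # wrong mark on the last letter
network_input = strip_diacritics(reference)
der, wer = error_rates(reference, hypothesis)
```

Here `network_input` plays the role of the undiacritized source sequence and `reference` the fully diacritized target, while `der` and `wer` score a hypothetical system output against that target. A single wrong mark affects one letter but a whole word, which is why word error rates are typically higher than diacritic error rates.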

Author information

Corresponding author

Correspondence to Gheith A. Abandah.

About this article

Cite this article

Abandah, G.A., Graves, A., Al-Shagoor, B. et al. Automatic diacritization of Arabic text using recurrent neural networks. IJDAR 18, 183–197 (2015). https://doi.org/10.1007/s10032-015-0242-2

  • DOI: https://doi.org/10.1007/s10032-015-0242-2
