IESK-ArDB: a database for handwritten Arabic and an optimized topological segmentation approach

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Although much research has been conducted on the problem of unconstrained handwriting recognition, an effective solution remains a serious challenge. In this article, we address two issues related to Arabic handwriting recognition. First, we present IESK-arDB, a new multi-purpose off-line Arabic handwritten database. It is publicly available and contains more than 4,000 word images, each accompanied by a binary version, a thinned version, and ground truth information stored in a separate XML file. Additionally, it contains around 6,000 character images segmented from the database. A letter frequency analysis showed that the database exhibits letter frequencies similar to those of large corpora of digital text, which supports the usefulness of the database. Second, we propose a multi-phase segmentation approach that starts by detecting and resolving sub-word overlaps, then hypothesizes a large number of segmentation points that are later reduced by a set of heuristic rules. The proposed approach has been successfully tested on IESK-arDB. The results are promising and indicate the efficiency of the suggested approach.
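
The letter frequency comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure the authors used, so the total variation distance below is an assumption chosen for simplicity, and the names `ARABIC_LETTERS`, `letter_frequencies`, and `total_variation_distance` are hypothetical. The sketch counts only the 28 basic letters and ignores hamza forms, ta marbuta, and diacritics.

```python
from collections import Counter

# The 28 basic Arabic letters (simplification: no hamza forms, no ta marbuta).
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def letter_frequencies(words):
    """Return the relative frequency of each basic Arabic letter in `words`."""
    counts = Counter(ch for word in words for ch in word if ch in ARABIC_LETTERS)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in ARABIC_LETTERS}
    return {letter: counts[letter] / total for letter in ARABIC_LETTERS}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same letters.

    0.0 means identical distributions, 1.0 means disjoint support.
    This measure is an assumption, not the one reported in the paper.
    """
    return 0.5 * sum(abs(p[letter] - q[letter]) for letter in p)


# Toy word lists standing in for the database lexicon and a reference corpus.
database_words = ["كتاب", "باب", "بيت"]
corpus_words = ["كتاب", "بيت", "تاب"]

db_freq = letter_frequencies(database_words)
ref_freq = letter_frequencies(corpus_words)
distance = total_variation_distance(db_freq, ref_freq)
print(f"total variation distance: {distance:.3f}")
```

A small distance would suggest that the word list covers letters in roughly the same proportions as the reference text; in practice the reference distribution would come from a large digital corpus rather than a toy list.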

Notes

  1. The second version is planned for release in July 2012. It will contain many more samples of word images, 200 full pages of annotated handwritten Arabic text with and without non-text elements, as well as 60 pages of bilingual handwritten Arabic/Latin text. In addition, a software tool for ground truth management will also be made available for download.

  2. Phonemes specific to Farsi are represented by the basic shapes of some letters with additional diacritics, resulting in an alphabet of 32 letters, compared to 28 letters in the case of Arabic.

  3. This is because Arabic text is written from right to left, and writers usually write the main \(CC\) first and then the auxiliaries. As a result, the auxiliaries appear shifted to the right, away from their corresponding sub-words.

Author information

Corresponding author

Correspondence to Moftah Elzobi.

About this article

Cite this article

Elzobi, M., Al-Hamadi, A., Al Aghbari, Z. et al. IESK-ArDB: a database for handwritten Arabic and an optimized topological segmentation approach. IJDAR 16, 295–308 (2013). https://doi.org/10.1007/s10032-012-0190-z

  • DOI: https://doi.org/10.1007/s10032-012-0190-z
