Arabic character recognition using a Haar cascade classifier approach (HCC)

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

Optical character recognition (OCR) shows great potential for rapid data entry, but has had limited success when applied to the Arabic language. Traditional OCR problems are compounded by the nature of the Arabic language and by the fact that its script is heavily connected. A machine learning approach, the Haar cascade classifier (HCC), was introduced by Viola and Jones (Rapid object detection using a boosted cascade of simple features. Kauai, Hawaii, 2001) to achieve rapid object detection based on a boosted cascade of simple Haar-like features. Here, that approach is applied for the first time to Arabic glyph recognition. The HCC approach eliminates problematic steps in the pre-processing and recognition phases and, most importantly, the character segmentation stage. A classifier was produced for each of the 61 Arabic glyphs that exist after the removal of diacritical marks (dots). These classifiers were trained and tested on some 2,000 images each. The system was tested with real text images and produced a recognition rate for Arabic glyphs of 87%. The technique gives good results relative to those achieved using a commercial Arabic OCR application and an existing state-of-the-art research application.
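To make the phrase "simple Haar-like features" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Viola and Jones formulation, in which a summed-area table (integral image) lets the pixel sum of any rectangle be read with four lookups, and a two-rectangle feature is the difference between two adjacent rectangle sums. The function names and the toy 4x4 patch are illustrative assumptions and do not come from the paper.

```python
from typing import List

Image = List[List[int]]


def integral_image(img: Image) -> Image:
    """Build a summed-area table padded with one leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Sum of the pixels in the rectangle with top-left corner (x, y), via four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: Image, x: int, y: int, w: int, h: int) -> int:
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Hypothetical toy patch: dark left half (0), bright right half (9), i.e. a vertical edge.
patch = [[0, 0, 9, 9] for _ in range(4)]
ii = integral_image(patch)
print(rect_sum(ii, 0, 0, 4, 4))          # sum over the whole patch
print(two_rect_feature(ii, 0, 0, 4, 4))  # strongly negative on a dark-to-bright edge
```

In a boosted cascade, many such feature values are thresholded and combined into stages, and a candidate window is rejected as soon as one stage fails, which is what makes detection fast. The sketch covers only the feature computation.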

References

  1. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: IEEE conference on computer vision and pattern recognition (CVPR01). Kauai, Hawaii, pp 511–518

  2. Abdelazim HY (2006) Recent trends in Arabic character recognition. In: The sixth conference on language engineering. Cairo, Egypt, pp 212–249

  3. Adolf F (2003) How-to build a cascade of boosted classifiers based on Haar-like features. OpenCV’s Rapid Object Detection. http://lab.cntl.kyutech.ac.jp/~kobalab/nishida/opencv/OpenCV_ObjectDetection_HowTo.pdf. Accessed 3 Nov 2012

  4. Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection. In: IEEE international conference of image processing (ICIP 2002). New York, USA, pp 900–903

  5. Lienhart R, Kuranov A, Pisarevsky V (2002) Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: 25th pattern recognition symposium (DAGM03). Magdeburg, Germany, pp 297–304

  6. Kanoun S, Moalla I, Ennaji A, Alimi AM (2000) Script identification for Arabic and Latin, printed and handwritten documents. Presented at the 4th IAPR-international workshop on document analysis systems: DAS. Rio de Janeiro, Brazil

  7. Kanoun S, Ennaji A, Lecourtier Y, Alimi AM (2002) Linguistic integration information in the AABATAS Arabic text analysis system. In: 8th international workshop on frontiers in handwriting recognition (IWFHR’02). Ontario, Canada, pp 389–394

  8. Kanoun S, Alimi AM, Lecourtier Y (2005) Affixal approach for Arabic decomposable vocabulary recognition: a validation on printed word in only one font. In: The 8th international conference on document analysis and recognition (ICDAR’05). Seoul, Korea, pp 1025–1029

  9. Kanoun S, Slimane F, Guesmi H, Ingold R, Alimi AM, Hennebert J (2007) Affixal approach versus analytical approach for off-line Arabic decomposable vocabulary recognition. In: 10th international conference on document analysis and recognition, ICDAR’09. Barcelona, Spain, pp 661–665

  10. Slimane F, Ingold R, Kanoun S, Alimi AM, Hennebert J (2009) A new Arabic printed text image database and evaluation protocols. In: 10th international conference on document analysis and recognition. Barcelona, Spain, pp 946–950

  11. Benjelil M, Kanoun S, Mullot R, Alimi AM (2010) Complex documents images segmentation based on steerable pyramid features. Int J Doc Anal Recogn 13:209–228

  12. Moussa SB, Zahour A, Benabdelhafid A, Alimi AM (2010) New features using fractal multi-dimensions for generalized Arabic font recognition. Pattern Recogn Lett 31:361–371

  13. Slimane F, Kanoun S, Hennebert J, Alimi AM, Ingold R (2013) A study on font-family and font-size recognition applied to Arabic word images at ultra-low resolution. Pattern Recogn Lett 34:209–218

  14. The Unicode Consortium (2014) The Unicode Standard, Version 7.0.0, The Unicode Consortium, Mountain View, CA. ISBN 978-1-936213-09-2. http://www.unicode.org/versions/Unicode7.0.0/

  15. Jaiem FK, Kanoun S, Khemakhem M, El Abed H, Kardoun J (2013) Database for Arabic printed text recognition research. In: ICIAP 2013, Part I, LNCS 8156, pp 251–259

  16. AbdelRaouf A, Higgins C, Pridmore T, Khalil M (2010) Building a multi-modal Arabic corpus (MMAC). Int J Doc Anal Recogn 13:285–302

  17. The Unicode Consortium (2011) The Unicode Standard, Version 6.0.0, Chapter 8. The Unicode Consortium, Mountain View, CA. ISBN 978-1-936213-01-6. http://unicode.org/Public/UNIDATA/ArabicShaping.txt. Accessed 11 Apr 2014

  18. Ahmed I, Mahmoud SA, Parvez MT (2012) Printed Arabic text recognition. In: Guide to OCR for Arabic scripts, Springer, London, pp 147–168

  19. Lorigo LM, Govindaraju V (2006) Offline Arabic handwriting recognition: a survey. IEEE Trans Pattern Anal Mach Intell 28:712–724

  20. Amin A (1997) Off line Arabic character recognition—a survey. In: The 4th international conference on document analysis and recognition. Ulm, Germany, pp 596–599

  21. Khayyat M, Lam L, Suen CY (2014) Learning-based word spotting system for Arabic handwritten documents. Pattern Recogn 47:1021–1030

  22. AbdelRaouf A, Higgins C, Khalil M (2008) A database for Arabic printed character recognition. In: The international conference on image analysis and recognition-ICIAR2008. Póvoa de Varzim, Portugal, pp 567–578

  23. Alginahi YM (2013) A survey on Arabic character segmentation. Int J Doc Anal Recogn 16:105–126

  24. Harty R, Ghaddar C (2004) Arabic text recognition. Int Arab J Inf Technol 1:156–163

  25. Kasinski A, Schmidt A (2010) The architecture and performance of the face and eyes detection system based on the Haar cascade classifiers. Pattern Anal Appl 13:197–211

  26. Crow FC (1984) Summed-area tables for texture mapping. SIGGRAPH Comput Graph 18:207–212

  27. Messom C, Barczak A (2006) Fast and efficient rotated Haar-like features using rotated integral images. In: Australian conference on robotics and automation (ACRA2006), pp 1–6

  28. AbdelRaouf A, Higgins CA, Pridmore T, Khalil MI (2014) Fast Arabic glyph recognizer based on Haar cascade classifiers. Presented at the international conference on pattern recognition applications and methods (ICPRAM 2014). Angers, France

  29. Schapire RE (2002) The boosting approach to machine learning, an overview. In: MSRI workshop on nonlinear estimation and classification. Berkeley, CA, USA, pp 149–172

  30. Khorsheed MS (2002) Off-line Arabic character recognition: a review. Pattern Anal Appl 5:31–45

  31. Senior A (1992) Off-line handwriting recognition: a review and experiments. Cambridge University, Engineering Department, Cambridge

  32. Cheriet M, Kharma N, Liu C-L, Suen C (2007) Character recognition systems: a guide for students and practitioners. Wiley, New York

  33. Souza A, Cheriet M, Naoi S, Suen CY (2003) Automatic filter selection using image quality assessment. In: The 7th international conference on document analysis and recognition (ICDAR’03). Edinburgh, Scotland

  34. Ahmad I (2013) A technique for skew detection of printed Arabic documents. In: Computer graphics, imaging and visualization (CGIV), 2013 10th international conference, pp 62–67

  35. Breuel TM (2002) Robust least square baseline finding using a branch and bound algorithm. In: Document recognition and retrieval VIII, SPIE

  36. Broumandnia A, Shanbehzadeh J (2007) Fast Zernike wavelet moments for Farsi character recognition. Image Vis Comput 25:717–726

  37. Touj S, Amara NEB, Amiri H (2003) Generalized hough transform for Arabic optical character recognition. In: 7th international conference on document analysis and recognition (ICDAR 2003). Edinburgh, Scotland, pp 1242–1246

  38. Noor SM, Mohammed IA, George LE (2011) Handwritten Arabic (Indian) numerals recognition using Fourier descriptor and structure base classifier. J Al-Nahrain Univ 14:215–224

  39. Gonzalez RC, Woods RE (2007) Digital image processing, 3rd edn. Prentice Hall, New Jersey, USA

  40. Zidouri A (2007) PCA-based Arabic character feature extraction. In: 9th international symposium on signal processing and its applications (ISSPA 2007). Sharjah, United Arab Emirates, pp 1–4

  41. Kurt Z, Turkmen HI, Karsligil ME (2009) Linear discriminant analysis in Ottoman alphabet character recognition. In: The European computing conference. Tbilisi, Georgia, pp 601–607

  42. Trenkle J, Gillies A, Erlandson E, Schlosser S, Cavin S (2001) Advances in Arabic text recognition. In: Symposium on document image understanding technology. Maryland, USA

  43. Yalniz IZ, Altingovde IS, Güdükbay U, Ulusoy Ö (2009) Integrated segmentation and recognition of connected Ottoman script. Opt Eng 48(11):117205

  44. Sabbour N, Shafait F (2013) A segmentation-free approach to Arabic and Urdu OCR. In: IS&T/SPIE electronic imaging, SPIE digital library, USA, pp 86580N-86580N-12

  45. Abandah GA, Younis KS, Khedher MZ (2008) Handwritten Arabic character recognition using multiple classifiers based on letter form. In: The 5th IASTED international conference on signal processing, pattern recognition and applications (SPPRA 2008). Innsbruck, Austria, pp 128–133

  46. Alma’adeed S, Higgins C, Elliman D (2002) Recognition of off-line handwritten Arabic words using hidden Markov model approach. In: The 16th international conference on pattern recognition (ICPR’02). Quebec, Canada, pp 481–484

  47. Bushofa B, Spann M (1997) Segmentation and recognition of Arabic characters by structural classification. Image Vis Comput 15:167–179

  48. Mehran R, Pirsiavash H, Razzazi F (2005) A front-end OCR for Omni-font Persian/Arabic cursive printed documents. In: Digital image computing: techniques and applications (DICTA’05). Cairns, Australia, pp 56–64

  49. Rahman AFR, Fairhurst MC (2003) Multiple classifier decision combination strategies for character recognition: a review. Int J Doc Anal Recogn 5:166–194

  50. OpenCV (2002) Rapid object detection with a cascade of boosted classifiers based on Haar-like features. OpenCV haartraining tutorial

  51. Sonka M, Hlavac V, Boyle R (1998) Image processing: analysis and machine vision, 2nd edn. Thomson Learning Vocational, Cengage Learning, New Delhi, India

  52. Box GEP, Muller ME (1958) A note on the generation of random normal deviates. Ann Math Stat 29:610–611

  53. Seo N (2008) Tutorial: OpenCV haartraining (rapid object detection with a cascade of boosted classifiers based on Haar-like features)

  54. IRIS (2011) Readiris 12 pro. http://www.irislink.com/c2-1684-225/Readiris-12-for-Windows.aspx. Accessed 27 Jul 2011

  55. Kohavi R, Provost F (1998) Glossary of terms. Special issue on applications of machine learning and the knowledge discovery process. Mach Learn 30:271–274

  56. IRIS (2004) Readiris pro 10, 10th edn

  57. Schulz KU, Mihov S (2002) Fast string correction with Levenshtein automata. Int J Doc Anal Recogn 5:67–85

  58. García S, Herrera F (2008) An extension on statistical comparisons of classifiers over multiple data sets for all pairwise comparisons. J Mach Learn Res 9:2677–2694

  59. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  60. Webb GI (2000) Multiboosting: a technique for combining boosting and wagging. Mach Learn 40:159–196

  61. Zaiontz C (2013–2015) The data analysis for this paper was generated using the real statistics resource pack software (release 3.5). http://www.real-statistics.com. Accessed 26 Feb 2015

Corresponding author

Correspondence to Ashraf AbdelRaouf.

Cite this article

AbdelRaouf, A., Higgins, C.A., Pridmore, T. et al. Arabic character recognition using a Haar cascade classifier approach (HCC). Pattern Anal Applic 19, 411–426 (2016). https://doi.org/10.1007/s10044-015-0466-2

  • DOI: https://doi.org/10.1007/s10044-015-0466-2
