Arabic document layout analysis

  • Industrial and Commercial Application
Pattern Analysis and Applications

Abstract

Document layout analysis is a key step in converting document images into text. Arabic script is cursive and written in different styles, which poses challenges for the analysis of Arabic text documents. In this paper, we introduce an approach to Arabic document layout analysis. In this approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used in zone classification combine texture-based features and connected-component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using an adaptive sample set clustering algorithm. Each line is then segmented into words by clustering inter-word and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis. The evaluation results show that the proposed system works well on multi-font and multi-size documents with a variety of layouts, and even on some historical documents.
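The word segmentation step can be illustrated with a minimal sketch. It assumes a text line has already been reduced to a vertical projection profile (ink pixels per image column) and uses a simple one-dimensional 2-means over gap widths to separate narrow intra-word gaps from wide inter-word gaps. The abstract does not specify the clustering method, so the 2-means choice and the function names `column_gaps`, `split_threshold` and `segment_words` are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def column_gaps(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, width) for each run of empty columns lying between ink columns."""
    gaps = []
    start = None
    seen_ink = False
    for x, ink in enumerate(profile):
        if ink > 0:
            # Close an open gap only if ink appeared before it (skips the left margin).
            if start is not None and seen_ink:
                gaps.append((start, x - start))
            start = None
            seen_ink = True
        elif start is None:
            start = x
    return gaps  # a trailing empty run (right margin) is never closed, so it is ignored


def split_threshold(widths: List[int], iterations: int = 20) -> float:
    """1-D 2-means over gap widths; returns the midpoint between the two centroids."""
    lo, hi = float(min(widths)), float(max(widths))
    for _ in range(iterations):
        mid = (lo + hi) / 2
        small = [w for w in widths if w <= mid]
        large = [w for w in widths if w > mid]
        if not large:  # all gaps have the same width: only one cluster exists
            break
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def segment_words(profile: List[int]) -> List[Tuple[int, int]]:
    """Split a line into words; each word is an inclusive (first_col, last_col) span."""
    ink_columns = [x for x, v in enumerate(profile) if v > 0]
    if not ink_columns:
        return []
    first, last = ink_columns[0], ink_columns[-1]
    gaps = column_gaps(profile)
    if not gaps:
        return [(first, last)]
    threshold = split_threshold([width for _, width in gaps])
    words = []
    start = first
    for gap_start, width in gaps:
        if width > threshold:  # wide gap: treated as an inter-word space
            words.append((start, gap_start - 1))
            start = gap_start + width
    words.append((start, last))
    return words


profile = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
print(segment_words(profile))  # [(1, 5), (10, 13)]
```

In this toy profile the one-column gaps stay inside a word, while the four-column gap is treated as a word boundary. Column indices are image coordinates, so the reading order for Arabic would run from the highest index to the lowest. When all gaps have the same width the sketch returns a single word, which is one reason a real system would need a more robust clustering scheme.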

Acknowledgements

This project was funded by the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia, award number 11-INF-1997-03. The author thanks the Science and Technology Unit, King Abdulaziz University, for technical support. The author also thanks the RDI team, especially Eng. Shaimaa Samir, for her development and testing work.

Author information

Corresponding author

Correspondence to Amany M. Hesham.

About this article

Cite this article

Hesham, A.M., Rashwan, M.A.A., Al-Barhamtoshy, H.M. et al. Arabic document layout analysis. Pattern Anal Applic 20, 1275–1287 (2017). https://doi.org/10.1007/s10044-017-0595-x

  • DOI: https://doi.org/10.1007/s10044-017-0595-x
