Abstract
Document layout analysis is a key step in converting document images into text. Arabic script is cursive and written in different styles, which poses challenges for the analysis of Arabic text documents. In this paper, we introduce an approach for Arabic document layout analysis. In this approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used in zone classification combine texture-based features and connected-component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using an adaptive sample set clustering algorithm. Each segmented line is then segmented into words by clustering inter- and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis. The evaluation results show that the proposed system works well on multi-font and multi-size documents with a variety of layouts, and even on some historical documents.
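The word-segmentation step described above (separating a text line into words by clustering the gaps between components) can be illustrated with a minimal sketch. The sketch assumes that each connected component in a line is already reduced to a horizontal extent `(start, end)` in pixels, and it uses a plain one-dimensional 2-means split of the gap widths as a stand-in for the clustering used in the paper, which the abstract does not specify. The function names `split_gaps_two_means` and `segment_words` are hypothetical.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start, end) of a component along the text line


def split_gaps_two_means(gaps: List[float], iters: int = 50) -> float:
    """Return a threshold between small (intra-word) and large (inter-word) gaps.

    Plain 1-D 2-means, used here only as an illustrative stand-in.
    """
    if not gaps:
        raise ValueError("need at least one gap")
    lo, hi = min(gaps), max(gaps)
    if lo == hi:
        # All gaps are equal: no evidence of two classes, so treat
        # every gap as intra-word (nothing exceeds the threshold).
        return hi
    for _ in range(iters):
        # Assign each gap to the nearer of the two cluster centres.
        small = [g for g in gaps if abs(g - lo) <= abs(g - hi)]
        large = [g for g in gaps if abs(g - lo) > abs(g - hi)]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def segment_words(extents: List[Extent]) -> List[List[Extent]]:
    """Group the component extents of one text line into words."""
    if not extents:
        return []
    ordered = sorted(extents)
    # Gap between consecutive components; overlapping components count as 0.
    gaps = [max(0, b[0] - a[1]) for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return [ordered]
    threshold = split_gaps_two_means([float(g) for g in gaps])
    words: List[List[Extent]] = [[ordered[0]]]
    for gap, extent in zip(gaps, ordered[1:]):
        if gap > threshold:
            words.append([extent])   # large gap: start a new word
        else:
            words[-1].append(extent)  # small gap: same word
    return words


if __name__ == "__main__":
    line = [(0, 10), (12, 20), (40, 50), (52, 60), (62, 70)]
    for word in segment_words(line):
        print(word)
```

The grouping is independent of reading direction, so the same logic applies to right-to-left Arabic lines once the extents are sorted along the line. A fixed threshold would be simpler, but deriving it from the gaps of each line is one way to cope with the multi-font and multi-size documents the abstract mentions.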
Acknowledgements
This project was funded by the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, the Kingdom of Saudi Arabia, award number (11-INF-1997-03). The author thanks the Science and Technology Unit, King Abdulaziz University, for technical support. The author would also like to thank the RDI team, especially Eng. Shaimaa Samir, for her work on developing and testing the results.
Cite this article
Hesham, A.M., Rashwan, M.A.A., Al-Barhamtoshy, H.M. et al. Arabic document layout analysis. Pattern Anal Applic 20, 1275–1287 (2017). https://doi.org/10.1007/s10044-017-0595-x
DOI: https://doi.org/10.1007/s10044-017-0595-x