An adaptive text-line extraction algorithm for printed Arabic documents with diacritics

Mohammad, Khader; Qaroush, Aziz; Washha, Mahdi; Agaian, Sos; Tumar, Iyad

doi:10.1007/s11042-020-09737-1

Published: 11 September 2020

Volume 80, pages 2177–2204 (2021)
Multimedia Tools and Applications

Abstract

The performance of document text recognition depends on text-line segmentation algorithms, which in turn rely heavily on the language, the author’s writing style, the pen type, and the document quality. In this paper, we present a novel unsupervised text-line segmentation algorithm for printed Arabic documents with and without diacritics. The approach iteratively combines a projection profile with connected components to detect text lines. The primary benefits of the presented algorithm are that (i) it is not threshold dependent, (ii) it requires no training phase for threshold selection, and (iii) it is robust to page rotation and to variations in font type, size, and style, for documents both with and without diacritics. Extensive computational simulations on a manually collected dataset demonstrate the efficiency of the proposed scheme compared with several baseline and state-of-the-art methods, including the Voronoi, X-Y Cut, Docstrum, Smearing, and Seam-carving methods. A computational time analysis is also presented.
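For readers unfamiliar with projection profiles, the following is a minimal sketch of that single ingredient, not the authors' algorithm. It assumes a binarized page stored as a NumPy array in which ink pixels are 1 and background pixels are 0, and it splits the page wherever a row contains no ink. The function names `horizontal_profile` and `split_lines` are hypothetical, and the connected-component analysis and iterative refinement described in the abstract are deliberately left out.

```python
import numpy as np


def horizontal_profile(binary):
    """Count ink pixels (value 1) in each row of a binary page image."""
    return binary.sum(axis=1)


def split_lines(binary):
    """Return (top, bottom) row ranges, bottom exclusive, one per run of inked rows.

    Illustrative assumption: text lines are separated by at least one
    completely empty row. This is NOT the paper's adaptive procedure.
    """
    has_ink = horizontal_profile(binary) > 0
    # Pad with False on both sides so every run has a visible start and end.
    padded = np.concatenate(([False], has_ink, [False]))
    # Indices where the ink/no-ink state flips: starts and ends alternate.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(top), int(bottom)) for top, bottom in zip(changes[::2], changes[1::2])]


# Toy page: two bands of ink separated by empty rows.
page = np.zeros((8, 5), dtype=np.uint8)
page[1:3, 1:4] = 1
page[5:7, 0:2] = 1
bands = split_lines(page)
```

On the toy page, `bands` holds one row range per band of ink. A plain profile like this fails as soon as consecutive lines have no fully empty row between them, for example on a rotated page, which is one reason a profile alone is usually combined with further analysis.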

Notes

http://sites.birzeit.edu/bzuocr/data-sets

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Birzeit University, Birzeit, Palestine
Khader Mohammad, Aziz Qaroush, Mahdi Washha & Iyad Tumar
College of Staten Island, The City University of New York, New York, NY, USA
Sos Agaian

Corresponding author

Correspondence to Khader Mohammad.

About this article

Cite this article

Mohammad, K., Qaroush, A., Washha, M. et al. An adaptive text-line extraction algorithm for printed Arabic documents with diacritics. Multimed Tools Appl 80, 2177–2204 (2021). https://doi.org/10.1007/s11042-020-09737-1

Received: 04 March 2020
Revised: 23 July 2020
Accepted: 26 August 2020
Published: 11 September 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11042-020-09737-1
