An adaptive text-line extraction algorithm for printed Arabic documents with diacritics

Multimedia Tools and Applications

Abstract

The performance of document text recognition depends on text-line segmentation algorithms, which in turn depend heavily on the language, the author’s writing style, the pen type, and the document quality. In this paper, we present a novel unsupervised text-line segmentation algorithm for printed Arabic documents with and without diacritics. The approach combines a projection profile with connected components in an iterative manner to detect text lines. Its primary benefits are that (i) it is not threshold dependent, (ii) it requires no training phase for threshold selection, and (iii) it is robust to page rotation and to variation in font type, size, and style, for documents both with and without diacritics. Extensive computational simulations on a manually collected dataset demonstrate the efficiency of the proposed scheme compared with several baseline and state-of-the-art methods, including the Voronoi, X-Y Cut, Docstrum, Smearing, and Seam-carving methods. A computational time analysis is also presented.
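
To make the term "projection profile" concrete, the following minimal Python sketch shows only the basic idea: count ink pixels per row of a binarized page and treat each run of inked rows as a candidate text line. It is not the algorithm of this paper; the connected-component step and the iteration are not reproduced, and the function names (`horizontal_profile`, `line_bands`) and the toy page are assumptions made for illustration.

```python
import numpy as np


def horizontal_profile(binary):
    """Count ink pixels (value 1) in every row of a binary page image."""
    return binary.sum(axis=1)


def line_bands(binary):
    """Return (top, bottom) row ranges, bottom exclusive, of consecutive inked rows.

    Illustrative helper, not taken from the paper.
    """
    inked = horizontal_profile(binary) > 0
    # Pad with False so every band has a detectable start and end.
    padded = np.concatenate(([False], inked, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    # Changes alternate: start of a band, end of a band, start, end, ...
    return [(int(a), int(b)) for a, b in zip(changes[0::2], changes[1::2])]


# Toy 8x6 "page": two text lines separated by blank rows.
page = np.zeros((8, 6), dtype=np.uint8)
page[1:3, 1:5] = 1   # first line occupies rows 1-2
page[5:7, 0:4] = 1   # second line occupies rows 5-6

bands = line_bands(page)
print(bands)
```

In a profile-only sketch like this, diacritics separated from their base line by a blank row would show up as bands of their own. Attaching such marks to the correct line is the kind of case a connected-component step can address, and it is not attempted here.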


Notes

  1. http://sites.birzeit.edu/bzuocr/data-sets

Author information

Corresponding author

Correspondence to Khader Mohammad.

About this article

Cite this article

Mohammad, K., Qaroush, A., Washha, M. et al. An adaptive text-line extraction algorithm for printed Arabic documents with diacritics. Multimed Tools Appl 80, 2177–2204 (2021). https://doi.org/10.1007/s11042-020-09737-1

  • DOI: https://doi.org/10.1007/s11042-020-09737-1
