Abstract
Segmentation is a significant stage in the recognition of old newspapers. Text-line extraction from documents such as newspaper pages, which have very complex layouts, poses a significant challenge. Old newspaper documents printed in Gurumukhi script present several hurdles for segmentation: noise, degradation, ink bleed-through, multiple font styles and sizes, little space between neighboring text lines, overlapping lines, etc. Because of the low quality and complexity of these documents, automatic text-line segmentation remains an open research field. Very few studies in the literature address the segmentation of news articles in Gurumukhi script, and this is one of the first attempts to recognize Gurumukhi newspaper text. The goal of this paper is to present a new methodology for text-line extraction that integrates median calculation and strip height calculation techniques. The unsuitability of existing techniques for segmenting newspaper text lines is also discussed, with results, in the article. The efficiency of the proposed algorithm is demonstrated by experiments on two distinct datasets prepared by the authors: (a) a dataset of single-column documents with a headline block and (b) a dataset of multi-column documents with a headline block.
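The abstract names median calculation and strip height calculation but does not spell out the procedure, so the following is only a minimal sketch of one common way such ideas are combined, not the authors' LineSeg algorithm. It assumes a binarized image given as rows of 0/1 pixels, finds horizontal strips of ink from a row projection profile, and splits any strip much taller than the median strip height at its sparsest interior row. The function names, the `factor` threshold, and the search window of a quarter of the median height are all hypothetical choices made for illustration.

```python
from statistics import median


def row_profile(image):
    """Count foreground (1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def find_strips(profile):
    """Return (start, end) row ranges, end exclusive, that contain ink."""
    strips, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            strips.append((start, y))
            start = None
    if start is not None:
        strips.append((start, len(profile)))
    return strips


def split_tall_strips(strips, profile, factor=1.5):
    """Split strips taller than factor * median height (assumed heuristic)."""
    if not strips:
        return []
    med = median(end - start for start, end in strips)
    window = int(med // 4)  # assumed search radius around each expected cut
    lines = []
    for start, end in strips:
        height = end - start
        parts = max(1, round(height / med)) if height > factor * med else 1
        cuts = [start]
        for k in range(1, parts):
            guess = start + round(k * height / parts)
            lo = min(max(cuts[-1] + 1, guess - window), end - 1)
            hi = max(min(end - 1, guess + window), lo)
            # Cut at the row with the least ink near the expected boundary.
            cuts.append(min(range(lo, hi + 1), key=lambda y: profile[y]))
        cuts.append(end)
        lines.extend(zip(cuts[:-1], cuts[1:]))
    return lines
```

Calling `split_tall_strips(find_strips(p), p)` with `p = row_profile(image)` returns one row range per estimated text line. A strip formed by two touching lines is divided near the row where the least ink crosses it, and the touching row itself is assigned to the lower line.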
Availability of data and materials
During our research, we were hampered by the lack of a public dataset and therefore have no benchmark against which to compare our algorithm with others. A public dataset may help other researchers working on similar projects, so we have decided to share our raw data for experimental work.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Kaur, R.P., Jindal, M.K., Kumar, M. et al. LineSeg: line segmentation of scanned newspaper documents. Pattern Anal Applic 25, 189–208 (2022). https://doi.org/10.1007/s10044-021-01031-6
DOI: https://doi.org/10.1007/s10044-021-01031-6