LineSeg: line segmentation of scanned newspaper documents

  • Industrial and Commercial Application
  • Pattern Analysis and Applications

Abstract

Segmentation is a significant stage in the recognition of old newspapers. Text-line extraction from documents such as newspaper pages, which have very complex layouts, poses a significant challenge. Old newspaper documents printed in Gurumukhi script present several hurdles to segmentation: noise, degradation, ink bleed-through, multiple font styles and sizes, little space between neighboring text lines, and overlapping lines. Because of the low quality and complexity of these documents, automatic text-line segmentation remains an open research problem. Very few studies in the literature address the segmentation of news articles in Gurumukhi script, and this is one of the first attempts to recognize Gurumukhi newspaper text. The goal of this paper is to present a new methodology for text-line extraction that integrates median calculation and strip height calculation techniques. The unsuitability of existing techniques for segmenting newspaper text lines is also discussed, with results, in the article. The efficiency of the proposed algorithm is demonstrated by experiments on two distinct in-house datasets: (a) single-column documents with a headline block and (b) multi-column documents with a headline block.
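
The abstract names median calculation and strip height calculation as the two ingredients, but this preview does not describe how they are combined. The following is only a minimal sketch of one plausible reading, not the authors' algorithm: it assumes a binarized, deskewed single-column image (ink = 1, background = 0), finds horizontal strips of consecutive ink rows, and splits any strip much taller than the median strip height, on the assumption that such a strip contains overlapping lines. The names `find_strips` and `split_tall_strips` and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def find_strips(binary):
    """Return (start, end) row ranges of consecutive rows that contain ink.

    `binary` is a 2-D array with ink pixels set to 1; `end` is exclusive.
    """
    has_ink = binary.sum(axis=1) > 0
    strips, start = [], None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row                      # a new strip begins
        elif not ink and start is not None:
            strips.append((start, row))      # the strip ended on the previous row
            start = None
    if start is not None:                    # strip runs to the bottom edge
        strips.append((start, len(has_ink)))
    return strips


def split_tall_strips(strips, tolerance=1.5):
    """Split strips taller than `tolerance` x median height into equal parts.

    Assumption for illustration: an over-tall strip holds several merged
    text lines of roughly the median height.
    """
    if not strips:
        return []
    heights = [end - start for start, end in strips]
    median_height = float(np.median(heights))
    lines = []
    for start, end in strips:
        height = end - start
        parts = 1
        if height > tolerance * median_height:
            parts = int(round(height / median_height))
        bounds = np.linspace(start, end, parts + 1).astype(int)
        lines.extend(zip(bounds[:-1].tolist(), bounds[1:].tolist()))
    return lines


# Synthetic page: three normal lines and one strip where two lines touch.
page = np.zeros((40, 10), dtype=np.uint8)
page[0:5, 2:8] = 1
page[7:12, 2:8] = 1
page[14:24, 2:8] = 1    # two merged lines, twice the usual height
page[26:31, 2:8] = 1

lines = split_tall_strips(find_strips(page))
```

On the synthetic page, the double-height strip is cut at its midpoint, so `lines` holds five row ranges instead of four. Equal-height splitting is the simplest possible choice; real newspaper scans with headline blocks and multiple columns would need column separation and a less naive cut position first.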

Availability of data and materials

During our research, the lack of a public dataset was a major obstacle, and as a result we have no benchmark against which to compare our algorithm with others. Because a public dataset may help other researchers working on similar projects, we have decided to share our raw data for experimental work.

Author information

Corresponding author

Correspondence to Munish Kumar.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kaur, R.P., Jindal, M.K., Kumar, M. et al. LineSeg: line segmentation of scanned newspaper documents. Pattern Anal Applic 25, 189–208 (2022). https://doi.org/10.1007/s10044-021-01031-6

  • DOI: https://doi.org/10.1007/s10044-021-01031-6
