Abstract
This paper presents a method for text-line and word segmentation of unconstrained handwritten documents. It uses the horizontal projection histogram (HPH) to detect line mid-points and to trail the gap between lines. The mid-points are estimated from the HPH of the first 100 to 200 columns of the document. Starting from these mid-points, the gap between two consecutive lines is tracked using an HPH computed locally over a block of k rows and j columns. Each HPH block is examined under various cases to locate the optimal rows that separate adjacent lines. The proposed method segments curved, touching and skewed lines, is robust to writing variation, and is language independent. Word segmentation is not treated as a separate problem; it proceeds efficiently alongside line segmentation. As the space between neighboring lines is trailed, the vertical projection histogram (VPH) of t columns between the upper and lower separators of a line is monitored to find the optimal word separator. The algorithm is evaluated on two separate datasets in different languages (Meitei Mayek and English). On Meitei Mayek handwritten documents, text-line and word segmentation achieve 91.84% and 88.96% accuracy, respectively. On handwritten English documents, line and word segmentation achieve 94.18% and 87.73% accuracy, respectively.
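The two projection histograms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binarized page stored as a NumPy array with ink pixels equal to 1, and it omits the block-wise gap trailing entirely. The function names (`runs`, `line_midpoints`, `word_separators`) and the parameters `n_cols` and `min_gap` are hypothetical, chosen only for illustration; `min_gap` plays a role loosely analogous to the abstract's t columns.

```python
import numpy as np

def runs(mask):
    """Return (start, end_exclusive) index pairs for each run of True values."""
    padded = np.concatenate(([False], np.asarray(mask, dtype=bool), [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]

def line_midpoints(binary, n_cols=150):
    """Estimate one mid-point row per text line from the HPH of the first n_cols columns.

    Assumption: a text line shows up as a run of rows with at least one ink pixel.
    """
    hph = binary[:, :n_cols].sum(axis=1)  # horizontal projection histogram
    return [(s + e - 1) // 2 for s, e in runs(hph > 0)]

def word_separators(band, min_gap=3):
    """Find candidate word-separator columns inside one line band using the VPH.

    Assumption: a word gap is a run of at least min_gap ink-free columns that
    does not touch the left or right edge of the band (margins are ignored).
    """
    vph = band.sum(axis=0)  # vertical projection histogram
    width = band.shape[1]
    return [
        (s + e - 1) // 2
        for s, e in runs(vph == 0)
        if e - s >= min_gap and s > 0 and e < width
    ]
```

In this sketch a separator is placed at the center of each sufficiently wide blank run. A real page with touching or skewed lines would need the local, block-wise tracking that the abstract describes, which a single global histogram cannot provide.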
Cite this article
Sanasam, I., Choudhary, P. & Singh, K.M. Line and word segmentation of handwritten text document by mid-point detection and gap trailing. Multimed Tools Appl 79, 30135–30150 (2020). https://doi.org/10.1007/s11042-020-09416-1
DOI: https://doi.org/10.1007/s11042-020-09416-1