Text Segmentation for Document Recognition

Nobile, Nicola; Suen, Ching Y.

doi:10.1007/978-0-85729-859-1_8

Abstract

Document segmentation is the process of dividing a document (handwritten or printed) into its base components (lines, words, characters). Once the text and non-text zones have been identified, segmentation of the text elements can begin. Several challenges must be resolved to segment these elements correctly. In line segmentation, touching, broken, or overlapping text lines occur frequently, and handwritten documents add the challenge of curvilinear lines. Once a line has been segmented, it is further divided into characters, where similar problems of touching and broken elements arise.

Noise adds a further level of complexity: it can come from scanning, photocopying, or physical damage, and historical documents typically show some degradation. In addition, variation in typefaces for printed text and in writing styles for handwritten text brings new difficulties for segmentation and recognition algorithms.

This chapter describes methodologies from recent research that propose solutions to these obstacles. For line segmentation, horizontal projection, region growth techniques, probability density, and the level set method are presented as possible, albeit partial, solutions. A method of angle stepping to detect the angles of slanted lines is presented. To locate the boundaries of characters in degraded historical documents, multi-level classifiers and a level set active contour scheme are presented as a possible solution. Mathematical expressions are generally more complex because their layout does not follow that of typical text blocks: lines can be composed of split sections (numerator and denominator), can have symbols spanning and overlapping other elements, and contain a higher concentration of superscript and subscript characters than regular text lines. Template matching is described as a partial solution for segmenting these characters.
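To make the simplest of these ideas concrete, the following is a minimal sketch of line segmentation by horizontal projection. It is not the chapter's implementation: the function names, the `min_ink` threshold, and the toy binary image are assumptions introduced here for illustration. The sketch counts ink pixels per row and treats each run of non-empty rows as one text line.

```python
from typing import List, Tuple

# Assumed representation: a binary image as rows of 0 (background) and 1 (ink).
Image = List[List[int]]


def horizontal_projection(image: Image) -> List[int]:
    """Count the ink pixels in each row of the image."""
    return [sum(row) for row in image]


def segment_lines(image: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of consecutive
    rows whose ink count is at least min_ink (a hypothetical threshold)."""
    profile = horizontal_projection(image)
    lines: List[Tuple[int, int]] = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                  # first inked row of a new line
        elif count < min_ink and start is not None:
            lines.append((start, y))   # a gap row closes the current line
            start = None
    if start is not None:              # a line that touches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
line_spans = segment_lines(page)
print(line_spans)  # [(1, 3), (5, 7)]
```

A profile like this only separates lines that are straight and divided by an empty gap. Touching, overlapping, slanted, or curvilinear lines leave no clean gap in the profile, which is consistent with the abstract describing horizontal projection as a partial solution.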

The methods described here apply to both printed and handwritten text. They have been tested on Latin-based scripts as well as Arabic, Dari, Farsi, Pashto, and Urdu.

Author information

Authors and Affiliations

Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Concordia University, Montréal, QC, Canada
Nicola Nobile
Department of Computer Science and Software Engineering, Centre for Pattern Recognition and Machine Intelligence (CENPARMI), Concordia University, Montréal, QC, Canada
Ching Y. Suen

Corresponding author

Correspondence to Nicola Nobile.

Editor information

Editors and Affiliations

University of Maryland, College Park, MD, USA
David Doermann
Université de Lorraine, Nancy, France
Karl Tombre

About this entry

Cite this entry

Nobile, N., Suen, C.Y. (2014). Text Segmentation for Document Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_8

DOI: https://doi.org/10.1007/978-0-85729-859-1_8
Published: 24 July 2019
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
