Text Segmentation for Document Recognition

  • Reference work entry
Handbook of Document Image Processing and Recognition

Abstract

Document segmentation is the process of dividing a document (handwritten or printed) into its base components (lines, words, characters). Once the text and non-text zones have been identified, segmentation of the text elements can begin. Several challenges must be overcome to segment these elements correctly. In line segmentation, touching, broken, or overlapping text lines occur frequently, and handwritten documents pose the additional challenge of curvilinear lines. Once a line has been segmented, it is further segmented into characters, where similar problems of touching and broken elements arise.

Noise adds a further level of complexity: it can come from scanning, photocopying, or physical damage, and historical documents typically show some degree of degradation. In addition, variation in typefaces for printed text and in writing styles for handwritten text creates further difficulties for segmentation and recognition algorithms.

This chapter describes methodologies from recent research that propose solutions to these obstacles. For line segmentation, horizontal projection, region growth techniques, probability density, and the level set method are presented as possible, albeit partial, solutions. A method of angle stepping to detect the angles of slanted lines is also presented. To locate the boundaries of characters in degraded historical documents, multi-level classifiers and a level set active contour scheme are described as a possible solution. Mathematical expressions are generally more complex because their layout does not follow that of standard text blocks: lines can be composed of split sections (numerator and denominator), can have symbols spanning and overlapping other elements, and contain a higher concentration of superscript and subscript characters than regular text lines. Template matching is described as a partial solution for segmenting these characters.
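
As an illustration of the simplest technique named above, horizontal projection, the following is a minimal sketch and not the chapter's own implementation. It assumes a binarized page stored as a NumPy array in which nonzero pixels are ink; the function name `segment_lines` and the `min_ink` threshold are hypothetical. It counts ink pixels per row and treats each run of rows at or above the threshold as one text line.

```python
import numpy as np

def segment_lines(binary, min_ink=1):
    """Return half-open (top, bottom) row ranges of text lines.

    binary:  2-D array, nonzero = ink (assumed already binarized).
    min_ink: hypothetical threshold; rows with fewer ink pixels
             are treated as gaps between lines.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = (np.asarray(binary) > 0).sum(axis=1)
    is_text = profile >= min_ink
    # Pad with False so every run of text rows has a start and an end.
    padded = np.concatenate(([False], is_text, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends)]

# Synthetic page with two well-separated horizontal "lines" of ink.
page = np.zeros((12, 20), dtype=np.uint8)
page[1:4, 2:18] = 1
page[6:9, 3:15] = 1
print(segment_lines(page))
```

A profile of this kind only separates lines that have a clear gap of empty rows between them. Touching, overlapping, slanted, or curvilinear lines leave no such gap, which is consistent with the abstract calling projection a partial solution.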

The methods described here apply to both printed and handwritten documents. They have been tested on Latin-based scripts as well as on Arabic, Dari, Farsi, Pashto, and Urdu.

Further Reading

  • Bunke H, Wang P (1997) Handbook of character recognition and document image analysis. World Scientific, Singapore

  • Chaudhuri B (2007) Digital document processing: major directions and recent advances. Springer, London

  • Cheriet M, Kharma N, Liu C-L, Suen C (2007) Character recognition systems: a guide for students and practitioners. Wiley, Hoboken

  • Li H, Doermann D, Zheng Y (2008) Handwritten document image processing: identification, matching, and indexing of handwriting in noisy document images. VDM, Saarbrücken

  • Marinai S, Fujisawa H (eds) (2010) Machine learning in document analysis and recognition, 1st edn. Studies in computational intelligence, vol 90. Springer, Berlin

Author information

Corresponding author

Correspondence to Nicola Nobile.

Copyright information

© 2014 Springer-Verlag London

Cite this entry

Nobile, N., Suen, C.Y. (2014). Text Segmentation for Document Recognition. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_8
