Abstract
This paper presents a new approach to text-line segmentation based on Block Covering, which solves the problem of overlapping and multi-touching components. Block Covering is the core of a system that processes a set of ancient Arabic documents from historical archives and separates text-lines even when they overlap or touch at multiple points. We exploit the Block Covering technique in three steps: a new fractal analysis (Block Counting) for document classification, a statistical analysis of block heights for block classification, and a neighboring analysis for building text-lines. The Block Counting fractal analysis, combined with a fuzzy C-means scheme, is performed on document images in order to classify them according to their complexity: tightly (closely) spaced documents (TSD) or widely spaced documents (WSD). An optimal Block Covering is applied to TSD documents, which include overlapping and multi-touching lines. The large blocks generated by the covering are then segmented using the statistical analysis of block heights. The final labeling into text-lines is based on a block neighboring analysis. Experimental results on images from the Tunisian Historical Archives demonstrate the feasibility of the Block Covering technique for segmenting ancient Arabic documents.
Appendix 1. Composing density between and within clusters (CDbw) criterion
CDbw is defined as the product:
\[ \mathrm{CDbw} = \mathrm{intra\_den} \times \mathrm{sep} \]
where intra_den expresses the quality of intra-class clustering and sep is the cluster separation measure.
We now calculate the terms of the product.
For a given number of vertical strips vsn = 1/r, let \( V_i = \{ v_{i1}, v_{i2}, \ldots, v_{in_i} \} \) be the set of blocks of the \( i \)th class, and \( n_i \) be the number of blocks in this class. The standard deviation \( \mathrm{stddev}(i) \) of the \( i \)th class is defined as:
with \( h_{ik} \) being the height of the \( k \)th block of the \( i \)th class and \( m_i \) being the average height of the blocks of the \( i \)th class.
The average stddev is:
The quality of intra-class clustering, denoted by intra_den, is defined as:
with \( \mathrm{density}(v_{ij}) \) defined as:
and \( f(v_{il}, v_{ij}) \) defined as:
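Transposed to block heights, the standard CDbw formulation suggests the following form for the intra-class quantities introduced so far. This is a sketch under assumptions rather than the authors' exact equations: the symbol \( c \) (number of classes), the \( 1/n_i \) normalisation, the root-mean-square average, and the use of the average stddev as neighborhood radius are all assumed.

```latex
\[
\mathrm{stddev}(i) = \sqrt{\frac{1}{n_i}\sum_{k=1}^{n_i}\left(h_{ik}-m_i\right)^{2}},
\qquad
\mathrm{stddev} = \sqrt{\frac{1}{c}\sum_{i=1}^{c}\mathrm{stddev}(i)^{2}}
\]
\[
\mathrm{intra\_den} = \frac{1}{c}\sum_{i=1}^{c}\sum_{j=1}^{n_i}\mathrm{density}(v_{ij}),
\qquad
\mathrm{density}(v_{ij}) = \sum_{l=1}^{n_i} f(v_{il}, v_{ij})
\]
\[
f(v_{il}, v_{ij}) =
\begin{cases}
1 & \text{if } \left|h_{il}-h_{ij}\right| \le \mathrm{stddev},\\
0 & \text{otherwise.}
\end{cases}
\]
```

Under this reading, \( \mathrm{density}(v_{ij}) \) counts the blocks of class \( i \) whose height lies within one average standard deviation of the height of \( v_{ij} \), so compact classes yield a high intra_den.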
The interclass density Inter_den counts the blocks lying in the close neighborhood of several classes. This density should be very low. It is defined as:
\( u_{ij} \) is a virtual block of height \( h_{ij} = (m_i + m_j)/2 \).
\( \mathrm{density}(u_{ij}) \) is defined as:
with \( v_k \) belonging to the union of the sets of blocks of classes \( i \) and \( j \).
\( f(v_k, u_{ij}) \) is defined as:
The cluster separation measure is defined as:
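The criterion described in this appendix can be sketched in Python for one-dimensional block heights. This is a minimal illustration and not the authors' implementation: the function name `cdbw_score` is hypothetical, and the exact forms used here are assumptions borrowed from the standard CDbw definition. In particular, Inter_den is taken as the sum over ordered class pairs of \( |m_i - m_j| / (\mathrm{stddev}(i) + \mathrm{stddev}(j)) \) times \( \mathrm{density}(u_{ij}) \), the neighborhood radius around \( u_{ij} \) is taken as \( (\mathrm{stddev}(i) + \mathrm{stddev}(j))/2 \), and sep is taken as the sum of \( |m_i - m_j| \) divided by \( 1 + \mathrm{Inter\_den} \).

```python
import numpy as np


def cdbw_score(heights, labels):
    """CDbw-style validity score for block heights (sketch, assumed formulas)."""
    heights = np.asarray(heights, dtype=float)
    labels = np.asarray(labels)
    groups = [heights[labels == k] for k in np.unique(labels)]
    c = len(groups)
    if c < 2:
        raise ValueError("at least two classes are required")

    means = np.array([g.mean() for g in groups])    # m_i
    stds = np.array([g.std() for g in groups])      # stddev(i), 1/n_i form (assumed)
    avg_std = np.sqrt(np.mean(stds ** 2))           # average stddev, RMS form (assumed)

    # intra_den: for every block, count same-class blocks whose height is
    # within avg_std of it, then average the totals over the classes.
    intra_den = 0.0
    for g in groups:
        diffs = np.abs(g[:, None] - g[None, :])
        intra_den += np.count_nonzero(diffs <= avg_std)
    intra_den /= c

    # Inter_den and the numerator of sep, over ordered pairs of classes.
    inter_den = 0.0
    mean_gap_sum = 0.0
    for i in range(c):
        for j in range(c):
            if i == j:
                continue
            gap = abs(means[i] - means[j])
            mean_gap_sum += gap
            spread = stds[i] + stds[j]
            if spread == 0.0:
                # Both classes have zero spread: skip the term (assumed
                # convention, avoids a division by zero).
                continue
            u_ij = (means[i] + means[j]) / 2.0       # virtual block height h_ij
            both = np.concatenate([groups[i], groups[j]])
            density_u = np.count_nonzero(np.abs(both - u_ij) <= spread / 2.0)
            inter_den += gap / spread * density_u

    sep = mean_gap_sum / (1.0 + inter_den)
    return intra_den * sep


if __name__ == "__main__":
    h = [10, 11, 12, 30, 31, 32]
    print(cdbw_score(h, [0, 0, 0, 1, 1, 1]))   # two compact, well-separated classes
    print(cdbw_score(h, [0, 1, 0, 1, 0, 1]))   # interleaved classes
```

With these assumed formulas, a partition into compact, well-separated height classes scores higher than an interleaved one, which is the behaviour expected of the criterion when it is used to compare candidate classifications.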
Cite this article
Zahour, A., Taconet, B., Likforman-Sulem, L. et al. Overlapping and multi-touching text-line segmentation by Block Covering analysis. Pattern Anal Applic 12, 335–351 (2009). https://doi.org/10.1007/s10044-008-0127-9
DOI: https://doi.org/10.1007/s10044-008-0127-9