Overlapping and multi-touching text-line segmentation by Block Covering analysis

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

This paper presents a new approach to text-line segmentation based on Block Covering, which addresses the problem of overlapping and multi-touching components. Block Covering is the core of a system that processes a set of ancient Arabic documents from historical archives. The system is designed to separate text-lines even when they overlap or touch at multiple points. We exploit the Block Covering technique in three steps: a new fractal analysis (Block Counting) for document classification, a statistical analysis of block heights for block classification, and a neighboring analysis for building text-lines. The Block Counting fractal analysis, combined with a fuzzy C-means scheme, is performed on document images to classify them according to their complexity: tightly (closely) spaced documents (TSD) or widely spaced documents (WSD). An optimal Block Covering is applied to TSD documents, which include overlapping and multi-touching lines. The large blocks generated by the covering are then segmented using the statistical analysis of block heights. The final labeling into text-lines is based on a block neighboring analysis. Experimental results obtained on images from the Tunisian Historical Archives demonstrate the feasibility of the Block Covering technique for segmenting ancient Arabic documents.
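The abstract does not detail the Block Counting analysis. As a rough illustration of the kind of fractal measure involved, the sketch below implements classical box counting on a binary image, which is a stand-in and not the authors' method. The function names, the list-of-lists image format and the box sizes are assumptions made for this example.

```python
import math

def box_count(image, size):
    """Count size x size boxes that contain at least one foreground pixel."""
    rows, cols = len(image), len(image[0])
    count = 0
    for top in range(0, rows, size):
        for left in range(0, cols, size):
            if any(image[r][c]
                   for r in range(top, min(top + size, rows))
                   for c in range(left, min(left + size, cols))):
                count += 1
    return count

def box_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s).

    Assumes the image holds at least one foreground pixel,
    otherwise log(0) would be evaluated.
    """
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(image, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical 8 x 8 test images: a filled square and a single text-like row.
filled = [[1] * 8 for _ in range(8)]
line = [[1] * 8 if r == 3 else [0] * 8 for r in range(8)]

dim_filled = box_counting_dimension(filled)  # close to 2 for a filled plane
dim_line = box_counting_dimension(line)      # close to 1 for a straight line
```

A densely written page would be expected to yield a higher value than a sparsely written one. A value of this kind could then be fed to a clustering scheme such as fuzzy C-means to separate TSD from WSD documents.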

Author information

Corresponding author

Correspondence to Laurence Likforman-Sulem.

Appendix 1. Composing density between and within clusters (CDbw) criterion

CDbw is defined as the product:

$$ CDbw(vsn) = intra\_den(vsn) \times sep(vsn) $$

intra_den expresses the quality of intra-class clustering. sep is the cluster separation measure.

The terms of the product are computed as follows.

For a given number of vertical strips vsn = 1/r, let \( V_{i} = \left\{ v_{i1}, v_{i2}, \ldots, v_{in_{i}} \right\} \) be the set of blocks of the ith class, and let \( n_{i} \) be the number of blocks in this class. The standard deviation stddev(i) of the ith class is defined as:

$$ {stddev}(i) = \sqrt {\sum\limits_{{k = 1}}^{{n_{i} }} {\frac{{(h_{{ik}} - m_{i} )^{2} }}{{(n_{i} - 1)}}} } $$

with \( h_{ik} \) being the height of the kth block of the ith class and \( m_{i} \) being the average height of the blocks of the ith class.

The average stddev is:

$$ {stddev} = \sqrt {\sum\limits_{{i = 1}}^{3} {\frac{{\left\| {{stddev}(i)} \right\|^{2} }}{3}} } ; $$
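As a small numerical illustration of the two formulas above, the following sketch computes the per-class standard deviation (with the \( n_{i} - 1 \) denominator) and their root-mean-square average. The function names and the block heights are hypothetical. The number of classes is taken from the input instead of being fixed at 3.

```python
import math

def class_stddev(heights):
    """Sample standard deviation of the block heights of one class."""
    m = sum(heights) / len(heights)  # average height m_i
    return math.sqrt(sum((h - m) ** 2 for h in heights) / (len(heights) - 1))

def average_stddev(classes):
    """Root mean square of the per-class standard deviations."""
    return math.sqrt(sum(class_stddev(c) ** 2 for c in classes) / len(classes))

# Hypothetical block heights (in pixels) for three classes.
classes = [[10, 12, 14], [30, 32, 34], [60, 64, 68]]

per_class = [class_stddev(c) for c in classes]  # 2.0, 2.0 and 4.0
pooled = average_stddev(classes)                # sqrt((4 + 4 + 16) / 3)
```

Each class needs at least two blocks; with a single block the denominator \( n_{i} - 1 \) is zero.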

The quality of intra-class clustering, denoted by intra_den, is defined as:

$$ intra\_den(vsn) = \frac{1}{3}\sum\limits_{{i = 1}}^{3} {\sum\limits_{{j = 1}}^{{n_{i} }} {density(v_{{ij}} )} }; $$

with \( density(v_{ij}) \) defined as:

$$ density(v_{{ij}} ) = \sum\limits_{{l = 1}}^{{n_{i} }} {f(v_{{il}} ,v_{{ij}} )}; $$

and \( f(v_{il}, v_{ij}) \) defined as:

$$f(v_{{il}} ,v_{{ij}} ) = \left\{\begin{array}{*{20}l}1 &\quad {\text{if}}\;||h_{{il}} - h_{{ij}} ||\; \le stddev \\0 &\quad{\text{otherwise}}\\ \end{array}\right.$$
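The intra-class density can be read as a neighbor count: for each block, count the blocks of the same class (itself included) whose height differs by at most stddev, then sum over all blocks and divide by the number of classes. The sketch below follows that reading. The names, the heights and the threshold value are hypothetical.

```python
def density(heights, h_ref, radius):
    """Number of blocks whose height lies within radius of h_ref."""
    return sum(1 for h in heights if abs(h - h_ref) <= radius)

def intra_den(classes, stddev):
    """Sum of within-class densities, divided by the number of classes."""
    total = sum(density(c, h, stddev) for c in classes for h in c)
    return total / len(classes)

# Hypothetical block heights (in pixels) and an assumed average stddev of 3.
classes = [[10, 12, 14], [30, 32, 34], [60, 64, 68]]
value = intra_den(classes, 3)
# Per-block counts: class 1 -> 2 + 3 + 2, class 2 -> 2 + 3 + 2, class 3 -> 1 + 1 + 1
```

Compact classes contribute large counts, so a higher intra_den indicates tighter clusters of block heights.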

The inter-class density Inter_den measures the number of blocks lying in the close neighborhood of several classes. For a good clustering, this density should be very low. It is defined as:

$$ Inter\_den(vsn) = \sum\limits_{{i = 1}}^{3} {\sum\limits_{\begin{subarray}{l} j = 1 \\ j \neq i \end{subarray} }^{3} {\frac{{\left\| {m_{i} - m_{j} } \right\|}}{{\| {{stddev}(i) + {stddev}(j)}\|}}} } \times density(u_{{ij}} ); $$

where \( u_{ij} \) is a virtual block of height \( h_{ij} = (m_{i} + m_{j})/2 \), and \( density(u_{ij}) \) is defined as:

$$ density(u_{{ij}} ) = \sum\limits_{{k = 1}}^{{n_{i} + n_{j} }} {f(v_{k} ,u_{{ij}} )} $$

with \( v_{k} \) belonging to the union of the sets of blocks of classes i and j, and \( f(v_{k}, u_{ij}) \) defined as:

$$ f(v_{k} ,u_{{ij}} ) = \left\{ \begin{array}{*{20}l} 1&{\rm if}\ \| {h_{k} - h_{{ij}} } \| \le (\|{{stddev}(i)}\| + \| {{stddev}(j)} \|)/2, \\0& {\rm otherwise} \\ \end{array} \right. $$
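To make the inter-class density concrete, the sketch below evaluates it for lists of block heights. For each ordered pair of classes it places a virtual block midway between the two mean heights and counts the blocks of both classes that fall near it. The function name and the data are hypothetical. The code assumes every class has at least two blocks and that the summed standard deviations are nonzero.

```python
from statistics import mean, stdev  # stdev uses the n - 1 denominator

def inter_den(classes):
    """Inter-class density summed over all ordered pairs (i, j), i != j."""
    means = [mean(c) for c in classes]
    stds = [stdev(c) for c in classes]
    total = 0.0
    for i, ci in enumerate(classes):
        for j, cj in enumerate(classes):
            if i == j:
                continue
            h_ij = (means[i] + means[j]) / 2   # height of the virtual block u_ij
            radius = (stds[i] + stds[j]) / 2   # neighborhood half-width
            dens = sum(1 for h in ci + cj if abs(h - h_ij) <= radius)
            total += abs(means[i] - means[j]) / (stds[i] + stds[j]) * dens
    return total

# Hypothetical heights: well-separated classes, then two overlapping ones.
separated = inter_den([[10, 12, 14], [30, 32, 34], [60, 64, 68]])
overlapping = inter_den([[10, 12, 14], [14, 16, 18]])
```

In the separated case no block lies near any midpoint, so the density is zero. In the overlapping case four blocks lie within the neighborhood of the midpoint, and the result is positive.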

The cluster separation measure is defined as:

$$ sep(vsn) = \sum\limits_{{i = 1}}^{3} {\sum\limits_{\begin{subarray}{l} j = 1 \\ j \ne i \end{subarray} }^{3} {\frac{{\left\| {m_{i} - m_{j} } \right\|}}{{1 + Inter\_den}}} } $$
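The last two steps, the separation measure and the final product, can be sketched as follows. The function names and numeric inputs are hypothetical. The intra-class and inter-class densities are passed in as plain numbers so that the example stays independent of how they were obtained.

```python
def separation(means, inter_den):
    """Sum of |m_i - m_j| over ordered pairs i != j, damped by 1 + Inter_den."""
    pair_sum = sum(abs(mi - mj)
                   for i, mi in enumerate(means)
                   for j, mj in enumerate(means)
                   if i != j)
    return pair_sum / (1 + inter_den)

def cdbw(intra_den, sep):
    """CDbw index: intra-class density times cluster separation."""
    return intra_den * sep

# Hypothetical mean block heights (in pixels) of three classes.
means = [12, 32, 64]
sep_clean = separation(means, inter_den=0.0)  # no blocks between classes
sep_mixed = separation(means, inter_den=1.0)  # some blocks between classes
score = cdbw(17 / 3, sep_clean)               # assumed intra_den of 17/3
```

A nonzero inter-class density lowers the separation and hence the index. If CDbw is used to compare candidate strip counts, as its vsn argument suggests, a larger value would indicate a cleaner partition of block heights.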

About this article

Cite this article

Zahour, A., Taconet, B., Likforman-Sulem, L. et al. Overlapping and multi-touching text-line segmentation by Block Covering analysis. Pattern Anal Applic 12, 335–351 (2009). https://doi.org/10.1007/s10044-008-0127-9

  • DOI: https://doi.org/10.1007/s10044-008-0127-9
