Overlapping and multi-touching text-line segmentation by Block Covering analysis

Zahour, Abderrazak; Taconet, Brunco; Likforman-Sulem, Laurence; Boussellaa, Wafa

doi:10.1007/s10044-008-0127-9

Theoretical Advances
Published: 09 July 2008

Volume 12, pages 335–351, (2009)

Pattern Analysis and Applications

Abderrazak Zahour¹,
Brunco Taconet¹,
Laurence Likforman-Sulem² &
…
Wafa Boussellaa³

Abstract

This paper presents a new approach to text-line segmentation based on Block Covering, which solves the problem of overlapping and multi-touching components. Block Covering is the core of a system that processes a set of ancient Arabic documents from historical archives. The system is designed to separate text-lines even when they are overlapping and multi-touching. We exploit the Block Covering technique in three steps: a new fractal analysis (Block Counting) for document classification, a statistical analysis of block heights for block classification, and a neighborhood analysis for building text-lines. The Block Counting fractal analysis, combined with a fuzzy C-means scheme, is performed on document images in order to classify them according to their complexity: tightly spaced documents (TSD) or widely spaced documents (WSD). An optimal Block Covering is applied to TSD documents, which include overlapping and multi-touching lines. The large blocks generated by the covering are then segmented using the statistical analysis of block heights. The final labeling into text-lines is based on a block neighborhood analysis. Experimental results on images from the Tunisian Historical Archives demonstrate the feasibility of the Block Covering technique for segmenting ancient Arabic documents.
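The abstract does not spell out how Block Counting is computed, so the following is only a generic box-counting sketch of how a fractal dimension can be estimated from the ink pixels of a binarized page. It is not the authors' procedure; the function names, the box sizes, and the toy pixel sets are assumptions made for illustration.

```python
import math

def box_count(pixels, size):
    """Count the size x size boxes that contain at least one ink pixel."""
    return len({(r // size, c // size) for r, c in pixels})

def box_counting_dimension(pixels, sizes):
    """Least-squares slope of log N(s) against log(1/s) over the given box sizes."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(pixels, s)) for s in sizes]
    n = len(sizes)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Toy inputs (hypothetical): a filled 16 x 16 square and a single pixel row.
filled = {(r, c) for r in range(16) for c in range(16)}
line = {(0, c) for c in range(16)}

print(round(box_counting_dimension(filled, [1, 2, 4, 8]), 6))
print(round(box_counting_dimension(line, [1, 2, 4, 8]), 6))
```

A densely written page fills more boxes at every scale than a sparsely written one, so a slope of this kind is one plausible feature to feed into a two-class clustering step such as fuzzy C-means.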

Author information

Authors and Affiliations

IUT, Université du Havre/GED, Place Robert Schuman, 76610, Le Havre, France
Abderrazak Zahour & Brunco Taconet
TELECOM ParisTech/TSI and CNRS-LTCI, 46 rue Barrault, 75013, Paris, France
Laurence Likforman-Sulem
Université de Sfax, REGIM, ENIS Route Soukra, 3038, Sfax (BPW), Tunisia
Wafa Boussellaa

Corresponding author

Correspondence to Laurence Likforman-Sulem.

Appendix 1. Composing density between and within clusters (CDbw) criterion

CDbw is defined as the product:

$$ CDbw(vsn) = intra\_den(vsn) \times sep(vsn) $$

intra_den expresses the quality of intra-class clustering. sep is the cluster separation measure.

We now calculate the terms of the product.

For a given number of vertical strips vsn = 1/r, let $V_i = \{v_{i1}, v_{i2}, \ldots, v_{in_i}\}$ be the set of blocks of the ith class, and $n_i$ be the number of blocks in this class. The standard deviation stddev(i) of the ith class is defined as:

$$ stddev(i) = \sqrt{\sum_{k=1}^{n_i} \frac{(h_{ik} - m_i)^2}{n_i - 1}} $$

with $h_{ik}$ being the height of the kth block of the ith class and $m_i$ being the average height of the blocks of the ith class.

The average stddev is:

$$ stddev = \sqrt{\sum_{i=1}^{3} \frac{\| stddev(i) \|^2}{3}} $$
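As a worked illustration of the two formulas above, here is a minimal Python sketch. The block heights are invented for the example, and the function names are hypothetical rather than taken from the paper.

```python
import math

def class_stddev(heights):
    """stddev(i): sample standard deviation (n_i - 1 denominator) of one class."""
    n = len(heights)
    mean = sum(heights) / n
    return math.sqrt(sum((h - mean) ** 2 for h in heights) / (n - 1))

def average_stddev(classes):
    """Root mean square of the per-class standard deviations."""
    return math.sqrt(sum(class_stddev(c) ** 2 for c in classes) / len(classes))

# Hypothetical block heights (in pixels) for three classes of blocks.
block_classes = [[8, 10, 12], [20, 22, 24], [40, 44, 48]]

per_class = [class_stddev(c) for c in block_classes]
overall = average_stddev(block_classes)
print(per_class, overall)
```

Each class needs at least two blocks, since the denominator is n_i - 1.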

The quality of intra-class clustering, denoted by intra_den is defined as:

$$ intra\_den(vsn) = \frac{1}{3} \sum_{i=1}^{3} \sum_{j=1}^{n_i} density(v_{ij}) $$

with density($v_{ij}$) defined as:

$$ density(v_{ij}) = \sum_{l=1}^{n_i} f(v_{il}, v_{ij}) $$

and $f(v_{il}, v_{ij})$ defined as:

$$ f(v_{il}, v_{ij}) = \begin{cases} 1 & \text{if } \| h_{il} - h_{ij} \| \le stddev \\ 0 & \text{otherwise} \end{cases} $$
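The intra-class density can be sketched in a few lines of Python. This follows the formulas as printed here (each block counts the blocks of its own class whose height lies within the average stddev, itself included). The sample heights and function names are assumptions for illustration.

```python
import math
import statistics

def intra_den(classes, stddev):
    """Sum density(v_ij) over all blocks, divided by the number of classes."""
    total = 0
    for heights in classes:
        for h_j in heights:
            # density(v_ij): blocks of the same class within stddev of h_j
            total += sum(1 for h_l in heights if abs(h_l - h_j) <= stddev)
    return total / len(classes)

# Hypothetical block heights (in pixels) for three classes of blocks.
block_classes = [[8, 10, 12], [20, 22, 24], [40, 44, 48]]

# Average stddev: root mean square of the per-class sample deviations.
stddev = math.sqrt(
    sum(statistics.stdev(c) ** 2 for c in block_classes) / len(block_classes)
)
value = intra_den(block_classes, stddev)
print(value)
```

Tighter classes give higher densities, so a larger intra_den indicates more compact height classes.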

The interclass density Inter_den is defined as the number of blocks lying in the close neighborhood of several classes. This density should be very low. It is defined as:

$$ Inter\_den(vsn) = \sum_{i=1}^{3} \sum_{\substack{j=1 \\ j \ne i}}^{3} \frac{\| m_i - m_j \|}{\| stddev(i) + stddev(j) \|} \times density(u_{ij}) $$

where $u_{ij}$ is a virtual block of height $h_{ij} = (m_i + m_j)/2$.

density($u_{ij}$) is defined as:

$$ density(u_{ij}) = \sum_{k=1}^{n_i + n_j} f(v_k, u_{ij}) $$

with $v_k$ belonging to the union of the sets of blocks of classes i and j.

$f(v_k, u_{ij})$ is defined as:

$$ f(v_k, u_{ij}) = \begin{cases} 1 & \text{if } \| h_k - h_{ij} \| \le (\| stddev(i) \| + \| stddev(j) \|)/2 \\ 0 & \text{otherwise} \end{cases} $$
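A minimal Python sketch of the interclass density follows, under the assumption that every class has at least two blocks and that stddev(i) + stddev(j) is never zero. The sample heights and function names are hypothetical.

```python
import statistics

def density_between(heights_i, heights_j, midpoint, radius):
    """density(u_ij): blocks of classes i and j within radius of the virtual block."""
    return sum(1 for h in heights_i + heights_j if abs(h - midpoint) <= radius)

def inter_den(classes):
    """Sum over ordered pairs (i, j), i != j, of the weighted midpoint density."""
    means = [statistics.mean(c) for c in classes]
    devs = [statistics.stdev(c) for c in classes]  # sample deviation, n_i - 1
    total = 0.0
    for i in range(len(classes)):
        for j in range(len(classes)):
            if i == j:
                continue
            midpoint = (means[i] + means[j]) / 2   # height of virtual block u_ij
            radius = (devs[i] + devs[j]) / 2
            weight = abs(means[i] - means[j]) / (devs[i] + devs[j])
            total += weight * density_between(classes[i], classes[j], midpoint, radius)
    return total

# Hypothetical data: well separated classes, then two classes that overlap.
separated = [[8, 10, 12], [20, 22, 24], [40, 44, 48]]
overlapping = [[8, 10, 12], [12, 14, 16], [40, 44, 48]]

print(inter_den(separated), inter_den(overlapping))
```

For the separated classes no block falls near any midpoint, so the density is zero. The overlapping pair contributes blocks around its midpoint, which raises Inter_den.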

The cluster separation measure is defined as:

$$ sep(vsn) = \sum_{i=1}^{3} \sum_{\substack{j=1 \\ j \ne i}}^{3} \frac{\| m_i - m_j \|}{1 + Inter\_den(vsn)} $$
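The separation measure and the final product can be sketched as below. To keep the example short, the class means, the Inter_den value and the intra_den value are passed in as plain numbers. The figures used are invented, and the function names are hypothetical.

```python
def sep(means, inter_den_value):
    """Sum over ordered pairs (i, j), i != j, of |m_i - m_j| / (1 + Inter_den)."""
    total = 0.0
    for i, m_i in enumerate(means):
        for j, m_j in enumerate(means):
            if i != j:
                total += abs(m_i - m_j) / (1 + inter_den_value)
    return total

def cdbw(intra_den_value, sep_value):
    """CDbw(vsn) = intra_den(vsn) * sep(vsn)."""
    return intra_den_value * sep_value

# Hypothetical class means (average block heights) and density values.
class_means = [10, 22, 44]
well_separated = sep(class_means, 0.0)
partly_mixed = sep(class_means, 1.0)

print(well_separated, partly_mixed, cdbw(17 / 3, well_separated))
```

Because Inter_den appears in the denominator, blocks lying between classes lower the separation and therefore the CDbw value for that number of vertical strips.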

About this article

Cite this article

Zahour, A., Taconet, B., Likforman-Sulem, L. et al. Overlapping and multi-touching text-line segmentation by Block Covering analysis. Pattern Anal Applic 12, 335–351 (2009). https://doi.org/10.1007/s10044-008-0127-9

Received: 25 October 2007
Accepted: 12 April 2008
Published: 09 July 2008
Issue Date: December 2009
DOI: https://doi.org/10.1007/s10044-008-0127-9
