Efficient and flexible text extraction from document pages

Parodi, Pietro; Fontana, Roberto

doi:10.1007/s100320050038

Original papers
Published: December 1999

Volume 2, pages 67–79, (1999)
International Journal on Document Analysis and Recognition

Abstract.

This paper describes a novel method for extracting text from document pages of mixed content. The method detects pieces of text lines in small overlapping columns of width \(w'\), shifted with respect to each other by \(\epsilon < w'\) image elements (good default values are \(\epsilon = 1\%\) of the image width and \(w' = 2\epsilon\)), and merges these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires about 1.3 s for a 300 dpi image on a PC with a 300 MHz Pentium II CPU and an Intel 440LX motherboard. It is largely independent of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the background is uniform and that the text sits approximately horizontally. For a skew of up to about 10 degrees, no skew correction mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington, and its performance has been evaluated by a suitable measure of segmentation accuracy. A detailed analysis of the segmentation accuracy achieved by the algorithm as a function of noise and skew has also been carried out.
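
The column-and-merge idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how pieces of text lines are detected inside a column or which merging criteria are used, so the sketch assumes a binary image (1 = ink), treats every maximal run of inked rows inside a strip as a piece, and merges a piece into a line when their strips overlap horizontally and their row ranges overlap vertically. All function names are hypothetical; only the defaults \(\epsilon = 1\%\) of the image width and \(w' = 2\epsilon\) come from the abstract.

```python
from typing import List, Optional, Sequence, Tuple

# (x_start, x_end, y_top, y_bottom); end coordinates are exclusive.
Box = Tuple[int, int, int, int]


def column_strips(width: int, eps: int, w: int) -> List[Tuple[int, int]]:
    """Overlapping strips [x, x + w), shifted by eps, clipped to the image."""
    if not 0 < eps < w:
        raise ValueError("need 0 < eps < w so that neighbouring strips overlap")
    return [(x, min(x + w, width)) for x in range(0, max(width - eps, 1), eps)]


def pieces_in_strip(img: Sequence[Sequence[int]], x0: int, x1: int) -> List[Box]:
    """Assumed detector: maximal runs of rows containing ink within the strip."""
    pieces: List[Box] = []
    top: Optional[int] = None
    for y, row in enumerate(img):
        has_ink = any(row[x0:x1])
        if has_ink and top is None:
            top = y                      # a run of inked rows starts here
        elif not has_ink and top is not None:
            pieces.append((x0, x1, top, y))
            top = None
    if top is not None:                  # run reaches the bottom of the image
        pieces.append((x0, x1, top, len(img)))
    return pieces


def merge_pieces(strip_pieces: List[List[Box]]) -> List[Box]:
    """Bottom-up merge, scanning strips left to right.

    Each line is stored as [x0, x1, y0, y1, last_top, last_bottom]. Matching
    against the most recently added piece (not the whole bounding box) lets a
    line drift vertically, which tolerates a small skew.
    """
    lines: List[List[int]] = []
    for pieces in strip_pieces:
        for x0, x1, top, bottom in pieces:
            for line in lines:
                strips_overlap = line[1] > x0
                rows_overlap = min(line[5], bottom) - max(line[4], top) > 0
                if strips_overlap and rows_overlap:
                    line[1] = max(line[1], x1)
                    line[2] = min(line[2], top)
                    line[3] = max(line[3], bottom)
                    line[4], line[5] = top, bottom
                    break
            else:
                lines.append([x0, x1, top, bottom, top, bottom])
    boxes = [(l[0], l[1], l[2], l[3]) for l in lines]
    return sorted(boxes, key=lambda b: (b[2], b[0]))


def extract_text_lines(img: Sequence[Sequence[int]],
                       eps: Optional[int] = None,
                       w: Optional[int] = None) -> List[Box]:
    """Text-line boxes at strip resolution; defaults follow the abstract."""
    width = len(img[0])
    eps = eps or max(1, round(0.01 * width))   # epsilon = 1% of image width
    w = w or 2 * eps                           # w' = 2 * epsilon
    strips = column_strips(width, eps, w)
    return merge_pieces([pieces_in_strip(img, x0, x1) for x0, x1 in strips])
```

Because neighbouring strips overlap, a text line produces a chain of pieces that link up automatically, while a strip that falls entirely in blank space breaks the chain and separates adjacent text columns. The resulting boxes are aligned to strip boundaries rather than to the ink itself; grouping lines into blocks, handling noise, and distinguishing text from non-text regions are outside this sketch.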

Author information

Authors and Affiliations

International School for Advanced Studies, Via Beirut 2-4, I-34014 Trieste, Italy; e-mail: parodi@sissa.it
Pietro Parodi & Roberto Fontana

Additional information

Received April 4, 1999 / Revised June 1, 1999

About this article

Cite this article

Parodi, P., Fontana, R. Efficient and flexible text extraction from document pages. IJDAR 2, 67–79 (1999). https://doi.org/10.1007/s100320050038

Issue Date: December 1999
DOI: https://doi.org/10.1007/s100320050038

Key words: Text extraction – Document segmentation – Computational complexity – Segmentation accuracy
