Stop word location and identification for adaptive text recognition

  • Original papers
International Journal on Document Analysis and Recognition

Abstract. We propose a new adaptive strategy for text recognition that attempts to derive knowledge about the dominant font on a given page. The strategy uses a linguistic observation that over half of all words in a typical English passage are contained in a small set of fewer than 150 stop words. A small dictionary of such words is compiled from the Brown corpus. An arbitrary text page first goes through layout analysis that produces word segmentation. A fast procedure is then applied to locate the most likely candidates for those words, using only the widths of the word images. The identity of each word is determined using a word shape classifier. Using the word images together with their identities, character prototypes can be extracted using a previously proposed method. We describe experiments using simulated and real images. In an experiment using 400 real page images, we show that on average, eight distinct characters can be learned from each page, and the method is successful on 90% of all the pages. These learned characters can serve as useful seeds to bootstrap font learning.
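
The linguistic observation behind the strategy can be checked on any passage by measuring what fraction of its word tokens fall in a stop-word set. The following is a minimal sketch, not the paper's method: the function name `stop_word_coverage`, the tokenization rule, and the 20-word `STOP_WORDS` list are assumptions made for illustration. The paper's own dictionary is compiled from the Brown corpus and is not reproduced here.

```python
import re

# Illustrative stop-word list only (an assumption for this sketch).
# The paper compiles its own dictionary of fewer than 150 words
# from the Brown corpus; that list is not reproduced here.
STOP_WORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "his", "on", "be", "at", "by", "i",
})


def stop_word_coverage(text, stop_words=STOP_WORDS):
    """Return the fraction of word tokens in `text` that are stop words."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)
```

For example, `stop_word_coverage("The cat sat on the mat")` returns 0.5, since three of the six tokens ("the", "on", "the") are in the illustrative list. With a fuller stop-word dictionary and a longer passage, the same measurement could be compared against the "over half" figure stated in the abstract.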

Additional information

Received October 8, 1999 / Revised March 29, 2000

Cite this article

Ho, T. Stop word location and identification for adaptive text recognition. IJDAR 3, 16–26 (2000). https://doi.org/10.1007/PL00013551

  • DOI: https://doi.org/10.1007/PL00013551