Abstract. We propose a new adaptive strategy for text recognition that attempts to derive knowledge about the dominant font on a given page. The strategy relies on a linguistic observation: over half of all words in a typical English passage belong to a small set of fewer than 150 stop words. A small dictionary of such words is compiled from the Brown corpus. An arbitrary text page first undergoes layout analysis, which segments it into words. A fast procedure is then applied to locate the most likely candidates for those stop words, using only the widths of the word images. The identity of each candidate word is determined using a word shape classifier. Using the word images together with their identities, character prototypes can be extracted using a previously proposed method. We describe experiments using simulated and real images. In an experiment using 400 real page images, we show that, on average, eight distinct characters can be learned from each page, and that the method is successful on 90% of the pages. The learned characters can serve as useful seeds to bootstrap font learning.
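The abstract does not specify how candidates are located from word-image widths, so the following is only a minimal sketch of one way such a filter could look, not the paper's procedure. It assumes a single estimated average character width (`char_width`) and a relative tolerance (`tolerance`); both parameters, and the function name `shortlist_by_width`, are hypothetical. A fixed character width is a crude approximation for proportional fonts.

```python
from typing import Dict, List


def shortlist_by_width(
    word_widths: List[float],
    stop_words: List[str],
    char_width: float,
    tolerance: float = 0.15,
) -> Dict[int, List[str]]:
    """Map each word-image index to the stop words whose predicted width is close.

    Assumption (not from the paper): a word's width is predicted as
    len(word) * char_width, and a match is accepted when the prediction
    lies within `tolerance` (relative) of the measured width.
    """
    candidates: Dict[int, List[str]] = {}
    for idx, width in enumerate(word_widths):
        matches = [
            word
            for word in stop_words
            if abs(len(word) * char_width - width) <= tolerance * width
        ]
        if matches:  # keep only word images with at least one plausible stop word
            candidates[idx] = matches
    return candidates


# Illustrative widths in pixels for three segmented word images.
shortlist = shortlist_by_width(
    word_widths=[30.0, 20.0, 90.0],
    stop_words=["the", "of", "and", "to"],
    char_width=10.0,
)
```

In this sketch the shortlist only narrows the search. Width alone cannot separate "the" from "and", which is why the abstract's next step, a word shape classifier, is needed to determine each candidate's identity.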
Additional information
Received October 8, 1999 / Revised March 29, 2000
Cite this article
Ho, T. Stop word location and identification for adaptive text recognition. IJDAR 3, 16–26 (2000). https://doi.org/10.1007/PL00013551