An experimental evaluation of OCR text representations for learning document classifiers

International Journal on Document Analysis and Recognition

Abstract.

In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
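To make the contrast between word-based and n-gram-based representations concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the seven representations, so the function names `char_ngrams` and `jaccard`, the underscore padding, the choice of n = 3, the Jaccard overlap measure, and the sample OCR errors are all assumptions made for illustration.

```python
def char_ngrams(text, n=3):
    """Return overlapping character n-grams for each whitespace-separated word.

    Words are lowercased and padded with '_' (an assumed convention) so that
    word boundaries also contribute n-grams.
    """
    grams = []
    for word in text.lower().split():
        padded = f"_{word}_"
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def jaccard(a, b):
    """Jaccard overlap between two collections of features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical example: the same phrase as clean text and with OCR-style errors.
clean = "invoice payment"
ocr = "invo1ce payrnent"  # 'i' read as '1', 'm' read as 'rn' (made-up errors)

word_overlap = jaccard(clean.split(), ocr.split())
gram_overlap = jaccard(char_ngrams(clean), char_ngrams(ocr))

print("word overlap:  ", word_overlap)
print("n-gram overlap:", gram_overlap)
```

In this toy case the two texts share no whole words, yet they still share several character trigrams, which is one intuition for why n-gram features may tolerate recognition errors. It says nothing about the classification results reported in the paper.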

Additional information

Received February 17, 1998 / Revised April 8, 1998

About this article

Cite this article

Junker, M., Hoch, R. An experimental evaluation of OCR text representations for learning document classifiers. IJDAR 1, 116–122 (1998). https://doi.org/10.1007/s100320050012

  • DOI: https://doi.org/10.1007/s100320050012
