Abstract.
Many feature types have been proposed in the literature for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been carried out. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique that is competitive even with techniques relying on morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
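To make the contrast between n-gram and single-word representations concrete, the following is a minimal sketch in Python. It is not the authors' implementation, and the abstract does not specify which seven representations were evaluated. The function names (`char_ngrams`, `word_counts`, `overlap`), the choice of character trigrams, and the sample strings with their simulated OCR errors are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing runs of whitespace (illustrative preprocessing only)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_counts(text: str) -> Counter:
    """Count whitespace-separated, lowercased words."""
    return Counter(text.lower().split())


def overlap(a: Counter, b: Counter) -> float:
    """Fraction of the feature occurrences in `a` that also occur in `b`."""
    total = sum(a.values())
    shared = sum((a & b).values())  # multiset intersection
    return shared / total if total else 0.0


# Hypothetical example: the same phrase as clean text and with two
# simulated OCR errors ("i" read as "1", "m" read as "rn").
clean = "invoice payment overdue"
noisy = "invo1ce payrnent overdue"

word_overlap = overlap(word_counts(clean), word_counts(noisy))
trigram_overlap = overlap(char_ngrams(clean), char_ngrams(noisy))
print(f"word overlap:    {word_overlap:.2f}")
print(f"trigram overlap: {trigram_overlap:.2f}")
```

In this toy case a single misrecognized character destroys the whole word feature but only the few trigrams that span it, so more of the trigram features survive. That is one commonly cited motivation for n-grams on noisy text; it is not a result reported in the abstract.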
Received February 17, 1998 / Revised April 8, 1998
Junker, M., Hoch, R. An experimental evaluation of OCR text representations for learning document classifiers. IJDAR 1, 116–122 (1998). https://doi.org/10.1007/s100320050012
DOI: https://doi.org/10.1007/s100320050012