An experimental evaluation of OCR text representations for learning document classifiers

Junker, Markus; Hoch, Rainer

doi:10.1007/s100320050012

Published: July 1998

Volume 1, pages 116–122, (1998)

International Journal on Document Analysis and Recognition

Markus Junker¹ &
Rainer Hoch²

Abstract.

In the literature, many feature types are proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been done. In particular, evaluations on OCR documents are very rare. In this paper we investigate seven text representations based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that the use of n-grams is an attractive technique which can even compare to techniques relying on a morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
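The contrast between word-based and n-gram-based representations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the trigram length, the `_` boundary padding, and the example OCR error are all assumptions chosen for illustration. It shows why character n-grams may degrade more gracefully than whole words when OCR misreads a character.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams within each lowercased word."""
    counts: Counter = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # boundary marker; an illustrative choice
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def word_features(text: str) -> Counter:
    """Count whole lowercased words (no stemming or morphological analysis)."""
    return Counter(text.lower().split())


def overlap(a: Counter, b: Counter) -> float:
    """Fraction of the features in `a` that also occur in `b`."""
    shared = sum((a & b).values())  # multiset intersection
    total = sum(a.values())
    return shared / total if total else 0.0


clean = "invoice payment"
noisy = "invoice payrnent"  # hypothetical OCR error: "m" read as "rn"

word_overlap = overlap(word_features(clean), word_features(noisy))
ngram_overlap = overlap(char_ngrams(clean), char_ngrams(noisy))
print(f"word overlap:    {word_overlap:.2f}")
print(f"trigram overlap: {ngram_overlap:.2f}")
```

In this toy case the single misread character removes one of the two word features (overlap 0.50), while 11 of the 14 trigrams survive (overlap about 0.79). Whether that translates into better classification accuracy is the empirical question the paper evaluates.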

Author information

Authors and Affiliations

German Research Center for Artificial Intelligence (DFKI GmbH), P.O. 2080, D-67608 Kaiserslautern, Germany
Markus Junker
SAP AG, Basis Systems & Services, Neurottstrasse 16, D-69190 Walldorf, Germany
Rainer Hoch

Additional information

Received February 17, 1998 / Revised April 8, 1998

About this article

Cite this article

Junker, M., Hoch, R. An experimental evaluation of OCR text representations for learning document classifiers. IJDAR 1, 116–122 (1998). https://doi.org/10.1007/s100320050012

Issue Date: July 1998
DOI: https://doi.org/10.1007/s100320050012

Key words: Document classification – Feature selection – Learning – OCR – N-grams
