DOI: 10.1145/1600193.1600237
research-article

A panlingual anomalous text detector

Published: 16 September 2009

Abstract

In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
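
The abstract names the technique but gives no detail, so the following is only a minimal sketch of what an adaptive mixture of character-level N-gram models could look like, not the authors' implementation. It assumes bigram models with add-one smoothing (a simple stand-in; the paper's author tags mention Witten-Bell and PPM), one model per language, and mixture weights re-estimated per text segment with EM. The names `CharNgramModel`, `fit_weights`, and `score_segment`, and the alphabet size of 256, are hypothetical choices made for illustration.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative stand-in)."""

    def __init__(self, text, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size of the character set
        padded = " " * (order - 1) + text
        spans = range(len(padded) - order + 1)
        self.ngrams = Counter(padded[i:i + order] for i in spans)
        self.contexts = Counter(padded[i:i + order - 1] for i in spans)

    def prob(self, context, char):
        # Add-one smoothing keeps unseen n-grams from getting zero probability.
        return (self.ngrams[context + char] + 1) / (
            self.contexts[context] + self.alphabet_size
        )


def char_probs(models, text):
    """Per-character probabilities under each model (one row per character)."""
    order = models[0].order
    padded = " " * (order - 1) + text
    return [
        [m.prob(padded[i:i + order - 1], padded[i + order - 1]) for m in models]
        for i in range(len(text))
    ]


def fit_weights(rows, iterations=20):
    """EM for the mixture weights, adapted to the segment being scored."""
    k = len(rows[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        totals = [0.0] * k
        for row in rows:
            mix = sum(w * p for w, p in zip(weights, row))
            for j in range(k):
                totals[j] += weights[j] * row[j] / mix  # responsibility of model j
        weights = [t / len(rows) for t in totals]
    return weights


def score_segment(models, text):
    """Return (average log-probability per character, fitted weights)."""
    if not text:
        raise ValueError("cannot score an empty segment")
    rows = char_probs(models, text)
    weights = fit_weights(rows)
    log_prob = sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in rows
    )
    return log_prob / len(rows), weights
```

Under these assumptions, a segment whose average log-probability per character falls below some threshold would be flagged as anomalous. Because the weights are fitted per segment, mixed-language text can draw on several models at once. The threshold itself would have to be tuned on held-out data.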

Cited By

  • (2018) Unsupervised profiling of OCRed historical documents. Pattern Recognition 46:5 (1346-1357). DOI: 10.1016/j.patcog.2012.10.002. Online publication date: 30-Dec-2018
  • (2014) Estimating and rating the quality of optically character recognised text. Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (97-102). DOI: 10.1145/2595188.2595214. Online publication date: 19-May-2014
  • (2012) Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies 5:2 (1-157). DOI: 10.2200/S00436ED1V01Y201207HLT017. Online publication date: 24-Sep-2012

Published In

DocEng '09: Proceedings of the 9th ACM symposium on Document engineering
September 2009
264 pages
ISBN: 9781605585758
DOI: 10.1145/1600193
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. garbage strings
  2. language identification
  3. mixture models
  4. ppm
  5. text quality
  6. witten-bell

Qualifiers

  • Research-article

Conference

DocEng '09
DocEng '09: ACM Symposium on Document Engineering
September 16 - 18, 2009
Munich, Germany

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
