ABSTRACT
In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
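As a rough illustration of the idea named in the abstract, the following sketch scores a text segment with a mixture of character-level N-gram models whose weights are adapted to the segment. It is not the authors' implementation: the names `CharNgramModel` and `score_segment`, the add-one smoothing, the placeholder `alphabet_size` of 256, the EM-style reweighting, and the toy training strings are all assumptions made for this example.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character-level N-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, n=3, alphabet_size=256):
        self.n = n
        self.alphabet_size = alphabet_size  # assumed placeholder, not a real script inventory
        self.context_counts = Counter()
        self.ngram_counts = Counter()
        padded = " " * (n - 1) + text
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def prob(self, context, char):
        """Smoothed probability of `char` given the last n-1 characters of `context`."""
        context = context[-(self.n - 1):] if self.n > 1 else ""
        numerator = self.ngram_counts[context + char] + 1
        denominator = self.context_counts[context] + self.alphabet_size
        return numerator / denominator


def score_segment(segment, models, em_iterations=10):
    """Return (mean log-probability per character, adapted mixture weights)."""
    if not segment:
        raise ValueError("segment must be non-empty")
    n_max = max(m.n for m in models)
    padded = " " * (n_max - 1) + segment
    # Probability of each character under each component model.
    probs = [
        [m.prob(padded[:i], padded[i]) for m in models]
        for i in range(n_max - 1, len(padded))
    ]
    # Start from uniform weights and re-estimate them on this segment with EM.
    weights = [1.0 / len(models)] * len(models)
    for _ in range(em_iterations):
        totals = [0.0] * len(models)
        for row in probs:
            mix = sum(w * p for w, p in zip(weights, row))
            for k, (w, p) in enumerate(zip(weights, row)):
                totals[k] += w * p / mix  # responsibility of model k for this character
        weights = [t / len(probs) for t in totals]
    log_prob = sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in probs
    )
    return log_prob / len(probs), weights


# Toy training data (hypothetical); real models would be trained on large corpora.
english = CharNgramModel("the quick brown fox jumps over the lazy dog. " * 20)
german = CharNgramModel("der schnelle braune fuchs springt in den dunklen wald. " * 20)

clean_score, clean_weights = score_segment(
    "the quick brown fox jumps over the lazy dog.", [english, german]
)
noisy_score, noisy_weights = score_segment(
    "tlie q~ick br0wn f0x j#mps", [english, german]
)
print("clean:", clean_score, clean_weights)
print("noisy:", noisy_score, noisy_weights)
```

In this sketch a segment whose per-character score falls well below that of typical clean text would be a candidate for flagging. Because the weights are re-estimated on each segment, mixed-language material can draw on several component models at once. Any real threshold would have to be calibrated on held-out data.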
Index Terms
- A panlingual anomalous text detector