A panlingual anomalous text detector

Author: Ashok C. Popat

DocEng '09: Proceedings of the 9th ACM symposium on Document engineering

Pages 201 - 204

https://doi.org/10.1145/1600193.1600237

Published: 16 September 2009

Abstract

In a large-scale book scanning operation, material can vary widely in language, script, genre, domain, print quality, and other factors, giving rise to a corresponding variability in the OCRed text. It is often desirable to automatically detect errorful and otherwise anomalous text segments, so that they can be filtered out or appropriately flagged, for such applications as indexing, mining, analyzing, displaying, and selectively re-processing such data. Moreover, it is advantageous to require that the automated detector be independent of the underlying OCR engine (or engines), that it work over a broad range of languages, that it seamlessly handle mixed-language material, and that it accommodate documents that contain domain-specific and otherwise rare terminology. A technique is presented that satisfies these requirements, using an adaptive mixture of character-level N-gram language models. Its design, training, implementation, and evaluation are described within the context of high-volume book scanning.
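
The abstract names the core technique (an adaptive mixture of character-level N-gram language models) but gives no implementation details. The following is only a minimal sketch of that general idea, not the paper's design. It makes several assumptions that do not come from the source: add-k smoothing, a fixed alphabet size of 256 for smoothing, mixture weights re-estimated per segment with a few EM iterations, and the average per-character log-probability as the anomaly score. All class and function names are hypothetical.

```python
import math
from collections import defaultdict


class CharNGramModel:
    """Character n-gram model with add-k smoothing (illustrative assumption)."""

    def __init__(self, order=3, k=0.1, vocab_size=256):
        self.order = order
        self.k = k
        self.vocab_size = vocab_size  # assumed alphabet size, used only for smoothing
        self.context_counts = defaultdict(int)
        self.ngram_counts = defaultdict(int)

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * (self.order - 1) + text
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def prob(self, context, char):
        # Keep only the last (order - 1) characters of the context.
        context = context[-(self.order - 1):] if self.order > 1 else ""
        num = self.ngram_counts.get(context + char, 0) + self.k
        den = self.context_counts.get(context, 0) + self.k * self.vocab_size
        return num / den


def char_probs(models, text):
    """For each character, the probability assigned by each component model."""
    order = max(m.order for m in models)
    padded = " " * (order - 1) + text
    return [
        [m.prob(padded[i - order + 1:i], padded[i]) for m in models]
        for i in range(order - 1, len(padded))
    ]


def adapt_weights(rows, n_models, iterations=20):
    """Re-estimate mixture weights for one segment with EM."""
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        totals = [0.0] * n_models
        for row in rows:
            mix = sum(w * p for w, p in zip(weights, row))
            for j, p in enumerate(row):
                totals[j] += weights[j] * p / mix  # responsibility of model j
        weights = [t / len(rows) for t in totals]
    return weights


def score(models, text):
    """Return (average log-probability per character, adapted weights)."""
    if not text:
        raise ValueError("text must be non-empty")
    rows = char_probs(models, text)
    weights = adapt_weights(rows, len(models))
    log_prob = sum(
        math.log(sum(w * p for w, p in zip(weights, row))) for row in rows
    )
    return log_prob / len(rows), weights
```

In this sketch, one component model would be trained per language, and a segment whose score falls below some threshold would be flagged as anomalous. The threshold is a further assumption and would need tuning on held-out data. Because the weights are adapted per segment, a segment mixing two languages can still score well as long as each character is well explained by at least one component.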

Cited By

Reffle U, Ringlstetter C (2018). Unsupervised profiling of OCRed historical documents. Pattern Recognition 46:5 (1346-1357). DOI: 10.1016/j.patcog.2012.10.002. Online publication date: 30-Dec-2018
https://dl.acm.org/doi/10.1016/j.patcog.2012.10.002
Alex B, Burns J, Antonacopoulos A, Schulz K (2014). Estimating and rating the quality of optically character recognised text. Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (97-102). DOI: 10.1145/2595188.2595214. Online publication date: 19-May-2014
https://dl.acm.org/doi/10.1145/2595188.2595214
Piotrowski M (2012). Natural Language Processing for Historical Texts. Synthesis Lectures on Human Language Technologies 5:2 (1-157). DOI: 10.2200/S00436ED1V01Y201207HLT017. Online publication date: 24-Sep-2012
https://doi.org/10.2200/S00436ED1V01Y201207HLT017

Published In

DocEng '09: Proceedings of the 9th ACM symposium on Document engineering

September 2009

264 pages

ISBN:9781605585758

DOI:10.1145/1600193

General Chair: Uwe M. Borghoff, Universität der Bundeswehr München, Germany
Program Chair: Boris Chidlovskii, Xerox Research Centre Europe, France

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

DocEng '09

DocEng '09: ACM Symposium on Document Engineering

September 16 - 18, 2009

Munich, Germany

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics

Total Citations: 3
Total Downloads: 292
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025
