Using topic models for OCR correction

Farooq, Faisal; Bhardwaj, Anurag; Govindaraju, Venu

doi:10.1007/s10032-009-0095-7

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing, or OCR correction, technique to the word recognition output. In this paper, we present two different methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the topic to generate smaller topic-specific lexicons, improving recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.
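
The idea behind the first method, topic-based lexicon reduction, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multinomial naive Bayes topic classifier with add-one smoothing and uniform priors as a stand-in categorizer, and plain edit distance as a stand-in for the word recognizer's matching score. The toy corpus `TRAIN` and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training data: topic -> clean example documents.
TRAIN = {
    "sports": ["the team won the match",
               "the player scored a goal in the match"],
    "finance": ["the bank raised the interest rate",
                "the market price of the stock fell"],
}

def train_topic_model(train):
    """Collect per-topic word counts and the shared vocabulary."""
    counts = {topic: Counter(w for doc in docs for w in doc.split())
              for topic, docs in train.items()}
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(tokens, counts, vocab):
    """Pick the topic with the highest naive Bayes log-likelihood.

    Tokens outside the vocabulary (e.g. misrecognized words) are skipped,
    so the decision rests on the words that were recognized correctly.
    """
    def score(topic):
        total = sum(counts[topic].values())
        return sum(math.log((counts[topic][w] + 1) / (total + len(vocab)))
                   for w in tokens if w in vocab)
    return max(counts, key=score)

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(tokens, lexicon):
    """Map each token to its closest lexicon entry (ties: alphabetical)."""
    ordered = sorted(lexicon)
    return [min(ordered, key=lambda w: edit_distance(tok, w))
            for tok in tokens]

def correct_with_topic(noisy_text, train):
    """Categorize the noisy text, then correct against that topic's lexicon."""
    counts, vocab = train_topic_model(train)
    tokens = noisy_text.split()
    topic = classify(tokens, counts, vocab)
    return topic, correct(tokens, set(counts[topic]))

topic, fixed = correct_with_topic("the teem won the matc", TRAIN)
print(topic, fixed)  # sports ['the', 'team', 'won', 'the', 'match']
```

In this sketch the reduced lexicon is simply the set of words seen in the predicted topic's training documents, so the corrector has fewer candidates to confuse than it would with the full vocabulary.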

Author information

Authors and Affiliations

Image and Knowledge Management, Siemens Medical Solutions, Malvern, PA, USA
Faisal Farooq
Department of Computer Science and Engineering, University at Buffalo, Buffalo, NY, USA
Anurag Bhardwaj & Venu Govindaraju

Corresponding author

Correspondence to Faisal Farooq.

About this article

Cite this article

Farooq, F., Bhardwaj, A. & Govindaraju, V. Using topic models for OCR correction. IJDAR 12, 153–164 (2009). https://doi.org/10.1007/s10032-009-0095-7

Received: 19 December 2008
Revised: 24 August 2009
Accepted: 26 August 2009
Published: 25 September 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s10032-009-0095-7
