Using topic models for OCR correction

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents remains a challenging task. Previous research has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing, or OCR correction, technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the assigned topic to generate smaller topic-specific lexicons, which improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
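
The first idea in the abstract, categorizing a document by topic and then restricting the recognizer to a smaller topic-specific lexicon, can be illustrated with a minimal sketch. This is not the authors' implementation: the topic texts, the function names (categorize, correct, topic_correct), and the n-best candidate format are all hypothetical, and a smoothed unigram (Naive Bayes style) scorer stands in for the maximum-entropy categorizer mentioned in the abstract.

```python
import math
from collections import Counter

# Hypothetical toy training text per topic (not taken from the paper or the IAM database).
TOPIC_TEXT = {
    "sports": "the team won the match after a late goal the coach praised the players",
    "finance": "the bank raised the interest rate and the market price of the bond fell",
}

# Per-topic unigram counts; the keys of each Counter act as that topic's lexicon.
TOPIC_COUNTS = {topic: Counter(text.split()) for topic, text in TOPIC_TEXT.items()}
VOCAB = set().union(*TOPIC_COUNTS.values())


def categorize(tokens):
    """Return the topic with the highest add-one-smoothed unigram log-likelihood."""
    def log_likelihood(topic):
        counts = TOPIC_COUNTS[topic]
        # The extra +1 reserves probability mass for out-of-vocabulary words.
        total = sum(counts.values()) + len(VOCAB) + 1
        return sum(math.log((counts[t] + 1) / total) for t in tokens)
    return max(TOPIC_COUNTS, key=log_likelihood)


def correct(candidates, lexicon):
    """Pick, per word position, the best-scoring candidate that is in the lexicon.

    `candidates` is a list of n-best lists of (word, score) pairs (higher is better).
    Falls back to the recognizer's top choice when no candidate is in the lexicon.
    """
    output = []
    for nbest in candidates:
        ranked = sorted(nbest, key=lambda pair: pair[1], reverse=True)
        in_lexicon = [word for word, _ in ranked if word in lexicon]
        output.append(in_lexicon[0] if in_lexicon else ranked[0][0])
    return output


def topic_correct(candidates):
    """Categorize the noisy first-pass output, then re-rank with the topic lexicon."""
    first_pass = [max(nbest, key=lambda pair: pair[1])[0] for nbest in candidates]
    topic = categorize(first_pass)
    return topic, correct(candidates, set(TOPIC_COUNTS[topic]))


# Hypothetical recognizer output: one n-best list per word image.
candidates = [
    [("the", 0.9)],
    [("teem", 0.5), ("team", 0.4)],
    [("won", 0.8), ("win", 0.1)],
    [("the", 0.9)],
    [("watch", 0.5), ("match", 0.45)],
]
topic, corrected = topic_correct(candidates)
print(topic, corrected)
```

In this toy run the noisy first pass ("the teem won the watch") is still categorized as sports, and the reduced lexicon then promotes the lower-ranked candidates "team" and "match". A real system would presumably train the topic model on a much larger corpus and combine lexicon membership with recognizer confidence rather than applying a hard filter.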

Author information

Corresponding author

Correspondence to Faisal Farooq.

About this article

Cite this article

Farooq, F., Bhardwaj, A. & Govindaraju, V. Using topic models for OCR correction. IJDAR 12, 153–164 (2009). https://doi.org/10.1007/s10032-009-0095-7

  • DOI: https://doi.org/10.1007/s10032-009-0095-7
