OCR Error Correction for Vietnamese OCR Text with Different Edit Distances

  • Conference paper
Advances in Intelligent Networking and Collaborative Systems (INCoS 2022)

Abstract

Candidate word generation by character edit operations is an important method employed in many OCR error correction approaches. In this paper, we study how character edit distance affects the performance of OCR error correction. We propose an algorithm for generating correction candidates at different edit distances. Correction candidates for both non-word and real-word errors are considered. The candidates are scored and ranked based on linguistic features and edit probability. The experiments are conducted on the VNOnDB database used in the Vietnamese online handwritten text recognition competition (VOHTR 2018). We evaluate the error correction performance at different edit distances in terms of two error metrics, character error rate (CER) and word error rate (WER). The results show that edit distances of 1 and 2 yield better correction results than higher edit distances.
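
To make the idea concrete, the following is a minimal Python sketch of candidate generation within a bounded edit distance, together with the two evaluation metrics. It is not the authors' algorithm: the function names, the toy lexicon, the reduced ASCII alphabet, and the choice of edit operations (delete, transpose, replace, insert) are all assumptions made for illustration. Real Vietnamese text would need an alphabet that covers diacritics and tone marks, and the scoring and ranking step is omitted here.

```python
from typing import Sequence, Set


def edits1(word: str, alphabet: str) -> Set[str]:
    """Return every string one character edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | transposes | replaces | inserts


def candidates(word: str, lexicon: Set[str], alphabet: str,
               max_distance: int) -> Set[str]:
    """Lexicon words reachable from `word` in at most `max_distance` edits.

    The word itself is kept when it is in the lexicon, so that real-word
    errors can later be ranked against the original token.
    """
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Expand only the strings found in the previous round.
        frontier = {e for w in frontier for e in edits1(w, alphabet)} - seen
        seen |= frontier
    return seen & lexicon


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance (insert, delete, substitute) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical toy data, chosen only to show how the candidate set grows.
LEXICON = {"trong", "trang", "trung", "tranh"}
ALPHABET = "aghnortu"
near = candidates("trqng", LEXICON, ALPHABET, max_distance=1)
far = candidates("trqng", LEXICON, ALPHABET, max_distance=2)
```

In this toy case the candidate set at distance 2 is a superset of the set at distance 1. A larger edit distance therefore raises the chance that the correct word is among the candidates, but it also enlarges the set that the ranking step has to discriminate between.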

Notes

  1. https://catalog.ldc.upenn.edu/LDC2006T13

Acknowledgments

The authors would like to thank Van Lang University, Vietnam for funding this work. This work was also supported by VSB-TU Ostrava, Czech Republic, through the SGS grants no. SP2022/12 and SP2022/77.

Author information

Correspondence to Pavel Kromer.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Nguyen, QD., Phan, NM., Kromer, P. (2022). OCR Error Correction for Vietnamese OCR Text with Different Edit Distances. In: Barolli, L., Miwa, H. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2022. Lecture Notes in Networks and Systems, vol 527. Springer, Cham. https://doi.org/10.1007/978-3-031-14627-5_13
