Abstract
Candidate word generation by character edit operations is a widely used technique in OCR error correction. In this paper, we study how the character edit distance affects the performance of OCR error correction. We propose an algorithm for generating correction candidates at different edit distances, covering both non-word and real-word errors. The candidates are scored and ranked based on linguistic features and edit probability. Experiments are conducted on the VNOnDB database used in the Vietnamese online handwritten text recognition competition (VOHTR 2018). We evaluate error correction performance at different edit distances using two error metrics, character error rate (CER) and word error rate (WER). The results show that edit distances of 1 and 2 yield better correction results than higher edit distances.
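To make the ideas in the abstract concrete, the following is a minimal sketch of candidate generation within a maximum edit distance, together with CER and WER computed from the Levenshtein distance. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the alphabet, the toy lexicon, the function names, and the choice of edit operations (deletion, transposition, substitution, insertion) are all hypothetical, and the scoring and ranking step is omitted.

```python
from typing import Sequence, Set

# Hypothetical toy alphabet and lexicon (assumptions for illustration only);
# a real system would use the full Vietnamese character set and a dictionary.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
LEXICON = {"cat", "cart", "chat", "coat", "cast"}


def edits1(word: str, alphabet: str = ALPHABET) -> Set[str]:
    """Return all strings one character edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | transposes | substitutes | inserts


def candidates(word: str, max_distance: int, lexicon: Set[str] = LEXICON) -> Set[str]:
    """Return lexicon words reachable from `word` in at most `max_distance` edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_distance):
        # Expand only the strings discovered in the previous round.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    # The word itself is kept if it is in the lexicon (real-word error case).
    return seen & lexicon


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

In this sketch, raising `max_distance` from 1 to 2 enlarges the candidate set considerably: more true corrections become reachable, but so do more implausible ones that the ranking step must then reject.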
Acknowledgments
The authors would like to thank Van Lang University, Vietnam for funding this work. This work was also supported by VSB-TU Ostrava, Czech Republic, through the SGS grants no. SP2022/12 and SP2022/77.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nguyen, QD., Phan, NM., Kromer, P. (2022). OCR Error Correction for Vietnamese OCR Text with Different Edit Distances. In: Barolli, L., Miwa, H. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2022. Lecture Notes in Networks and Systems, vol 527. Springer, Cham. https://doi.org/10.1007/978-3-031-14627-5_13
DOI: https://doi.org/10.1007/978-3-031-14627-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14626-8
Online ISBN: 978-3-031-14627-5
eBook Packages: Intelligent Technologies and Robotics (R0)