OCR error correction using correction patterns and self-organizing migrating algorithm

Nguyen, Quoc-Dung; Le, Duc-Anh; Phan, Nguyet-Minh; Zelinka, Ivan

doi:10.1007/s10044-020-00936-y

Theoretical Advances
Published: 23 November 2020

Volume 24, pages 701–721 (2021)

Pattern Analysis and Applications

Quoc-Dung Nguyen^1,4 (ORCID: orcid.org/0000-0003-1580-9032),
Duc-Anh Le^2,5,
Nguyet-Minh Phan^3 &
Ivan Zelinka^4

Abstract

Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques introduce various kinds of errors into OCR output. Post-processing is an essential step in improving the output quality of OCR systems by detecting and correcting these errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach to OCR post-processing error correction that uses correction pattern edits and an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches when evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.
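To make the idea of a correction pattern table concrete, the following is a minimal sketch and not the authors' implementation. It assumes the training data is available as (OCR word, ground-truth word) pairs, aligns each pair with Python's standard `difflib.SequenceMatcher`, and counts every non-matching span as a pattern. The function names `learn_patterns` and `generate_candidates` and the toy word pairs are hypothetical. Candidates are weighted by raw pattern frequency only; the self-organizing migrating algorithm and the linguistic fitness function described in the abstract are not reproduced here.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_patterns(pairs):
    """Count (ocr_substring, truth_substring) edits observed in aligned word pairs."""
    table = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # tag is 'replace', 'delete', or 'insert';
                # an empty string marks the side that has no characters.
                table[(ocr[i1:i2], truth[j1:j2])] += 1
    return table


def generate_candidates(word, table):
    """Apply each learned pattern once at every position where it matches."""
    candidates = Counter()
    for (wrong, right), count in table.items():
        if not wrong:
            # Pure insertions would need a position model; skipped in this sketch.
            continue
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            candidates[candidate] += count
            start = word.find(wrong, start + 1)
    return candidates


# Hypothetical training pairs (OCR output, ground truth), for illustration only.
training_pairs = [
    ("tbe", "the"),
    ("tbat", "that"),
    ("rnodern", "modern"),
    ("c1ear", "clear"),
]

table = learn_patterns(training_pairs)
print(table.most_common())
print(generate_candidates("tbere", table).most_common())
print(generate_candidates("rnorning", table).most_common())
```

In a fuller system, each generated candidate would then be scored against lexical and contextual evidence rather than pattern frequency alone, which is the role the abstract assigns to the fitness function.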

Notes

http://opennmt.net.
Source: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz.
Extracted from the English monograph 19.txt in the evaluation dataset.
Extracted from the English monograph 3.txt in the evaluation dataset.
https://sites.google.com/view/icdar2017-postcorrectionocr/dataset, last accessed on 3 May 2019.
https://bit.ly/2BLsN7B.
Evaluation scripts: https://git.univ-lr.fr/gchiro01/icdar2017/tree/master.

Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

1. Van Lang University, 45 Nguyen Khac Nhu, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
Quoc-Dung Nguyen
2. Center for Open Data in the Humanities, Tokyo, 101-8430, Japan
Duc-Anh Le
3. University of Information Technology, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
Nguyet-Minh Phan
4. Department of Computer Science, FEECS VŠB - Technical University of Ostrava, 17. listopadu 15, 708 33, Ostrava-Poruba, Czech Republic
Quoc-Dung Nguyen & Ivan Zelinka
5. NTT Hi-Tech Institute, Nguyen Tat Thanh University, 300A Nguyen Tat Thanh, District 4, Ho Chi Minh City, Vietnam
Duc-Anh Le

Corresponding author

Correspondence to Quoc-Dung Nguyen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: List of installed packages

About this article

Cite this article

Nguyen, QD., Le, DA., Phan, NM. et al. OCR error correction using correction patterns and self-organizing migrating algorithm. Pattern Anal Applic 24, 701–721 (2021). https://doi.org/10.1007/s10044-020-00936-y

Received: 12 October 2019
Accepted: 29 October 2020
Published: 23 November 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s10044-020-00936-y

Keywords
