Abstract
In this paper, we introduce a set of detailed experiments using Support Vector Machines (SVM) to improve the accuracy of selecting the correct candidate word when correcting OCR-generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google Web 1T corpus and use them to train and test an SVM model that can then generalize to the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with min-max normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest future improvements.
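To make the two preprocessing steps named above concrete, the following is a minimal sketch of min-max normalization followed by SMOTE-style oversampling, written in plain Python. It is not the authors' implementation. The two features (edit distance and corpus frequency), the sample values, the choice of the correct candidate as the minority class, and the function names `min_max_normalize` and `smote_like_oversample` are all assumptions made for illustration, and the SVM training step itself is omitted.

```python
import random


def min_max_normalize(rows):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its (up to) k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical candidate vectors: [edit distance, corpus frequency].
features = [[1, 900], [2, 50], [3, 10], [1, 700], [4, 5], [2, 20]]
labels = [1, 0, 0, 1, 0, 0]  # assumption: 1 = correct candidate (minority)

scaled = min_max_normalize(features)
minority = [x for x, y in zip(scaled, labels) if y == 1]
n_missing = labels.count(0) - len(minority)
new_points = smote_like_oversample(minority, n_missing)

balanced_x = scaled + new_points
balanced_y = labels + [1] * len(new_points)
```

After this step both classes have the same number of samples and every feature lies in the same range, which matters for a polynomial kernel because features on very different scales would otherwise dominate the kernel value.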
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Fonseca Cacho, J.R., Taghva, K. (2020). OCR Post Processing Using Support Vector Machines. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1229. Springer, Cham. https://doi.org/10.1007/978-3-030-52246-9_51
DOI: https://doi.org/10.1007/978-3-030-52246-9_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-52245-2
Online ISBN: 978-3-030-52246-9
eBook Packages: Intelligent Technologies and Robotics (R0)