OCR Post Processing Using Support Vector Machines

Fonseca Cacho, Jorge Ramón; Taghva, Kazem

doi:10.1007/978-3-030-52246-9_51

Jorge Ramón Fonseca Cacho¹⁷ &
Kazem Taghva¹⁷

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1229))

Included in the following conference series:

Science and Information Conference

773 Accesses
5 Citations

Abstract

In this paper, we introduce a set of detailed experiment using Support Vector Machines (SVM) to try and improve accuracy selecting the correct candidate word to correct OCR generated errors. We use our alignment algorithm to create a one-to-one correspondence between the OCR text and the clean version of the TREC-5 data set (Confusion Track). We then extract five features from the candidates suggested by the Google web 1T corpus and use them to train and test our SVM model that will then generalize into the rest of the unseen text. We then improve on our initial results using a polynomial kernel, feature standardization with minmax normalization, and class balancing with SMOTE. Finally, we analyze the errors and suggest on future improvements.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 169.00; Price excludes VAT (USA)

Softcover Book: USD 219.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Fonseca Cacho, J.R., Taghva, K., Alvarez, D.: Using the Google Web 1T 5-gram corpus for OCR error correction. In: 16th International Conference on Information Technology-New Generations (ITNG 2019), pp. 505–511. Springer (2019)
Google Scholar
Brants, T., Franz, A.: Web 1T 5-gram version 1 (2006)
Google Scholar
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys. Doklady 10(8), 707–710 (1966)
MathSciNet Google Scholar
Fonseca Cacho, J.R., Taghva, K.: Aligning ground truth text with OCR degraded text. In: Intelligent Computing-Proceedings of the Computing Conference, pp. 815–833. Springer (2019)
Google Scholar
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011). http://www.csie.ntu.edu.tw/~cjlin/libsvm
Article Google Scholar
Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
MATH Google Scholar
Hsu, C.-W., Chang, C.-C., Lin, C.-J., et al.: A practical guide to support vector classification (2003)
Google Scholar
Taghva, K., Stofsky, E.: OCRSpell: an interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recogn. 3(3), 125–137 (2001)
Article Google Scholar
Taghva, K., Nartker, T., Borsack, J.: Information access in the presence of OCR errors. In: Proceedings of the 1st ACM workshop on Hardcopy Document Processing, pp. 1–8. ACM (2004)
Google Scholar
Kantor, P.B., Voorhees, E.M.: The TREC-5 confusion track: comparing retrieval methods for scanned text. Inf. Retrieval 2(2–3), 165–176 (2000)
Article Google Scholar
TREC-5 confusion track. https://trec.nist.gov/data/t5_confusion.html. Accessed 10 Oct 2017
Drakos, G.: Support vector machine vs logistic regression. https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f. Accessed 21 June 2019
Fonseca Cacho, J.R.: Improving OCR post processing with machine learning tools. Ph.D. dissertation, University of Nevada, Las Vegas (2019)
Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Article Google Scholar
Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017). http://jmlr.org/papers/v18/16-365.html
Google Scholar
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008)
Google Scholar
Devi, D., Purkayastha, B., et al.: Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance. Pattern Recogn. Lett. 93, 3–12 (2017)
Article Google Scholar
Fonseca Cacho, J.R., Taghva, K.: Reproducible research in document analysis and recognition. In: Information Technology-New Generations, pp. 389–395. Springer (2018)
Google Scholar
Fonseca Cacho, J.R., Taghva, K.: The state of reproducible research in computer science. In: Latifi, S. (ed.) 17th International Conference on Information Technology-New Generations (ITNG 2020). Advances in Intelligent Systems and Computing, vol. 1134. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43020-7_68
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Nevada, Las Vegas, Las Vegas, USA
Jorge Ramón Fonseca Cacho & Kazem Taghva

Authors

Jorge Ramón Fonseca Cacho
View author publications
You can also search for this author in PubMed Google Scholar
Kazem Taghva
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jorge Ramón Fonseca Cacho .

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Supriya Kapoor
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Rahul Bhatia

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Fonseca Cacho, J.R., Taghva, K. (2020). OCR Post Processing Using Support Vector Machines. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2020. Advances in Intelligent Systems and Computing, vol 1229. Springer, Cham. https://doi.org/10.1007/978-3-030-52246-9_51

Download citation

DOI: https://doi.org/10.1007/978-3-030-52246-9_51
Published: 04 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-52245-2
Online ISBN: 978-3-030-52246-9
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)

Publish with us

Policies and ethics