ABSTRACT
The text of rare books is often hard to read: natural decay or erosion, and ink bleed caused by the flawed printing methods of centuries ago, make such texts difficult to recognize. This compounds the challenge for optical character recognition (OCR), whose task is to convert images of printed text into machine-encoded text once a rare book has been digitized.
To reduce OCR error on rare books, this research applies three text models, N-gram, long short-term memory (LSTM), and backward and forward N-gram (BF N-gram) statistics, trained on a substantial body of text, to develop a more accurate OCR model. We build each model at varying character lengths and train on different quantities of text, then compare how the models perform in order to find the configuration that gives the best character recognition.
Once the best-performing text model is identified, we run further experiments to determine when and how OCR errors are best corrected with the aid of the text model. Our experiments suggest that correction by the text model yields more accurate OCR results than relying on the OCR model alone.
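As a rough illustration of what a backward and forward character N-gram model might look like, the sketch below counts bigrams in both directions and scores a candidate character by how well it fits both its left and right neighbours. The abstract does not define BF N-gram precisely, so the combination rule (a sum of add-one-smoothed log probabilities), the fixed N of 2, and all function names here are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def train_bf_bigram(corpus):
    """Count character bigrams in both directions (hypothetical helper)."""
    forward, backward = Counter(), Counter()
    left_ctx, right_ctx = Counter(), Counter()
    vocab = set()
    for line in corpus:
        vocab.update(line)
        for left, right in zip(line, line[1:]):
            forward[(left, right)] += 1   # right char seen after left char
            backward[(right, left)] += 1  # left char seen before right char
            left_ctx[left] += 1
            right_ctx[right] += 1
    return forward, backward, left_ctx, right_ctx, vocab


def bf_score(model, prev_char, candidate, next_char):
    """Log P(candidate | prev) + log P(candidate | next), add-one smoothed."""
    forward, backward, left_ctx, right_ctx, vocab = model
    v = max(len(vocab), 1)
    p_fwd = (forward[(prev_char, candidate)] + 1) / (left_ctx[prev_char] + v)
    p_bwd = (backward[(next_char, candidate)] + 1) / (right_ctx[next_char] + v)
    return math.log(p_fwd) + math.log(p_bwd)


def best_candidate(model, prev_char, candidates, next_char):
    """Pick the candidate that best fits both neighbouring characters."""
    return max(candidates, key=lambda c: bf_score(model, prev_char, c, next_char))


# Toy corpus, assumed purely for demonstration.
toy_model = train_bf_bigram(["天下太平", "天下大同", "天下太平"])
choice = best_candidate(toy_model, "天", ["不", "下"], "太")
```

Using context on both sides is what distinguishes this from a plain forward N-gram, which would look only at the preceding characters; longer contexts would follow the same pattern with larger tuples as keys.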
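One way to picture the "when and how" of correction is a re-ranking step: accept the OCR engine's top choice when its confidence is high, and otherwise let the text model re-rank the OCR candidates. The sketch below assumes the OCR engine exposes several candidates per character position with confidences in (0, 1], and that the text model is available as a function returning a log probability for a character given the previous one. The threshold, the weighting, and every name in the code are hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One position's OCR output: (character, confidence in (0, 1]) pairs.
Candidates = List[Tuple[str, float]]


def correct_line(
    ocr_output: Sequence[Candidates],
    lm_logprob: Callable[[str, str], float],
    threshold: float = 0.9,
    lm_weight: float = 1.0,
) -> str:
    """Re-rank low-confidence OCR positions with a text model (sketch)."""
    result: List[str] = []
    for candidates in ocr_output:
        top_char, top_conf = max(candidates, key=lambda c: c[1])
        # "When": trust the OCR engine if it is confident, or if there is
        # no left context yet for the text model to use.
        if top_conf >= threshold or not result:
            result.append(top_char)
            continue
        prev = result[-1]
        # "How": combine OCR confidence and text model score in log space.
        best = max(
            candidates,
            key=lambda c: math.log(c[1]) + lm_weight * lm_logprob(prev, c[0]),
        )
        result.append(best[0])
    return "".join(result)


def toy_lm(prev: str, char: str) -> float:
    """Stand-in text model with made-up log probabilities."""
    return {("天", "下"): -0.1}.get((prev, char), -5.0)


toy_ocr = [[("天", 0.99)], [("不", 0.55), ("下", 0.45)], [("太", 0.95)]]
corrected = correct_line(toy_ocr, toy_lm)
```

Lowering `threshold` makes the text model intervene less often, so the same function can be used to compare correction policies on held-out pages.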
Index Terms
- Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model