Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model

Authors:
Hsiang-An Wang, Center for Digital Cultures, Academia Sinica, Taipei, Taiwan
Pin-Ting Liu, Computer Science and Engineering, Yuan Ze University, Taoyuan, Taiwan

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, May 2019, Pages 15–18
https://doi.org/10.1145/3322905.3322922

Published: 08 May 2019

ABSTRACT

The text of rare books is often hard to read: natural decay, erosion, and ink bleed caused by flawed printing methods centuries ago make the characters difficult to recognize. This raises the difficulty of optical character recognition (OCR), whose task is to convert images of printed text into machine-encoded text once a rare book has been digitized.

To reduce OCR errors for rare books, this research trains three text models on a substantial corpus of text, with the goal of a more accurate OCR model: N-gram, long short-term memory (LSTM), and backward and forward N-gram (BF N-gram) statistical models. We build each model at varying character lengths and on different quantities of training text, and compare how the models perform in order to find the configuration with the best character recognition performance.
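As a rough illustration of the idea behind a backward and forward N-gram model, the sketch below scores a candidate character using both its left and its right neighbour. The abstract does not define the BF N-gram model, so everything here is an assumption made for illustration: the bigram order, the add-alpha smoothing, the product of the two probabilities, and the names `train_bigrams` and `bf_score` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def train_bigrams(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    bigrams, unigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return bigrams, unigrams


def bf_score(candidate, left, right, bigrams, unigrams, alpha=1.0):
    """Score a candidate by its fit with the left AND the right neighbour.

    Assumption: add-alpha smoothed bigram estimates, multiplied together.
    """
    vocab = len(unigrams)
    # Forward direction: how often does `candidate` follow `left`?
    forward = (bigrams[(left, candidate)] + alpha) / (unigrams[left] + alpha * vocab)
    # Backward direction: how often does `candidate` precede `right`?
    backward = (bigrams[(candidate, right)] + alpha) / (unigrams[right] + alpha * vocab)
    return forward * backward


# Toy training text; a real model would be trained on a large corpus.
corpus = ["天地玄黃", "天地人和", "大地回春"]
bigrams, unigrams = train_bigrams(corpus)

# Choose the most plausible character between "天" and "玄".
candidates = ["他", "地", "池"]
best = max(candidates, key=lambda c: bf_score(c, "天", "玄", bigrams, unigrams))
print(best)  # 地
```

Using context on both sides is what distinguishes this from a plain forward N-gram, which would only look at the characters before the candidate.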

Once the best-performing text model is identified, we run further experiments to determine when and how it should be applied to correct OCR errors. Our experiments suggest that correction with the text model yields more accurate OCR results than relying on the OCR model alone.
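One possible reading of "when and how" to correct is sketched below: the text model is consulted only where the OCR engine is unsure, and it reranks the engine's candidate characters. This is a minimal sketch under stated assumptions, not the procedure reported in the paper. The confidence threshold, the candidate-list input format, the product used for reranking, and the names `correct_line` and `toy_lm` are all hypothetical.

```python
def correct_line(ocr_output, lm_score, threshold=0.9):
    """Rerank low-confidence OCR characters with a text model.

    ocr_output: one candidate list per character position, best first,
                each entry being (character, ocr_confidence).
    lm_score:   callable (candidate, left, right) -> plausibility score.
    """
    # Start from the OCR engine's top choice at every position.
    chars = [cands[0][0] for cands in ocr_output]
    for i, cands in enumerate(ocr_output):
        if cands[0][1] >= threshold:
            continue  # assumption: confident OCR decisions are left untouched
        left = chars[i - 1] if i > 0 else ""
        right = chars[i + 1] if i + 1 < len(chars) else ""
        # Combine OCR confidence with the text model's score for each candidate.
        chars[i] = max(cands, key=lambda c: c[1] * lm_score(c[0], left, right))[0]
    return "".join(chars)


def toy_lm(candidate, left, right):
    """Stand-in text model that knows exactly one three-character sequence."""
    return 0.9 if (left, candidate, right) == ("天", "地", "玄") else 0.1


ocr_output = [
    [("天", 0.99)],
    [("他", 0.50), ("地", 0.40)],  # the OCR engine is unsure here
    [("玄", 0.95)],
]
print(correct_line(ocr_output, toy_lm))  # 天地玄
```

Any scoring function with the same three-argument shape can be passed as `lm_score`, so an N-gram, BF N-gram, or LSTM-based scorer could be swapped in for comparison.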

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Evaluation

Published in

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
May 2019
163 pages
ISBN:9781450371940
DOI:10.1145/3322905

Copyright © 2019 ACM
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
LSTM
N-gram
OCR
text models
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 60 of 86 submissions, 70%