DOI: 10.1145/3322905.3322922
research-article

Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model

Published: 08 May 2019

ABSTRACT

The text of rare books is often hard to read: natural decay, erosion, and ink bleed caused by flawed printing methods centuries ago make such texts difficult to recognize. This compounds the challenge for optical character recognition (OCR), whose task is to convert images of printed text into machine-encoded text once a rare book has been digitized.

To reduce the OCR error rate for rare books, this research applies N-gram, long short-term memory (LSTM), and backward and forward N-gram (BF N-gram) statistical text models, trained on a substantial body of text, to develop a more accurate OCR model. We build these models at varying character lengths and train them on different quantities of text, then compare how they perform in order to identify the configuration that gives the best character recognition.
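
As a rough illustration of what a backward and forward character N-gram score can look like, the Python sketch below scores a candidate character against both its left and its right neighbour using bigram counts. It is not the authors' implementation: the function names `train_bigrams` and `bf_score`, the choice of bigrams (N = 2), and the add-one smoothing are all assumptions made for this example.

```python
from collections import Counter
import math


def train_bigrams(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def bf_score(prev_ch, cand, next_ch, unigrams, bigrams):
    """Log-score of `cand` given its left and right neighbours.

    Assumption for this sketch: add-one smoothing over the observed
    vocabulary plus one slot reserved for unseen characters.
    """
    vocab = len(unigrams) + 1
    # Forward direction: how often `cand` follows `prev_ch`.
    forward = (bigrams[(prev_ch, cand)] + 1) / (unigrams[prev_ch] + vocab)
    # Backward direction: how often `cand` precedes `next_ch`.
    backward = (bigrams[(cand, next_ch)] + 1) / (unigrams[next_ch] + vocab)
    return math.log(forward) + math.log(backward)
```

With counts from a training corpus, a character that was seen between the same two neighbours scores higher than one that was not, which is the signal a correction step can use to choose between visually similar OCR candidates.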

Once the best-performing text model is identified, we run further experiments to determine the most appropriate time and method for correcting OCR errors with its aid. Our experiments suggest that correction by the text model yields more accurate OCR results than relying on OCR models alone.
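
One possible reading of "time and method" is sketched below in Python: the text model is consulted only at positions where the OCR engine is unsure, and it picks among the engine's own candidate characters. This is a hypothetical setup rather than the procedure reported in the paper. The function `correct_line`, the `min_conf` threshold, the per-position candidate lists, and the `score(prev, cand, next)` callback are all assumptions of this example.

```python
def correct_line(ocr_candidates, score, min_conf=0.9):
    """Re-rank low-confidence OCR characters with a text-model score.

    ocr_candidates: one list per character position, each holding
        (character, ocr_confidence) pairs sorted by confidence, highest first.
    score: callable (prev_ch, cand, next_ch) -> float, higher is better.
        Any text model can be plugged in here (hypothetical interface).
    min_conf: positions whose top OCR confidence reaches this value
        are left untouched.
    """
    # Start from the OCR engine's top choice at every position.
    chars = [cands[0][0] for cands in ocr_candidates]
    for i, cands in enumerate(ocr_candidates):
        if cands[0][1] >= min_conf:
            continue  # the OCR engine is confident; keep its answer
        prev_ch = chars[i - 1] if i > 0 else ""
        next_ch = chars[i + 1] if i + 1 < len(chars) else ""
        # On a tie, max() keeps the earliest candidate, i.e. the OCR top choice.
        chars[i] = max(cands, key=lambda c: score(prev_ch, c[0], next_ch))[0]
    return "".join(chars)
```

Passing the scorer in as a callback keeps the correction step independent of the model family, so an N-gram, LSTM, or BF N-gram scorer could be swapped in without changing this function.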

    • Published in
      DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
      May 2019
      163 pages
      ISBN: 9781450371940
      DOI: 10.1145/3322905

      Copyright © 2019 ACM

      © 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 60 of 86 submissions, 70%