Published March 7, 2022 | Version v1
Conference paper Open

Building and Improving an OCR Classifier for Republican Chinese Newspaper Text

  • 1. Heidelberg Centre for Transcultural Studies, Universität Heidelberg, Germany
  • 2. Institut für Computerlinguistik, Universität Heidelberg, Germany
  • 1. Universität der Bundeswehr München, Germany
  • 2. Universität Potsdam, Germany
  • 3. Digital Humanities im deutschsprachigen Raum e.V., Germany

Description

In this paper we present the first results of a systematic approach to full-text extraction from a newspaper of Republican China. Our basis is a small corpus for which a ground truth also exists. We present our character segmentation method, which produces about 70,000 character images. Based on the hypothesis that pre-training on extensive amounts of suitably augmented character images will increase OCR accuracy when evaluating on real-life character image data, we generated additional synthetic training data. We then compare the OCR recognition results and show that a combination of synthetic and real characters produces the best results. Finally, we propose a method that uses a masked language model for OCR error correction.

A contribution to the 8th conference of the association "Digital Humanities im deutschsprachigen Raum" - DHd 2022 Kulturen des digitalen Gedächtnisses.

Files

ARNOLD_Matthias_Building_and_Improving_an_OCR_Classifier_for.pdf

Additional details

Related works

Is part of
Book: 10.5281/zenodo.6304590 (DOI)
Is supplemented by
Poster: 10.5281/zenodo.6322593 (DOI)