Building and Improving an OCR Classifier for Republican Chinese Newspaper Text

doi:10.5281/zenodo.6327919

Published March 7, 2022 | Version v1

Conference paper Open

1. Heidelberg Centre for Transcultural Studies, Universität Heidelberg, Germany
2. Institut für Computerlinguistik, Universität Heidelberg, Germany

Editors:

1. Universität der Bundeswehr München, Germany
2. Universität Potsdam, Germany
3. Digital Humanities im deutschsprachigen Raum e.V., Germany

In this paper we present the first results of a systematic approach to full-text extraction from a Republican China newspaper. Our basis is a small corpus for which a ground truth also exists. We present our character segmentation method, which produces about 70,000 character images. Based on the hypothesis that pre-training on extensive amounts of suitably augmented character images will increase OCR accuracy when evaluating on real-life character image data, we generated additional synthetic training data. We then compare the OCR recognition results and show that a combination of synthetic and real characters produces the best results. Finally, we propose a method that uses a masked language model for OCR error correction.

A contribution to the 8th conference of the association "Digital Humanities im deutschsprachigen Raum" - DHd 2022 Kulturen des digitalen Gedächtnisses.

Files

Files (2.7 MB)

Name	Size
ARNOLD_Matthias_Building_and_Improving_an_OCR_Classifier_for.pdf md5:b4dda6d65c69001be9b04c36555e0193	2.7 MB
ARNOLD_Matthias_Building_and_Improving_an_OCR_Classifier_for.xml md5:9378b04f648b169ab3ba7ae1ba1788f3	33.0 kB

Additional details

Is part of: Book: 10.5281/zenodo.6304590 (DOI)
Is supplemented by: Poster: 10.5281/zenodo.6322593 (DOI)

