Building and Improving an OCR Classifier for Republican Chinese Newspaper Text
Creators
- 1. Heidelberg Centre for Transcultural Studies, Universität Heidelberg, Germany
- 2. Institut für Computerlinguistik, Universität Heidelberg, Germany
Contributors
- 1. Universität der Bundeswehr München, Germany
- 2. Universität Potsdam, Germany
- 3. Digital Humanities im deutschsprachigen Raum e.V., Germany
Description
In this paper we present the first results of a systematic approach to full-text extraction from a newspaper published in Republican China. Our basis is a small corpus for which a ground truth also exists. We present our character segmentation method, which produces about 70,000 character images. Based on the hypothesis that pre-training on extensive amounts of suitably augmented character images will increase OCR accuracy when evaluated on real-life character image data, we generated additional synthetic training data. We then compare the OCR recognition results and show that a combination of synthetic and real characters produces the best results. Finally, we propose a method that uses a masked language model for OCR error correction.
A contribution to the 8th conference of the association "Digital Humanities im deutschsprachigen Raum" - DHd 2022 Kulturen des digitalen Gedächtnisses.
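The last step named in the description, OCR error correction with a masked language model, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `correct_with_mlm`, the parameters `threshold` and `weight`, the log-linear score combination, and the lookup-table `toy_mlm` that stands in for a real masked language model are all assumptions made for illustration. The idea shown is only that low-confidence classifier outputs are re-ranked using the probability a language model assigns to each candidate character in its masked position.

```python
import math
from typing import Callable, Dict, List

# Hypothetical types: per position, the OCR classifier's candidate characters
# with their probabilities, and a masked-LM callable returning P(char | context).
Candidates = Dict[str, float]
MaskedLM = Callable[[List[str], int], Dict[str, float]]


def correct_with_mlm(chars: List[str], candidates: List[Candidates],
                     mlm: MaskedLM, threshold: float = 0.9,
                     weight: float = 0.5) -> List[str]:
    """Re-rank low-confidence OCR characters with masked-LM probabilities."""
    result = list(chars)
    for i, cands in enumerate(candidates):
        if cands.get(chars[i], 0.0) >= threshold:
            continue  # the classifier is confident; keep its output
        lm_probs = mlm(result, i)  # position i is treated as masked

        def score(c: str) -> float:
            # Assumed log-linear mix of classifier and language-model evidence.
            ocr_p = max(cands[c], 1e-9)
            lm_p = max(lm_probs.get(c, 0.0), 1e-9)
            return (1 - weight) * math.log(ocr_p) + weight * math.log(lm_p)

        result[i] = max(cands, key=score)
    return result


def toy_mlm(tokens: List[str], i: int) -> Dict[str, float]:
    """Toy stand-in for a masked LM: looks up the left and right neighbours."""
    left = tokens[i - 1] if i > 0 else ""
    right = tokens[i + 1] if i + 1 < len(tokens) else ""
    table = {("華", "國"): {"民": 0.95, "氏": 0.01}}  # invented probabilities
    return table.get((left, right), {})


# Invented example: the classifier slightly prefers the wrong character at index 2.
ocr_text = ["中", "華", "氏", "國"]
ocr_candidates = [
    {"中": 0.99},
    {"華": 0.97},
    {"氏": 0.5, "民": 0.4},
    {"國": 0.98},
]
corrected = correct_with_mlm(ocr_text, ocr_candidates, toy_mlm)
print("".join(corrected))
```

In practice the toy lookup table would be replaced by a trained masked language model, and `threshold` and `weight` would need to be tuned against ground-truth data.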
Files
ARNOLD_Matthias_Building_and_Improving_an_OCR_Classifier_for.pdf
(2.7 MB)
Name | Size |
---|---|
md5:b4dda6d65c69001be9b04c36555e0193 | 2.7 MB |
md5:9378b04f648b169ab3ba7ae1ba1788f3 | 33.0 kB |
Additional details
Related works
- Is part of
- Book: 10.5281/zenodo.6304590 (DOI)
- Is supplemented by
- Poster: 10.5281/zenodo.6322593 (DOI)