Paper: An Assessment of the Impact of OCR Noise on Language Models

Authors: Konstantin Todorov and Giovanni Colavizza

Affiliation: Institute for Logic, Language and Computation (ILLC), University of Amsterdam, The Netherlands

Keyword(s): Machine Learning, Language Models, Optical Character Recognition (OCR).

Abstract: Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality lowers. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.

CC BY-NC-ND 4.0

Paper citation in several formats:
Todorov, K. and Colavizza, G. (2022). An Assessment of the Impact of OCR Noise on Language Models. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART; ISBN 978-989-758-547-0; ISSN 2184-433X, SciTePress, pages 674-683. DOI: 10.5220/0010945100003116

@conference{icaart22,
author={Konstantin Todorov and Giovanni Colavizza},
title={An Assessment of the Impact of OCR Noise on Language Models},
booktitle={Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2022},
pages={674-683},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010945100003116},
isbn={978-989-758-547-0},
issn={2184-433X},
}

TY - CONF
JO - Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - An Assessment of the Impact of OCR Noise on Language Models
SN - 978-989-758-547-0
IS - 2184-433X
AU - Todorov, K.
AU - Colavizza, G.
PY - 2022
SP - 674
EP - 683
DO - 10.5220/0010945100003116
PB - SciTePress