ReadOCR: A Novel Dataset and Readability Assessment of OCRed Texts

Nguyen, Hai Thi Tuyet; Jatowt, Adam; Coustaty, Mickaël; Doucet, Antoine

doi:10.1007/978-3-031-06555-2_32

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13237)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Results of digitisation projects sometimes suffer from the limitations of optical character recognition (OCR) software, which is mainly designed for modern texts. Prior work has examined the impact of OCR errors on information retrieval (IR) and downstream natural language processing (NLP) tasks. However, questions remain open regarding the actual readability of OCRed text for end users, especially since traditional OCR quality metrics capture only syntactic or surface features. This paper proposes a novel dataset and conducts a pilot study to investigate these questions.

Notes

1. https://tinyurl.com/ReadOCR
2. https://www.kaggle.com/c/commonlitreadabilityprize/data
3. https://en.wikipedia.org
4. https://www.africanstorybook.org
5. https://www.commonlit.org
6. The three annotators are sophomores: two are law students and one is an information technology student.

Acknowledgements

This work has been supported by the “ANNA” and “Au-delà des Pyrénées” projects funded by the Nouvelle-Aquitaine region.

Author information

Authors and Affiliations

Posts and Telecommunications Institute of Technology, Ho Chi Minh, Vietnam
Hai Thi Tuyet Nguyen
Department of Computer Science, University of Innsbruck, Innsbruck, Austria
Adam Jatowt
L3i, La Rochelle University, La Rochelle, France
Mickaël Coustaty & Antoine Doucet

Corresponding author

Correspondence to Hai Thi Tuyet Nguyen.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Seiichi Uchida
Boise State University, Boise, ID, USA
Elisa Barney
LIRIS UMR CNRS, Villeurbanne, France
Véronique Eglin

About this paper

Cite this paper

Nguyen, H.T.T., Jatowt, A., Coustaty, M., Doucet, A. (2022). ReadOCR: A Novel Dataset and Readability Assessment of OCRed Texts. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_32

DOI: https://doi.org/10.1007/978-3-031-06555-2_32
Published: 18 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)
