Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection

Kettunen, Kimmo

doi:10.1007/978-3-319-41938-1_11

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 612))

Included in the following conference series:

Italian Research Conference on Digital Libraries

Abstract

There has been huge interest in the digitization of both handwritten and printed historical material over the last 10–15 years, and this interest will most probably only increase in the ongoing Digital Humanities era. As a result, many digital historical document collections are already available, and more will follow.

The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 [1–3]; the collection, Digi, can be reached at http://digi.kansalliskirjasto.fi/. The collection contains approximately 1.95 million pages in Finnish and Swedish, of which the Finnish part amounts to about 2.385 billion words. Errors are common in the output of the Optical Character Recognition (OCR) process, especially when the texts are printed in the Gothic (Fraktur, blackletter) typeface. These errors lower the usability of the corpus both for human users and for more elaborate text mining applications. Automatic spell checking and correction of the data is also difficult because of historical spelling variants and the low OCR quality of the material.

This paper discusses the overall situation of the intended post-correction of the Digi content and the evaluation of that correction. We present results of our post-correction trials and discuss some aspects of evaluation methodology. These are the first reported evaluation results of post-correction of the data, and the experience gained will be used in planning the post-correction of the whole material.

References

Bremer-Laamanen, M.-L.: A Nordic digital newspaper library. Int. Preserv. News 26, 18–20 (2001)
Bremer-Laamanen, M.-L.: Connecting to the past – newspaper digitization in the Nordic countries. World Library and Information Congress. In: 71th IFLA General Conference and Council, “Libraries - A voyage of discovery”, 14th - 18th August 2005, Oslo, Norway (2005). http://archive.ifla.org/IV/ifla71/papers/019e-Bremer-Laamanen.pdf
Bremer-Laamanen, M.-L.: In the spotlight for crowdsourcing. Scand. Librarian Q. 1, 18–21 (2014)
Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T., Kervinen, J.: Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods. In: Proceeding of IFLA 2014, Lyon (2014). http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf
Holley, R.: How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Mag. 3 (2009). http://www.dlib.org/dlib/march09/holley/03holley.html
Furrer, L., Volk, M.: Reducing OCR errors in Gothic-script documents. In: Proceedings of Language Technologies for Digital Humanities and Cultural Heritage Workshop, Hissar, Bulgaria, pp. 97–103 (2011)
Klijn, E.: The current state-of-art in newspaper digitization. A market perspective. D-Lib Mag. 14, 5 (2008). http://www.dlib.org/dlib/january08/klijn/01klijn.html
Niklas, K.: Unsupervised post-correction of OCR errors. Diploma thesis, Leibniz Universität, Hannover (2010). www.l3s.de/~tahmasebi/Diplomarbeit_Niklas.pdf
Taghva, K., Borsack, J., Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text. ACM Trans. Inf. Syst. 14, 64–93 (1996)
Tanner, S., Muñoz, T., Ros, P.H.: Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British Library’s 19th century online newspaper archive. D-Lib Mag. 15 (2009). http://www.dlib.org/dlib/july09/munoz/07munoz.html
Lopresti, D.: Optical character recognition errors and their effects on natural language processing. Int. J. Doc. Anal. Recogn. 12, 141–151 (2009)
Chrons, O., Sundell, S.: Digitalkoot: making old archives accessible using crowdsourcing. In: Human Computation, Papers from the 2011 AAAI Workshop (2011). http://www.aaai.org/ocs/index.php/WS/AAAIW11/paper/view/3813/4246
Kettunen, K., Pääkkönen, T.: How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources. In: LREC 2016 (2016). http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf
Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 377–439 (1992)
Manning, C.D., Schütze, H.: Foundations of Statistical Language Processing. The MIT Press, Cambridge (1999)
Norvig, P.: How to write a spelling corrector (2008). norvig.com/spell-correct.htm

Acknowledgments

This research was funded by the EU Commission through its European Regional Development Fund, and the program Leverage from the EU 2007–2013.

Author information

Authors and Affiliations

Center for Preservation and Digitisation, National Library of Finland, Mikkeli, Finland
Kimmo Kettunen

Authors

Kimmo Kettunen

Corresponding author

Correspondence to Kimmo Kettunen.

Editor information

Editors and Affiliations

Fac. di Scienze e Tecnologie Informatiche, Libera Univ. di Bolzano, Bolzano, Italy
Diego Calvanese
Dip. di Sci. Matematic., Info. e Fisiche, Università degli Studi di Udine, Udine, Italy
Dario De Nart
Dip. di Sci. Matematic., Info. e Fisiche, Università degli Studi di Udine, Udine, Italy
Carlo Tasso

About this paper

Cite this paper

Kettunen, K. (2016). Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection. In: Calvanese, D., De Nart, D., Tasso, C. (eds) Digital Libraries on the Move. IRCDL 2015. Communications in Computer and Information Science, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-319-41938-1_11

DOI: https://doi.org/10.1007/978-3-319-41938-1_11
Published: 01 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41937-4
Online ISBN: 978-3-319-41938-1
eBook Packages: Computer Science, Computer Science (R0)