Abstract
Interest in digitizing both handwritten and printed historical material has been considerable over the last 10–15 years, and it will most probably only increase in the ongoing Digital Humanities era. As a result, many digital historical document collections are already available, and more will follow.
The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 [1–3]; the collection, Digi, can be reached at http://digi.kansalliskirjasto.fi/. This collection contains approximately 1.95 million pages in Finnish and Swedish; the Finnish part comprises about 2.385 billion words. Errors are common in the output of the Optical Character Recognition (OCR) process, especially when the texts are printed in the Gothic (Fraktur, blackletter) typeface. The errors lower the usability of the corpus both for human users and for more elaborate text mining applications. Automatic spell checking and correction of the data is also difficult because of historical spelling variants and the low OCR quality of the material.
This paper discusses the overall situation of the intended post-correction of the Digi content and the evaluation of that correction. We present results of our post-correction trials and discuss some aspects of evaluation methodology. These are the first reported evaluation results of post-correction of the data, and the experience gained will be used in planning the post-correction of the whole material.
References
Bremer-Laamanen, M.-L.: A Nordic digital newspaper library. Int. Preserv. News 26, 18–20 (2001)
Bremer-Laamanen, M.-L.: Connecting to the past – newspaper digitization in the Nordic countries. World Library and Information Congress. In: 71th IFLA General Conference and Council, “Libraries - A voyage of discovery”, 14th - 18th August 2005, Oslo, Norway (2005). http://archive.ifla.org/IV/ifla71/papers/019e-Bremer-Laamanen.pdf
Bremer-Laamanen, M.-L.: In the spotlight for crowdsourcing. Scand. Librarian Q. 1, 18–21 (2014)
Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T., Kervinen, J.: Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods. In: Proceedings of IFLA 2014, Lyon (2014). http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf
Holley, R.: How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Mag. 3 (2009). http://www.dlib.org/dlib/march09/holley/03holley.html
Furrer, L., Volk, M.: Reducing OCR errors in Gothic-script documents. In: Proceedings of Language Technologies for Digital Humanities and Cultural Heritage Workshop, Hissar, Bulgaria, pp. 97–103 (2011)
Klijn, E.: The current state-of-art in newspaper digitization. A market perspective. D-Lib Mag. 14, 5 (2008). http://www.dlib.org/dlib/january08/klijn/01klijn.html
Niklas, K.: Unsupervised post-correction of OCR errors. Diploma thesis, Leibniz Universität, Hannover (2010). www.l3s.de/~tahmasebi/Diplomarbeit_Niklas.pdf
Taghva, K., Borsack, J., Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text. ACM Trans. Inf. Syst. 14, 64–93 (1996)
Tanner, S., Muñoz, T., Ros, P.H.: Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British Library’s 19th Century Online Newspaper Archive. D-Lib Mag. 15 (2009). http://www.dlib.org/dlib/july09/munoz/07munoz.html
Lopresti, D.: Optical character recognition errors and their effects on natural language processing. Int. J. Doc. Anal. Recogn. 12, 141–151 (2009)
Chrons, O., Sundell, S.: Digitalkoot: making old archives accessible using crowdsourcing. In: Human Computation, Papers from the 2011 AAAI Workshop (2011). http://www.aaai.org/ocs/index.php/WS/AAAIW11/paper/view/3813/4246
Kettunen, K., Pääkkönen, T.: How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources. In: LREC 2016 (2016). http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf
Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 377–439 (1992)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
Norvig, P.: How to write a spelling corrector (2008). norvig.com/spell-correct.htm
Acknowledgments
This research was funded by the EU Commission through its European Regional Development Fund, and the program Leverage from the EU 2007–2013.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kettunen, K. (2016). Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection. In: Calvanese, D., De Nart, D., Tasso, C. (eds) Digital Libraries on the Move. IRCDL 2015. Communications in Computer and Information Science, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-319-41938-1_11
DOI: https://doi.org/10.1007/978-3-319-41938-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41937-4
Online ISBN: 978-3-319-41938-1
eBook Packages: Computer Science (R0)