Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection

  • Conference paper
Digital Libraries on the Move (IRCDL 2015)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 612)

Abstract

Interest in digitizing both handwritten and printed historical material has been strong over the last 10–15 years, and it will most probably only increase in the ongoing Digital Humanities era. As a result, many digital historical document collections are already available, and more will follow.

The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 [13]; the collection, Digi, can be reached at http://digi.kansalliskirjasto.fi/. This collection contains approximately 1.95 million pages in Finnish and Swedish, the Finnish part being about 2.385 billion words. Errors are common in the output of the Optical Character Recognition (OCR) process, especially when the texts are printed in the Gothic (Fraktur, blackletter) typeface. The errors lower the usability of the corpus both for human users and for more elaborate text mining applications. Automatic spell checking and correction of the data is also difficult because of historical spelling variants and the low OCR quality of the material.

This paper discusses the overall situation of the intended post-correction of the Digi content and the evaluation of the correction. We present results of our post-correction trials and discuss some aspects of evaluation methodology. These are the first reported evaluation results of post-correction of the data, and the experience will be used in planning the post-correction of the whole material.
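
As one way to picture the kind of evaluation discussed here, the following minimal sketch compares OCR output and its post-corrected version against a ground truth, word by word. It is an illustration under stated assumptions, not the method used in the paper: the function names `word_accuracy` and `evaluate_correction`, the pre-aligned token lists, and the sample words are all hypothetical.

```python
def word_accuracy(tokens, truth):
    """Share of positions where a token equals the ground-truth token.

    Assumes both lists are already aligned one-to-one (an assumption of
    this sketch; real OCR output usually needs an alignment step first).
    """
    if len(tokens) != len(truth):
        raise ValueError("token lists must be aligned and equally long")
    if not truth:
        return 0.0
    return sum(t == g for t, g in zip(tokens, truth)) / len(truth)


def evaluate_correction(ocr, corrected, truth):
    """Compare word accuracy before and after post-correction."""
    # Words that were wrong in the OCR output and are right after correction.
    fixed = sum(o != g and c == g for o, c, g in zip(ocr, corrected, truth))
    # Words that were right in the OCR output and are wrong after correction.
    broken = sum(o == g and c != g for o, c, g in zip(ocr, corrected, truth))
    return {
        "accuracy_before": word_accuracy(ocr, truth),
        "accuracy_after": word_accuracy(corrected, truth),
        "fixed": fixed,
        "broken": broken,
    }


# Hypothetical sample data, invented for illustration only.
truth = ["suomen", "kansa", "on", "vapaa", "maa"]
ocr = ["fuomen", "kansa", "on", "wapaa", "maa"]
corrected = ["suomen", "kansa", "ou", "wapaa", "maa"]

result = evaluate_correction(ocr, corrected, truth)
print(result)
```

Counting `broken` words separately from `fixed` ones matters because a corrector can leave overall accuracy unchanged, as in this sample, while still altering words that were already right.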

Notes

  1. https://github.com/flammie/omorfi

  2. https://code.google.com/p/isri-ocr-evaluation-tools/

  3. http://awk.info/?doc/tools/spellcheck.html

  4. http://kaino.kotus.fi/sanat/taajuuslista/vns_frek.zip

  5. http://kaino.kotus.fi/korpus/1800/meta/1800_coll_rdf.xml

References

  1. Bremer-Laamanen, M.-L.: A Nordic digital newspaper library. Int. Preserv. News 26, 18–20 (2001)

  2. Bremer-Laamanen, M.-L.: Connecting to the past – newspaper digitization in the Nordic countries. World Library and Information Congress. In: 71th IFLA General Conference and Council, “Libraries - A voyage of discovery”, 14th - 18th August 2005, Oslo, Norway (2005). http://archive.ifla.org/IV/ifla71/papers/019e-Bremer-Laamanen.pdf

  3. Bremer-Laamanen, M.-L.: In the spotlight for crowdsourcing. Scand. Librarian Q. 1, 18–21 (2014)

  4. Kettunen, K., Honkela, T., Lindén, K., Kauppinen, P., Pääkkönen, T., Kervinen, J.: Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods. In: Proceeding of IFLA 2014, Lyon (2014). http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf

  5. Holley, R.: How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Mag. 3 (2009). http://www.dlib.org/dlib/march09/holley/03holley.html

  6. Furrer, L., Volk, M.: Reducing OCR errors in Gothic-script documents. In: Proceedings of Language Technologies for Digital Humanities and Cultural Heritage Workshop, Hissar, Bulgaria, pp. 97–103 (2011)

  7. Klijn, E.: The current state-of-art in newspaper digitization. a market perspective. D-Lib Mag. 14, 5 (2008). http://www.dlib.org/dlib/january08/klijn/01klijn.html

  8. Niklas, K.: Unsupervised post-correction of OCR errors. Diploma thesis, Leibniz Universität, Hannover (2010). www.l3s.de/~tahmasebi/Diplomarbeit_Niklas.pdf

  9. Taghva, K., Borsack, J., Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text. ACM Trans. Inf. Syst. 14, 64–93 (1996)

  10. Tanner, S., Muñoz, T., Ros, P.H.: Measuring mass text digitization quality and usefulness. Lessons learned from assessing the OCR accuracy of the British Library’s 19th century online newspaper archive. D-Lib Mag. 15 (2009). http://www.dlib.org/dlib/july09/munoz/07munoz.html

  11. Lopresti, D.: Optical character recognition errors and their effects on natural language processing. Int. J. Doc. Anal. Recogn. 12, 141–151 (2009)

  12. Chrons, O., Sundell, S.: Digitalkoot: making old archives accessible using crowdsourcing. In: Human Computation, Papers from the 2011 AAAI Workshop (2011). http://www.aaai.org/ocs/index.php/WS/AAAIW11/paper/view/3813/4246

  13. Kettunen, K., Pääkkönen, T.: How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources. In: LREC 2016 (2016). http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf

  14. Kukich, K.: Techniques for automatically correcting words in text. ACM Comput. Surv. 24, 377–439 (1992)

  15. Manning, C.D., Schütze, H.: Foundations of Statistical Language Processing. The MIT Press, Cambridge (1999)

  16. Norvig, P.: How to write a spelling corrector (2008). norvig.com/spell-correct.htm

Acknowledgments

This research was funded by the EU Commission through its European Regional Development Fund, and the program Leverage from the EU 2007–2013.

Author information

Corresponding author

Correspondence to Kimmo Kettunen.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kettunen, K. (2016). Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection. In: Calvanese, D., De Nart, D., Tasso, C. (eds) Digital Libraries on the Move. IRCDL 2015. Communications in Computer and Information Science, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-319-41938-1_11

  • DOI: https://doi.org/10.1007/978-3-319-41938-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41937-4

  • Online ISBN: 978-3-319-41938-1

  • eBook Packages: Computer Science (R0)
