Facing Uncertainty in Digitisation

Gander, Lukas; Reffle, Ulrich; Ringlstetter, Christoph; Schlarb, Sven; Schulz, Klaus; Unterweger, Raphael

doi:10.1007/978-3-642-24672-2_10

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 273)

Abstract

In practice, digitisation and text recognition refer to a processing chain that starts with scanning the original assets (newspapers, books, manuscripts, etc.) and creating digital images of their pages, which form the basis for producing digital text documents. The core process is Optical Character Recognition (OCR). It is preceded by image enhancement steps such as deskewing and denoising, and followed by post-processing steps such as linguistic correction of OCR errors or enrichment of the OCR results, for example adding layout information and identifying semantic units of a page (e.g. the page number). This paper focuses on the post-processing steps and outlines two selected research areas of the European project IMPACT (IMProving ACcess to Text). Firstly, we present a technology for OCR and information retrieval on historical document collections and discuss the potential use of fuzzy logic. Secondly, we present the Functional Extension Parser, a software tool that implements a fuzzy rule-based system for detecting and reconstructing some of the main features of a digitised book from the OCR results of the digitised images.
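To make the idea of a fuzzy rule over OCR results more concrete, the following is a minimal sketch of how a page-number candidate might be scored. It is not the Functional Extension Parser's actual rule base, which the abstract does not describe. The `TextBlock` type, the three membership functions, their thresholds, and the use of `min` as the fuzzy AND are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One OCR text block. Hypothetical structure, for illustration only."""
    text: str
    y_center: float  # vertical centre: 0.0 = top of page, 1.0 = bottom


def near_page_edge(y: float, margin: float = 0.15) -> float:
    """Membership in 'near top or bottom edge': 1 at the edge, 0 beyond margin."""
    distance = min(y, 1.0 - y)
    return max(0.0, 1.0 - distance / margin)


def looks_numeric(text: str) -> float:
    """Membership in 'looks like a number': share of digit characters.

    A graded value tolerates OCR errors such as '1Z' for '12'.
    """
    stripped = text.strip()
    if not stripped:
        return 0.0
    return sum(ch.isdigit() for ch in stripped) / len(stripped)


def is_short(text: str, max_len: int = 6) -> float:
    """Membership in 'short string': 1 up to max_len, then falling to 0."""
    n = len(text.strip())
    if n == 0:
        return 0.0
    if n <= max_len:
        return 1.0
    return max(0.0, 1.0 - (n - max_len) / max_len)


def page_number_score(block: TextBlock) -> float:
    """Rule: near edge AND numeric AND short, with min as the fuzzy AND (assumed)."""
    return min(
        near_page_edge(block.y_center),
        looks_numeric(block.text),
        is_short(block.text),
    )


if __name__ == "__main__":
    for block in (
        TextBlock("42", 0.97),
        TextBlock("1Z", 0.0),
        TextBlock("Chapter 12 begins", 0.5),
    ):
        print(repr(block.text), round(page_number_score(block), 2))
```

Under these assumed memberships, a clean number close to the page edge scores high, a partly misrecognised one receives an intermediate score rather than being rejected outright, and body text in the middle of the page scores zero.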

Author information

Authors and Affiliations

Department for Digitisation and Digital Preservation, University of Innsbruck, Innsbruck, Austria
Lukas Gander
Center for Information and Language Processing, University of Munich, Munich, Germany
Ulrich Reffle & Christoph Ringlstetter
Austrian National Library, Wien, Austria
Sven Schlarb
Centrum für Informations- und Sprachverarbeitung (CIS) of the LMU, University of Munich, Munich, Germany
Klaus Schulz
Department for Digitisation and Digital Preservation, University Innsbruck Library, Innsbruck, Austria
Raphael Unterweger

Editor information

Editors and Affiliations

C/ Carlos Asensio Bretones 5 -- 8F, Oviedo, 33009, Spain
Rudolf Seising
Soto del Barco 2, 6°C. Esc.B, Oviedo, 33012, Spain
Veronica Sanz González

Cite this chapter

Gander, L., Reffle, U., Ringlstetter, C., Schlarb, S., Schulz, K., Unterweger, R. (2012). Facing Uncertainty in Digitisation. In: Seising, R., Sanz González, V. (eds) Soft Computing in Humanities and Social Sciences. Studies in Fuzziness and Soft Computing, vol 273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24672-2_10

DOI: https://doi.org/10.1007/978-3-642-24672-2_10
Published: 22 November 2011
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24671-5
Online ISBN: 978-3-642-24672-2
eBook Packages: Engineering, Engineering (R0)