Abstract
In practice, digitisation and text recognition form a processing chain. It starts with scanning the original assets (newspapers, books, manuscripts, etc.) and creating digital images of their pages, which are the basis for producing digital text documents. The core step is Optical Character Recognition (OCR). It is preceded by image enhancement steps, such as deskewing and denoising, and followed by post-processing steps, such as linguistic correction of OCR errors or enrichment of the OCR results, for example by adding layout information and identifying semantic units of a page (e.g. the page number). This paper focuses on the post-processing steps and outlines two selected research areas of the European project IMPACT (IMProving ACcess to Text). First, we present a technology for OCR and information retrieval on historical document collections and discuss the potential use of fuzzy logic. Second, we present the Functional Extension Parser, a software tool that implements a fuzzy rule-based system for detecting and reconstructing some of the main features of a digitised book from the OCR results of the digitised images.
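To make the idea of a fuzzy rule over OCR results more concrete, the following is a minimal Python sketch of how a page-number candidate might be scored. It is not the Functional Extension Parser's actual rule set: the `Token` type, the membership functions `mu_numeric` and `mu_margin`, the OCR confusion table, and the 0.15 margin width are all assumptions made for illustration. The rule reads "a token is a page number if it looks numeric AND lies near the top or bottom edge of the page", with the fuzzy AND taken as the minimum of the two memberships.

```python
# Hypothetical sketch; names and thresholds are illustrative assumptions,
# not taken from the Functional Extension Parser.
from dataclasses import dataclass

# A few letter-for-digit OCR confusions (assumed for this example).
OCR_DIGIT_FIXES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0", "S": "5"})


@dataclass
class Token:
    text: str
    y_rel: float  # vertical centre of the token: 0.0 = top of page, 1.0 = bottom


def mu_numeric(text: str) -> float:
    """Degree to which the text looks like a number after undoing OCR confusions."""
    if not text:
        return 0.0
    fixed = text.translate(OCR_DIGIT_FIXES)
    if not fixed.isdigit():
        return 0.0
    # Each character that had to be repaired lowers the membership.
    repaired = sum(1 for a, b in zip(text, fixed) if a != b)
    return 1.0 - 0.5 * repaired / len(text)


def mu_margin(y_rel: float, width: float = 0.15) -> float:
    """Degree to which a token lies in the top or bottom margin (linear ramp)."""
    distance_to_edge = min(y_rel, 1.0 - y_rel)
    return max(0.0, 1.0 - distance_to_edge / width)


def page_number_score(token: Token) -> float:
    """Fuzzy AND (minimum) of the numeric and margin memberships."""
    return min(mu_numeric(token.text), mu_margin(token.y_rel))


def best_page_number(tokens):
    """Return the highest-scoring candidate and its score, or (None, 0.0)."""
    best = max(tokens, key=page_number_score, default=None)
    if best is None:
        return None, 0.0
    return best, page_number_score(best)
```

For example, given the tokens `Token("Chapter", 0.05)`, `Token("l2", 0.97)` and `Token("1850", 0.5)`, the sketch prefers `"l2"`: it becomes numeric once the misread `l` is treated as `1`, and it sits in the bottom margin, whereas `"1850"` is clearly numeric but lies in the middle of the page. Graded memberships of this kind let a rule tolerate OCR noise instead of rejecting a candidate outright.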
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gander, L., Reffle, U., Ringlstetter, C., Schlarb, S., Schulz, K., Unterweger, R. (2012). Facing Uncertainty in Digitisation. In: Seising, R., Sanz González, V. (eds) Soft Computing in Humanities and Social Sciences. Studies in Fuzziness and Soft Computing, vol 273. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24672-2_10
DOI: https://doi.org/10.1007/978-3-642-24672-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24671-5
Online ISBN: 978-3-642-24672-2
eBook Packages: Engineering, Engineering (R0)