Abstract
Many massive collections of handwritten document images are available in archives and libraries worldwide, yet their textual contents remain practically inaccessible. Automatic transcription generally lacks the accuracy needed for reliable text indexing and search unless the recognition systems are trained with enough data. Creating such training data is expensive and time-consuming. The European Digital Treasures (EDT) project set out to explore crowdsourcing techniques for producing accurate training data. This paper explores crowdsourcing techniques based on Probabilistic Indexes. A crowdsourcing tool was developed in which volunteers could amend incorrectly transcribed words, with confidence measures used to guide and help them in the correction process. In further steps, the corrected data will be used to re-train the Probabilistic Indexing system.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Sánchez, J.A., Vidal, E., Bosch, V. (2022). Effective Crowdsourcing in the EDT Project with Probabilistic Indexes. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_20
DOI: https://doi.org/10.1007/978-3-031-06555-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)