Effective Crowdsourcing in the EDT Project with Probabilistic Indexes

Sánchez, Joan Andreu; Vidal, Enrique; Bosch, Vicente

doi:10.1007/978-3-031-06555-2_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13237)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Many massive collections of handwritten document images are available in archives and libraries worldwide, yet their textual contents remain practically inaccessible. Automatic transcription generally lacks the accuracy needed for reliable text indexing and search unless the recognition systems are trained with enough data. Creating such training data is expensive and time-consuming. The European Digital Treasures (EDT) project intended to explore crowdsourcing techniques for producing accurate training data. This paper explores crowdsourcing techniques based on Probabilistic Indexes. A crowdsourcing tool was developed in which volunteers could amend incorrectly transcribed words, with confidence measures used to guide and help them in the correction process. In further steps, the corrected data will be used to re-train the Probabilistic Indexing system.
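
The idea of using confidence measures to guide volunteers can be illustrated with a minimal sketch. It is not the tool described in the paper: the `WordHypothesis` structure, the `words_to_review` function, the 0.5 threshold, and the sample words and scores are all hypothetical, chosen only to show how low-confidence words might be surfaced first for correction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    """One automatically transcribed word with its confidence (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]


def words_to_review(words, threshold=0.5):
    """Return the words whose confidence is below the threshold, least confident first."""
    flagged = [w for w in words if w.confidence < threshold]
    return sorted(flagged, key=lambda w: w.confidence)


# Made-up example line; the words and scores are illustrative only.
line = [
    WordHypothesis("en", 0.98),
    WordHypothesis("la", 0.95),
    WordHypothesis("uilla", 0.31),
    WordHypothesis("de", 0.99),
    WordHypothesis("Valencia", 0.47),
]

for word in words_to_review(line):
    print(f"{word.text}\t{word.confidence:.2f}")
```

Sorting by ascending confidence is one plausible design choice: it directs volunteer effort to the words most likely to be wrong, so that a limited amount of correction work yields the most useful new training data.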

Author information

Authors and Affiliations

tranSkriptorium AI, Valencia, Spain
Joan Andreu Sánchez, Enrique Vidal & Vicente Bosch

Corresponding author

Correspondence to Joan Andreu Sánchez.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Seiichi Uchida
Boise State University, Boise, ID, USA
Elisa Barney
LIRIS UMR CNRS, Villeurbanne, France
Véronique Eglin

About this paper

Cite this paper

Sánchez, J.A., Vidal, E., Bosch, V. (2022). Effective Crowdsourcing in the EDT Project with Probabilistic Indexes. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_20

DOI: https://doi.org/10.1007/978-3-031-06555-2_20
Published: 18 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06554-5
Online ISBN: 978-3-031-06555-2
eBook Packages: Computer Science, Computer Science (R0)
