Effective Crowdsourcing in the EDT Project with Probabilistic Indexes

  • Conference paper
Document Analysis Systems (DAS 2022)

Abstract

Many massive collections of handwritten document images are available in archives and libraries worldwide, yet their textual contents remain practically inaccessible. Automatic transcription generally lacks the accuracy needed for reliable text indexing and search unless the recognition systems are trained with enough data. Creating training data is expensive and time-consuming. The European Digital Treasures (EDT) project aimed to explore crowdsourcing techniques for producing accurate training data. This paper explores crowdsourcing techniques based on Probabilistic Indexes. A crowdsourcing tool was developed in which volunteers could amend incorrectly transcribed words. Confidence measures were used to guide and help users in the correction process. In further steps, the corrected data will be used to re-train the Probabilistic Indexing system.
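
The abstract does not say how confidence measures guide the volunteers, so the following is only a minimal sketch of one plausible reading: words whose confidence falls below a threshold are flagged and shown to the user least-confident first. The `WordHypothesis` type, the `words_to_review` function, the 0.5 threshold and the sample words are all hypothetical and are not taken from the paper or from the EDT tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHypothesis:
    """One automatically transcribed word (hypothetical structure)."""
    text: str
    confidence: float  # assumed per-word confidence in [0, 1]


def words_to_review(hypotheses, threshold=0.5):
    """Return the words below `threshold`, least confident first.

    The threshold value is an illustrative assumption, not a value
    reported by the authors.
    """
    flagged = [h for h in hypotheses if h.confidence < threshold]
    return sorted(flagged, key=lambda h: h.confidence)


# Made-up example line; texts and confidences are invented for illustration.
line = [
    WordHypothesis("anno", 0.97),
    WordHypothesis("dornini", 0.31),
    WordHypothesis("1742", 0.88),
    WordHypothesis("Valenoia", 0.12),
]

for hyp in words_to_review(line):
    print(f"{hyp.text}\t{hyp.confidence:.2f}")
```

Under this assumption a volunteer would only be asked to look at the flagged words, which keeps the correction effort focused on the places where the recognizer is most likely wrong.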

Notes

  1. https://www.digitaltreasures.eu

  2. http://www.zooniverse.org and http://www.fromthepage.org


Correspondence to Joan Andreu Sánchez.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Sánchez, J.A., Vidal, E., Bosch, V. (2022). Effective Crowdsourcing in the EDT Project with Probabilistic Indexes. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_20

  • DOI: https://doi.org/10.1007/978-3-031-06555-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06554-5

  • Online ISBN: 978-3-031-06555-2

  • eBook Packages: Computer Science (R0)
