Skip to main content

Classification of Untranscribed Handwritten Notarial Documents by Textual Contents

  • Conference paper
  • First Online:
Pattern Recognition and Image Analysis (IbPRIA 2022)

Abstract

Huge amounts of digital page images of important manuscripts are preserved in archives worldwide. The amounts are so large that it is generally unfeasible for archivists to adequately tag most of the documents with the required metadata so as to allow proper organization of the archives and effective exploration by scholars and the general public. The class or “typology” of a document is perhaps the most important tag to be included in the metadata. The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images, by the textual contents of the images. The approach considered is based on “probabilistic indexing”, a relatively novel technology which allows to effectively represent the intrinsic word-level uncertainty exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with promising results.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    See: http://transcriptorium.eu/demots/KWSdemos.

  2. 2.

    In http://prhlt-carabela.prhlt.upv.es/carabela the images of this collection and a PrIx-based search interface are available.

  3. 3.

    https://github.com/PRHLT/docClasifIbPRIA22.

References

  1. Aggarwal, C.C., Zhai, C.: Mining text data. Springer, Boston (2012). https://doi.org/10.1007/978-1-4614-3223-4

    Book  Google Scholar 

  2. Aizawa, A.: An information-theoretic perspective of TF-IDF measures. Inf. Proc. Manag. 39(1), 45–65 (2003)

    Article  Google Scholar 

  3. Bluche, T., et al.: Preparatory KWS experiments for large-scale indexing of a vast medieval manuscript collection in the HIMANIS project. In: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 311–316, November 2017

    Google Scholar 

  4. Vidal, E., et al.: The Carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification. In: 16th ICFHR, September 2020

    Google Scholar 

  5. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256 (2010)

    Google Scholar 

  6. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text classification using machine learning techniques. WSEAS Trans. Comput. 4(8), 966–974 (2005)

    Google Scholar 

  7. Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015)

    Google Scholar 

  8. Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, Carnegie-mellon univ pittsburgh pa dept of computer science (1996)

    Google Scholar 

  9. Khan, A., Baharudin, B., Lee, L.H., Khan, K.: A review of machine learning algorithms for text-documents classification. J. Adv. Inf. Technol. 1(1), 4–20 (2010)

    Google Scholar 

  10. Lang, E., Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic indexing and search for information extraction on handwritten German parish records. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 44–49, August 2018

    Google Scholar 

  11. Manning, C.D., Raghavan, P., Schtze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

    Book  Google Scholar 

  12. Prieto, J.R., Bosch, V., Vidal, E., Alonso, C., Orcero, M.C., Marquez, L.: Textual-content-based classification of bundles of untranscribed manuscript images. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 3162–3169. IEEE (2021)

    Google Scholar 

  13. Prieto, J.R., Vidal, E., Sánchez, J.A., Alonso, C., Garrido, D.: Extracting descriptive words from untranscribed handwritten images. In: Proceedings of the 2022 Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA) (2022)

    Google Scholar 

  14. Puigcerver, J.: A Probabilistic Formulation of Keyword Spotting. Ph.D. thesis, Univ. Politècnica de València (2018)

    Google Scholar 

  15. Romero, V., Toselli, A.H., Vidal, E., Sánchez, J.A., Alonso, C., Marqués, L.: Modern vs diplomatic transcripts for historical handwritten text recognition. In: Cristani, M., Prati, A., Lanz, O., Messelodi, S., Sebe, N. (eds.) ICIAP 2019. LNCS, vol. 11808, pp. 103–114. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30754-7_11

    Chapter  Google Scholar 

  16. Ruder, S.: An overview of gradient descent optimization algorithms 14, 2–3 (2017)

    Google Scholar 

  17. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Proc. Manag. 24(5), 513/523 (1988)

    Google Scholar 

  18. Sánchez, J.A., Romero, V., Toselli, A.H., Villegas, M., Vidal, E.: A set of benchmarks for handwritten text recognition on historical documents. Pattern Recogn. 94, 122–134 (2019)

    Article  Google Scholar 

  19. Toselli, A., Romero, V., Vidal, E., Sánchez, J.: Making two vast historical manuscript collections searchable and extracting meaningful textual features through large-scale probabilistic indexing. In: 15th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2019)

    Google Scholar 

  20. Toselli, A.H., Vidal, E., Puigcerver, J., Noya-García, E.: Probabilistic multi-word spotting in handwritten text images. Pattern Anal. Appl. 22(1), 23–32 (2018). https://doi.org/10.1007/s10044-018-0742-z

    Article  MathSciNet  Google Scholar 

  21. Toselli, A.H., Vidal, E., Romero, V., Frinken, V.: HMM word graph based keyword spotting in handwritten document images. Inf. Sci. 370–371, 497–518 (2016)

    Article  Google Scholar 

  22. Vidal, E., Toselli, A.H., Puigcerver, J.: A probabilistic framework for lexicon-based keyword spotting in handwritten text images. Technical report, UPV (2017)

    Google Scholar 

Download references

Acknowledgments

Work partially supported by the research grants: Ministerio de Ciencia Innovación y Universidades “DocTIUM” (RTI2018-095645-B-C22), Generalitat Valenciana under project DeepPattern (PROMETEO/2019/121) and PID2020-116813RB-I00a funded by MCIN/AEI/ 10.13039/501100011033. The second author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jose Ramón Prieto .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Flores, J.J., Prieto, J.R., Garrido, D., Alonso, C., Vidal, E. (2022). Classification of Untranscribed Handwritten Notarial Documents by Textual Contents. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-04881-4_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04880-7

  • Online ISBN: 978-3-031-04881-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics