Abstract
Archives worldwide preserve huge numbers of digital page images of important manuscripts. The volume is so large that archivists generally cannot tag most of the documents with the metadata needed for proper organization of the archives and effective exploration by scholars and the general public. The class or “typology” of a document is perhaps the most important tag to include in the metadata. The technical problem is the automatic classification of documents, each consisting of a set of untranscribed handwritten text images, by the textual contents of those images. The approach considered is based on “probabilistic indexing”, a relatively novel technology that effectively represents the intrinsic word-level uncertainty exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with promising results.
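As an illustration of how word-level uncertainty can feed a text classifier without any transcript, the following minimal sketch assumes that a probabilistic index lists word hypotheses per document together with a relevance probability, and that the expected number of occurrences of a word can be estimated by summing those probabilities. The data layout, the estimator, and all names (`PRIX_ENTRIES`, `expected_counts`, `normalized_features`) are hypothetical and chosen for illustration; they are not taken from the chapter.

```python
from collections import defaultdict

# Hypothetical probabilistic-index entries:
# (document_id, word_hypothesis, relevance_probability).
PRIX_ENTRIES = [
    ("doc_a", "testamento", 0.9),
    ("doc_a", "testamento", 0.4),
    ("doc_a", "heredero", 0.7),
    ("doc_b", "poder", 0.8),
    ("doc_b", "testamento", 0.1),
]


def expected_counts(entries):
    """Sum relevance probabilities per (document, word).

    Assumption: the sum of probabilities approximates the expected
    number of times the word is written in the document.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for doc_id, word, prob in entries:
        counts[doc_id][word] += prob
    return {doc_id: dict(words) for doc_id, words in counts.items()}


def normalized_features(word_counts, vocabulary):
    """Turn expected counts into a relative-frequency vector over a fixed vocabulary."""
    total = sum(word_counts.values())
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [word_counts.get(word, 0.0) / total for word in vocabulary]


if __name__ == "__main__":
    counts = expected_counts(PRIX_ENTRIES)
    vocabulary = sorted({word for _, word, _ in PRIX_ENTRIES})
    for doc_id in sorted(counts):
        print(doc_id, normalized_features(counts[doc_id], vocabulary))
```

Vectors of this kind could then be passed to any standard text classifier; which features and classifier the chapter actually uses is not stated in the abstract.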
Notes
- 2. The images of this collection and a PrIx-based search interface are available at http://prhlt-carabela.prhlt.upv.es/carabela.
Acknowledgments
Work partially supported by the research grants: Ministerio de Ciencia Innovación y Universidades “DocTIUM” (RTI2018-095645-B-C22), Generalitat Valenciana under project DeepPattern (PROMETEO/2019/121) and PID2020-116813RB-I00a funded by MCIN/AEI/10.13039/501100011033. The second author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Flores, J.J., Prieto, J.R., Garrido, D., Alonso, C., Vidal, E. (2022). Classification of Untranscribed Handwritten Notarial Documents by Textual Contents. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_2
DOI: https://doi.org/10.1007/978-3-031-04881-4_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04880-7
Online ISBN: 978-3-031-04881-4
eBook Packages: Computer Science, Computer Science (R0)