Abstract
Archives around the world hold vast uncatalogued series of image bundles of digitized historical manuscripts containing, among other things, notarial records, also known as “deeds” or “acts”. One of the first steps in providing metadata that describes the contents of those bundles is to segment them into their individual deeds. Even if deeds are page-aligned, as in the bundles considered in the present work, this is a time-consuming task, often prohibitive given the huge scale of the manuscript series involved. Unlike traditional Layout Analysis methods for page-level segmentation, our approach goes beyond the realm of a single-page image, providing consistent deed detection results on full bundles. This is achieved in two tightly integrated steps: first, the probabilities that each bundle image is an “initial”, “middle” or “final” page of a deed are estimated, and then an optimal sequence of page labels is computed at the whole-bundle level. Empirical results are reported which show that this approach achieves almost perfect segmentation of bundles of a massive Spanish series of historical notarial records.
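The second step described above can be illustrated with a small dynamic-programming sketch. It is not the implementation used in this work, only a minimal Viterbi-style decoder written under explicit assumptions: every deed spans at least two pages and follows the pattern initial, zero or more middle, final; all allowed label transitions are treated as equally likely; and the per-page posteriors are given as plain dictionaries. The names `decode_bundle`, `split_into_deeds` and `ALLOWED` are hypothetical.

```python
import math

LABELS = ("I", "M", "F")  # initial, middle, final page of a deed

# Assumed label grammar (not taken from the paper): a deed is "I", then zero
# or more "M", then "F"; the page after an "F" must start a new deed.
ALLOWED = {"I": ("M", "F"), "M": ("M", "F"), "F": ("I",)}


def decode_bundle(posteriors):
    """Return the most probable consistent label sequence for a bundle.

    posteriors: list with one dict per page, mapping each label in LABELS
    to its estimated posterior probability for that page.
    """
    if not posteriors:
        return []
    neg_inf = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else neg_inf

    # Assumption: the first page of a bundle must be an initial page.
    score = {lab: (logp(posteriors[0][lab]) if lab == "I" else neg_inf)
             for lab in LABELS}
    back = []  # back[t][lab] = best predecessor label of lab at page t + 1
    for page in posteriors[1:]:
        new_score, pointers = {}, {}
        for lab in LABELS:
            preds = [p for p in LABELS if lab in ALLOWED[p]]
            best_prev = max(preds, key=lambda p: score[p])
            new_score[lab] = score[best_prev] + logp(page[lab])
            pointers[lab] = best_prev
        score = new_score
        back.append(pointers)

    # Assumption: the last page of a bundle must be a final page.
    if score["F"] == neg_inf:
        raise ValueError("no label sequence satisfies the assumed grammar")
    labels = ["F"]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels


def split_into_deeds(labels):
    """Turn a consistent label sequence into (first_page, last_page) spans."""
    deeds, start = [], 0
    for i, lab in enumerate(labels):
        if lab == "I":
            start = i
        elif lab == "F":
            deeds.append((start, i))
    return deeds
```

For instance, if the third page of a five-page bundle is locally most likely a middle page but the second page is confidently a final page, the decoder relabels the third page as initial, because a middle page cannot follow a final one under the assumed grammar. This is the kind of bundle-level consistency that independent per-page decisions cannot provide.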
Notes
- 1.
- 2. To avoid cumbersome equations, we will abuse the notation and index the elements of a sequence of sequences with a plain, rather than parenthesized, superindex. That is, we will write \(D^k\), rather than \(D^{(k)}\).
- 3. Following time-honored tradition in signal processing and automatic speech recognition, the term posteriorgram is used for this type of (variable-length) sequence of posterior probability vectors.
- 4. The images of this collection and a search interface based on Probabilistic Indexing are available at http://prhlt-carabela.prhlt.upv.es/carabela
Acknowledgements
This work was partially supported by the following research grants: the SimancasSearch project, Grant PID2020-116813RB-I00a, funded by MCIN/AEI/10.13039/501100011033; and ValgrAI - Valencian Graduate School and Research Network of Artificial Intelligence and the Generalitat Valenciana, co-funded by the European Union. The second author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010. The third author’s work is supported by a María Zambrano grant from the Spanish Ministerio de Universidades and the European Union NextGenerationEU/PRTR.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Prieto, J.R., Becerra, D., Toselli, A.H., Alonso, C., Vidal, E. (2023). Segmentation of Large Historical Manuscript Bundles into Multi-page Deeds. In: Pertusa, A., Gallego, A.J., Sánchez, J.A., Domingues, I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2023. Lecture Notes in Computer Science, vol 14062. Springer, Cham. https://doi.org/10.1007/978-3-031-36616-1_10
DOI: https://doi.org/10.1007/978-3-031-36616-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36615-4
Online ISBN: 978-3-031-36616-1
eBook Packages: Computer Science, Computer Science (R0)