Segmentation of Large Historical Manuscript Bundles into Multi-page Deeds

  • Conference paper
Pattern Recognition and Image Analysis (IbPRIA 2023)

Abstract

Archives around the world hold vast uncatalogued series of image bundles of digitized historical manuscripts containing, among other items, notarial records, also known as “deeds” or “acts”. One of the first steps toward providing metadata that describe the contents of those bundles is to segment them into their individual deeds. Even if deeds are page-aligned, as in the bundles considered in the present work, this is a time-consuming task, often prohibitive given the huge scale of the manuscript series involved. Unlike traditional Layout Analysis methods for page-level segmentation, our approach goes beyond the realm of a single-page image, providing consistent deed detection results on full bundles. This is achieved in two tightly integrated steps: first, the probabilities that each bundle image is an “initial”, “middle” or “final” page of a deed are estimated; then, an optimal sequence of page labels is computed at the whole-bundle level. The reported empirical results show that this approach achieves almost perfect segmentation of bundles from a massive Spanish series of historical notarial records.
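The second step described in the abstract can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: it is a generic Viterbi-style decoder written under assumptions that the abstract does not state, namely that every deed spans at least two pages (one "I" page, any number of "M" pages, one "F" page), that a bundle starts with "I" and ends with "F", and that all allowed transitions are equally likely. The names `decode_bundle`, `deeds_from_labels` and `ALLOWED` are hypothetical.

```python
import math

LABELS = ("I", "M", "F")  # initial, middle, final page of a deed

# Assumed label grammar (not taken from the paper): a deed is I, then
# zero or more M, then F, and the next deed starts right after F.
ALLOWED = {
    "I": ("M", "F"),
    "M": ("M", "F"),
    "F": ("I",),
}


def _log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else -math.inf


def decode_bundle(posteriors):
    """Return the most probable consistent label sequence for one bundle.

    posteriors: one dict per page image, mapping "I", "M" and "F" to the
    estimated probability that the page has that label.
    """
    if len(posteriors) < 2:
        raise ValueError("at least two pages are needed under the I...F assumption")

    # The first page of a bundle is assumed to start a deed.
    score = {lab: -math.inf for lab in LABELS}
    score["I"] = _log(posteriors[0]["I"])
    back = []  # back[t][cur] = best previous label for page t + 1

    for page in posteriors[1:]:
        new_score = {lab: -math.inf for lab in LABELS}
        pointers = {}
        for prev in LABELS:
            if score[prev] == -math.inf:
                continue  # no consistent path ends in this label
            for cur in ALLOWED[prev]:
                cand = score[prev] + _log(page[cur])
                if cand > new_score[cur]:
                    new_score[cur] = cand
                    pointers[cur] = prev
        score = new_score
        back.append(pointers)

    # The last page of a bundle is assumed to close a deed.
    if score["F"] == -math.inf:
        raise ValueError("no consistent labelling has non-zero probability")

    labels = ["F"]
    for pointers in reversed(back):
        labels.append(pointers[labels[-1]])
    labels.reverse()
    return labels


def deeds_from_labels(labels):
    """Turn a consistent label sequence into (first_page, last_page) spans."""
    deeds, start = [], 0
    for i, lab in enumerate(labels):
        if lab == "I":
            start = i
        elif lab == "F":
            deeds.append((start, i))
    return deeds
```

Decoding at the bundle level matters when the per-page maxima are mutually inconsistent. For instance, if the locally most probable labels for four pages are I, F, M, F, the third label cannot follow the second under the grammar assumed above, so the decoder has to pick the best sequence among the consistent ones (here I, M, M, F or I, F, I, F).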

Notes

  1. https://deeds.library.utoronto.ca/cartularies/0249.

  2. To avoid cumbersome equations, we will abuse the notation and index the elements of a sequence of sequences with a plain, rather than parenthesized, superscript. That is, we will write \(D^k\), rather than \(D^{(k)}\).

  3. Following time-honored tradition in signal processing and automatic speech recognition, the term posteriorgram is used for this type of (variable-length) sequence of posterior probability vectors.

  4. The images of this collection and a search interface based on Probabilistic Indexing are available at http://prhlt-carabela.prhlt.upv.es/carabela.

Acknowledgements

This work was partially supported by the following research grants: the SimancasSearch project, Grant PID2020-116813RB-I00a, funded by MCIN/AEI/10.13039/501100011033; and ValgrAI (Valencian Graduate School and Research Network of Artificial Intelligence) and the Generalitat Valenciana, co-funded by the European Union. The second author’s work was partially supported by the Universitat Politècnica de València under grant FPI-I/SP20190010. The third author’s work is supported by a María Zambrano grant from the Spanish Ministerio de Universidades and the European Union NextGenerationEU/PRTR.

Author information

Corresponding author

Correspondence to Jose Ramón Prieto.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Prieto, J.R., Becerra, D., Toselli, A.H., Alonso, C., Vidal, E. (2023). Segmentation of Large Historical Manuscript Bundles into Multi-page Deeds. In: Pertusa, A., Gallego, A.J., Sánchez, J.A., Domingues, I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2023. Lecture Notes in Computer Science, vol 14062. Springer, Cham. https://doi.org/10.1007/978-3-031-36616-1_10

  • DOI: https://doi.org/10.1007/978-3-031-36616-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-36615-4

  • Online ISBN: 978-3-031-36616-1

  • eBook Packages: Computer Science, Computer Science (R0)
