Date Recognition in Historical Parish Records

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13639)

Abstract

In Northern Europe, parish records provide centuries of lineage information, useful not only for settling inheritance disputes, but also for studying hereditary diseases, social mobility, etc. To obtain lineage information from scans of parish records, the key items to extract are dates: birth dates (of children and their parents) and dates of baptisms. We present a new dataset of birth dates from Danish parish records and use it to benchmark different approaches to handwritten date recognition, some based on classification and some based on transduction. We evaluate these approaches across several experimental protocols and different segmentation strategies. A state-of-the-art transformer-based transduction model exhibits lower error rates than image classifiers in most scenarios. The image classifiers can nevertheless offer a compelling trade-off between accuracy and computational resource requirements.

Supported by Novo Nordisk Foundation (grant NNF 20SA0066568).

L. C. Piqueras, C. Fierro, J. F. Lotz, and P. Rust contributed equally.

Notes

  1. An example page can be found at: https://www.sa.dk/ao-soegesider/da/billedviser?epid=17125564#167405,28108453.

  2. Future work includes the integration of other columns containing dates in our training set.

  3. github.com/coastalcph/mgr-birthdates.

  4. https://pytorch.org/vision/stable/models.html.

  5. https://github.com/microsoft/unilm/tree/master/trocr#fine-tuning-and-evaluation.

  6. https://github.com/NVIDIA/apex.

Corresponding author

Correspondence to Laura Cabello Piqueras.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Piqueras, L.C. et al. (2022). Date Recognition in Historical Parish Records. In: Porwal, U., Fornés, A., Shafait, F. (eds) Frontiers in Handwriting Recognition. ICFHR 2022. Lecture Notes in Computer Science, vol 13639. Springer, Cham. https://doi.org/10.1007/978-3-031-21648-0_4

  • DOI: https://doi.org/10.1007/978-3-031-21648-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21647-3

  • Online ISBN: 978-3-031-21648-0

  • eBook Packages: Computer Science (R0)
