Abstract
In Northern Europe, parish records provide centuries of lineage information, useful not only for settling inheritance disputes, but also for studying hereditary diseases, social mobility, etc. The key information to extract from scans of parish records to obtain lineage information is dates: birth dates (of children and their parents) and dates of baptisms. We present a new dataset of birth dates from Danish parish records and use it to benchmark different approaches to handwritten date recognition, some based on classification and some based on transduction. We evaluate these approaches across several experimental protocols and different segmentation strategies. A state-of-the-art transformer-based transduction model exhibits lower error rates than image classifiers in most scenarios. The image classifiers can nevertheless offer a compelling trade-off in terms of accuracy and computational resource requirements.
Supported by Novo Nordisk Foundation (grant NNF 20SA0066568).
L. C. Piqueras, C. Fierro, J. F. Lotz, and P. Rust—Equal Contribution.
Notes
1. An example page can be found at: https://www.sa.dk/ao-soegesider/da/billedviser?epid=17125564#167405,28108453.
2. Future work includes the integration of other columns containing dates in our training set.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Piqueras, L.C. et al. (2022). Date Recognition in Historical Parish Records. In: Porwal, U., Fornés, A., Shafait, F. (eds) Frontiers in Handwriting Recognition. ICFHR 2022. Lecture Notes in Computer Science, vol 13639. Springer, Cham. https://doi.org/10.1007/978-3-031-21648-0_4
DOI: https://doi.org/10.1007/978-3-031-21648-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21647-3
Online ISBN: 978-3-031-21648-0
eBook Packages: Computer Science (R0)