Document Layout Analysis with Deep Learning and Heuristics

Authors:

Vahid Rezanezhad,

Konstantin Baierer,

Clemens Neudecker

HIP '23: Proceedings of the 7th International Workshop on Historical Document Imaging and Processing

Pages 73 - 78

https://doi.org/10.1145/3604951.3605513

Published: 25 August 2023

Abstract

The automated yet highly accurate layout analysis (segmentation) of historical document images remains a key challenge for the improvement of Optical Character Recognition (OCR) results. But historical documents exhibit a wide array of features that disturb layout analysis, such as multiple columns, drop capitals and illustrations, skewed or curved text lines, noise, annotations, etc. We present a document layout analysis (DLA) system for historical documents implemented by pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. Our system can detect more layout classes (e.g. initials, marginals) and achieves higher accuracy than competitive approaches. We describe the algorithm, the different models and how they were trained and discuss our results in comparison to the state-of-the-art on the basis of three historical document datasets.

Cited By

Kordon F, Weichselbaumer N, Herz R, van der Loop J, Mossman S, Potten E, Seuret M, Mayr M, Wu F, Christlein V. (2024). fang: Fast Annotation of Glyphs in Historical Printed Documents. Document Analysis Systems, 377-392. 10.1007/978-3-031-70442-0_23. Online publication date: 30-Aug-2024
https://dl.acm.org/doi/10.1007/978-3-031-70442-0_23

Index Terms

Document Layout Analysis with Deep Learning and Heuristics
1. Applied computing
  1. Document management and text processing
    1. Document capture
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published In

HIP '23: Proceedings of the 7th International Workshop on Historical Document Imaging and Processing

August 2023

117 pages

ISBN: 9798400708411

DOI: 10.1145/3604951

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

BKM

Conference

HIP '23: 7th International Workshop on Historical Document Imaging and Processing

August 25 - 26, 2023

San Jose, CA, USA

Acceptance Rates

Overall Acceptance Rate 52 of 90 submissions, 58%
