DOI:10.1145/3604951.3605513
research-article

Document Layout Analysis with Deep Learning and Heuristics

Published: 25 August 2023

Abstract

Automated yet highly accurate layout analysis (segmentation) of historical document images remains a key challenge for improving Optical Character Recognition (OCR) results. Historical documents, however, exhibit a wide array of features that disturb layout analysis, such as multiple columns, drop capitals and illustrations, skewed or curved text lines, noise, and annotations. We present a document layout analysis (DLA) system for historical documents that performs pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect marginals and to determine the reading order of text regions. Our system can detect more layout classes (e.g. initials, marginals) and achieves higher accuracy than competing approaches. We describe the algorithm and the different models, explain how they were trained, and discuss our results in comparison to the state of the art on the basis of three historical document datasets.
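The abstract does not spell out the reading-order heuristic, so the following is only a generic illustration of what such a heuristic can look like, not the authors' method. It assumes text regions arrive as axis-aligned bounding boxes `(x0, y0, x1, y1)` in pixel coordinates; the `Box` type, the function names and the 50% overlap threshold are all hypothetical choices made for this sketch.

```python
from typing import List, Tuple

# Hypothetical region format: (x0, y0, x1, y1) in pixels, origin at top-left.
Box = Tuple[int, int, int, int]


def horizontal_overlap(a: Box, b: Box) -> int:
    """Width in pixels shared by the x-extents of two boxes (0 if disjoint)."""
    return max(0, min(a[2], b[2]) - max(a[0], b[0]))


def reading_order(boxes: List[Box]) -> List[Box]:
    """Order regions column by column, top to bottom within each column.

    Assumption (not from the paper): a region joins an existing column when
    it overlaps that column's first region horizontally by more than half
    the width of the narrower of the two.
    """
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):  # scan left to right
        for column in columns:
            ref = column[0]
            narrower = min(box[2] - box[0], ref[2] - ref[0])
            if horizontal_overlap(box, ref) > 0.5 * narrower:
                column.append(box)
                break
        else:
            columns.append([box])  # no matching column: start a new one
    # Columns left to right, then regions top to bottom inside each column.
    columns.sort(key=lambda col: min(b[0] for b in col))
    return [b for col in columns for b in sorted(col, key=lambda b: b[1])]


if __name__ == "__main__":
    # Two-column page with two regions per column, given in scrambled order.
    page = [(520, 400, 980, 900), (20, 500, 480, 900),
            (520, 50, 980, 380), (20, 50, 480, 480)]
    for region in reading_order(page):
        print(region)
```

A purely geometric rule like this breaks down on headings that span several columns or on marginal notes, which is one reason a practical system would combine it with the region classes predicted by the segmentation model.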

Cited By

  • (2024) fang: Fast Annotation of Glyphs in Historical Printed Documents. Document Analysis Systems, 377–392. https://doi.org/10.1007/978-3-031-70442-0_23. Online publication date: 30-Aug-2024

Published In

HIP '23: Proceedings of the 7th International Workshop on Historical Document Imaging and Processing
August 2023
117 pages
ISBN:9798400708411
DOI:10.1145/3604951
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Document layout analysis
  2. Reading order detection
  3. Segmentation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • BKM

Conference

HIP '23

Acceptance Rates

Overall Acceptance Rate 52 of 90 submissions, 58%

Article Metrics

  • Downloads (Last 12 months): 130
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 07 Mar 2025
