A Hybrid Approach to Document Layout Analysis for Heterogeneous Document Images

Zhong, Zhuoyao; Wang, Jiawei; Sun, Haiqing; Hu, Kai; Zhang, Erhan; Sun, Lei; Huo, Qiang

doi:10.1007/978-3-031-41734-4_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14191)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

We present a new hybrid document layout analysis approach that simultaneously detects graphical page objects, groups text-lines into text regions according to reading order, and recognizes the logical roles of text regions in heterogeneous document images. For graphical page object detection, we use DINO, a state-of-the-art Transformer-based object detection model, as a new graphical page object detector to detect tables, figures, and (displayed) formulas in a top-down manner. Furthermore, we introduce a new bottom-up text region detection model that groups text-lines located outside graphical page objects into text regions according to reading order and recognizes the logical role of each text region using both visual and textual features. Experimental results show that our bottom-up text region detection model achieves higher localization and logical role classification accuracy than previous top-down methods. Moreover, in addition to the locations of text regions, our approach directly outputs the reading order of the text-lines in each text region. The state-of-the-art results obtained on DocLayNet and PubLayNet demonstrate the effectiveness of our approach.
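The hybrid flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how text-lines are excluded from graphical objects or how grouping decisions are represented. The sketch assumes that detector outputs are already available as axis-aligned boxes, that a text-line counts as "inside" a graphical object when at least half of its area is covered (a hypothetical threshold), and that the grouping model emits one successor index per text-line (a hypothetical representation). All function and parameter names are illustrative.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def inside_fraction(line: Box, obj: Box) -> float:
    """Fraction of the text-line's area that is covered by the object box."""
    ix0, iy0 = max(line[0], obj[0]), max(line[1], obj[1])
    ix1, iy1 = min(line[2], obj[2]), min(line[3], obj[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (line[2] - line[0]) * (line[3] - line[1])
    return inter / area if area > 0 else 0.0


def lines_outside_objects(lines: List[Box], objects: List[Box],
                          threshold: float = 0.5) -> List[int]:
    """Indices of text-lines not mostly covered by any graphical page object.

    The 0.5 threshold is an assumption made for this sketch.
    """
    return [i for i, ln in enumerate(lines)
            if all(inside_fraction(ln, ob) < threshold for ob in objects)]


def group_by_successor(kept: List[int],
                       successor: Dict[int, Optional[int]]) -> List[List[int]]:
    """Build text regions as ordered chains of text-line indices.

    Assumes each kept line has at most one predecessor and that the links
    contain no cycles; lines on a pure cycle would have no head and be skipped.
    """
    kept_set = set(kept)
    # Lines that are the successor of another kept line cannot start a region.
    has_pred = {s for i, s in successor.items()
                if i in kept_set and s in kept_set}
    regions: List[List[int]] = []
    for head in kept:
        if head in has_pred:
            continue
        chain: List[int] = []
        cur: Optional[int] = head
        while cur is not None and cur in kept_set and cur not in chain:
            chain.append(cur)  # chain order is the reading order in the region
            cur = successor.get(cur)
        regions.append(chain)
    return regions


# Toy page: three body text-lines and one label that sits inside a figure.
lines = [(10, 10, 100, 20), (10, 22, 100, 32), (10, 60, 100, 70), (120, 15, 180, 25)]
figures = [(110, 5, 200, 50)]
kept = lines_outside_objects(lines, figures)
regions = group_by_successor(kept, {0: 1, 1: None, 2: None})
print(kept, regions)
```

In this toy input the fourth text-line lies entirely within the figure box and is dropped before grouping, while the first two lines are chained into one region and the third forms a region of its own. Logical role classification, which the abstract says uses both visual and textual features, is left out of the sketch.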

J. Wang, H. Sun, K. Hu and E. Zhang—This work was done when Jiawei Wang, Haiqing Sun, Kai Hu and Erhan Zhang were interns in MMI Group, Microsoft Research Asia, Beijing, China.

Author information

Authors and Affiliations

Microsoft Research Asia, Beijing, China
Zhuoyao Zhong, Jiawei Wang, Haiqing Sun, Kai Hu, Erhan Zhang, Lei Sun & Qiang Huo
Department of EEIS, University of Science and Technology of China, Hefei, China
Jiawei Wang & Kai Hu
School of Software and Microelectronics, Peking University, Beijing, China
Haiqing Sun & Erhan Zhang

Corresponding author

Correspondence to Zhuoyao Zhong.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

Cite this paper

Zhong, Z. et al. (2023). A Hybrid Approach to Document Layout Analysis for Heterogeneous Document Images. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_12

DOI: https://doi.org/10.1007/978-3-031-41734-4_12
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41733-7
Online ISBN: 978-3-031-41734-4
eBook Packages: Computer Science, Computer Science (R0)
