Abstract
In document layout analysis, computer-vision-based and natural-language-processing-based methods are employed individually or jointly to enrich feature information and to reinforce object detection. To leverage the visual and textual modalities simultaneously, this paper proposes a hierarchical multimodal (HiM) network that aggregates representative features from multi-source inputs, introducing complementary semantics and non-local context dependencies across granularity scales. Distinct channel and spatial attention mechanisms are adapted to the two modalities: the visual modality is built on a conventional convolutional network, while the textual modality focuses on embedding hierarchical text vectors and positional information. The feature representations from the two modalities are then fused adaptively in a feature pyramid network for subsequent region proposal processing. We have also adapted the PubLayNet database by inserting semi-structured elements and extending the ground-truth annotations through parsing of PDF pages. Extensive experiments on three popular benchmarks, Article Regions, PubLayNet, and DocBank, verify the effectiveness and adaptability of HiM.
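To make the fusion idea in the abstract concrete: channel attention reweights whole feature maps per channel (as in squeeze-and-excitation), while spatial attention reweights individual locations. The following is a minimal NumPy sketch of that pattern, not the authors' implementation; the tensor shapes, sigmoid gating, and the element-wise sum used for "adaptive" fusion are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Squeeze spatial dims via global average pooling,
    # then gate each channel with a sigmoid weight (SE-style).
    weights = sigmoid(feat.mean(axis=(1, 2)))          # shape (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # Gate each spatial location by the sigmoid of its cross-channel mean.
    weights = sigmoid(feat.mean(axis=0))               # shape (H, W)
    return feat * weights[None, :, :]

def fuse(visual, textual):
    # Visual branch passes through channel attention, textual branch
    # through spatial attention; here fusion is a plain element-wise
    # sum (an assumption standing in for the paper's adaptive fusion).
    return channel_attention(visual) + spatial_attention(textual)

rng = np.random.default_rng(0)
v = rng.random((8, 4, 4))   # toy visual feature map
t = rng.random((8, 4, 4))   # toy textual feature map
out = fuse(v, t)
print(out.shape)            # (8, 4, 4)
```

In the paper the fused maps would then feed the feature pyramid network for region proposals; this sketch only shows the per-modality attention step.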
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61806107 and 61702135, by the Shandong Key Laboratory of Wisdom Mine Information Technology, and by the Opening Project of the State Key Laboratory of Digital Publishing Technology.
Ethics declarations
Conflicts of interest
The authors have no competing interests relevant to the content of this article to declare.
About this article
Cite this article
Canhui, X., Yuteng, L., Cao, S. et al. HiM: hierarchical multimodal network for document layout analysis. Appl Intell 53, 24314–24326 (2023). https://doi.org/10.1007/s10489-023-04782-3