Abstract
In document layout analysis, computer-vision-based and natural-language-processing-based methods are employed individually or jointly to enrich feature information and to reinforce object detection. To leverage the visual and textual modalities simultaneously, this paper proposes a hierarchical multimodal (HiM) network that aggregates representative features from multi-source inputs, introducing complementary semantics and non-local context dependencies across granularity scales. Distinct channel and spatial attention mechanisms are adapted to the two modalities: the visual modality is built on a conventional convolutional network, while the textual modality focuses on embedding hierarchical text vectors and positional information. The feature representations from the two modalities are then fused adaptively in a feature pyramid network for subsequent region proposal processing. We have also adapted the PubLayNet database by inserting semi-structured elements and extending the ground-truth annotations through parsing of PDF pages. Extensive experiments on three popular benchmarks, Article Regions, PubLayNet, and DocBank, verify the effectiveness and adaptability of HiM.
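To make the fusion idea in the abstract concrete: channel attention reweights whole feature maps per channel (as in squeeze-and-excitation), while spatial attention reweights individual locations. The following is a minimal NumPy sketch of that pattern, not the authors' implementation; the tensor shapes, sigmoid gating, and the element-wise sum used for "adaptive" fusion are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Squeeze spatial dims via global average pooling,
    # then gate each channel with a sigmoid weight (SE-style).
    weights = sigmoid(feat.mean(axis=(1, 2)))          # shape (C,)
    return feat * weights[:, None, None]

def spatial_attention(feat):
    # Gate each spatial location by the sigmoid of its cross-channel mean.
    weights = sigmoid(feat.mean(axis=0))               # shape (H, W)
    return feat * weights[None, :, :]

def fuse(visual, textual):
    # Visual branch passes through channel attention, textual branch
    # through spatial attention; here fusion is a plain element-wise
    # sum (an assumption standing in for the paper's adaptive fusion).
    return channel_attention(visual) + spatial_attention(textual)

rng = np.random.default_rng(0)
v = rng.random((8, 4, 4))   # toy visual feature map
t = rng.random((8, 4, 4))   # toy textual feature map
out = fuse(v, t)
print(out.shape)            # (8, 4, 4)
```

In the paper the fused maps would then feed the feature pyramid network for region proposals; this sketch only shows the per-modality attention step.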
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61806107 and 61702135, by the Shandong Key Laboratory of Wisdom Mine Information Technology, and by the Opening Project of the State Key Laboratory of Digital Publishing Technology.
Ethics declarations
Conflicts of interest
The authors have no competing interests relevant to the content of this article to declare.
About this article
Cite this article
Canhui, X., Yuteng, L., Cao, S. et al. HiM: hierarchical multimodal network for document layout analysis. Appl Intell 53, 24314–24326 (2023). https://doi.org/10.1007/s10489-023-04782-3