
HiM: hierarchical multimodal network for document layout analysis

Published in Applied Intelligence.

Abstract

In document layout analysis, computer vision and natural language processing methods are employed individually or in combination to enrich feature information and strengthen object detection. To leverage the visual and textual modalities simultaneously, this paper proposes a hierarchical multimodal (HiM) network that aggregates representative features from multi-source inputs, introducing complementary semantics and non-local context dependencies across granularity scales. Distinct channel and spatial attention mechanisms are adapted to each modality: the visual branch builds on a conventional convolutional network, while the textual branch focuses on embedding hierarchical textual vectors and their positions. The feature representations from the two modalities are then fused adaptively in a feature pyramid network for subsequent region proposal processing. We have adapted the PubLayNet database by inserting semi-structured elements and extending the ground-truth annotations through parsing of PDF pages. Extensive experiments on three popular benchmarks, Article Regions, PubLayNet and DocBank, verify the effectiveness and adaptability of HiM.
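The attention-and-fusion pipeline described in the abstract (modality-specific channel and spatial attention, then adaptive fusion ahead of the feature pyramid network) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the SE-style channel gate, the mean/max spatial gate, and the scalar fusion weight `alpha` are hypothetical stand-ins chosen to mirror the squeeze-and-excitation design the paper adapts.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """SE-style gate: global average pool -> bottleneck MLP -> per-channel sigmoid."""
    squeeze = feat.mean(axis=(1, 2))                        # (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))      # (C,)
    return feat * gate[:, None, None]


def spatial_attention(feat):
    """Gate each spatial location using channel-wise mean and max statistics."""
    stats = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    gate = sigmoid(stats.mean(axis=0))                       # (H, W)
    return feat * gate[None, :, :]


def fuse(visual, textual, alpha):
    """Adaptive weighted sum of the two modality features before FPN processing."""
    a = sigmoid(alpha)  # learned scalar in a real model; fixed here
    return a * visual + (1.0 - a) * textual


rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
vis = rng.standard_normal((C, H, W))   # visual backbone feature map
txt = rng.standard_normal((C, H, W))   # textual embeddings rasterized onto the page grid
w1 = rng.standard_normal((C // 2, C))  # bottleneck weights (hypothetical shapes)
w2 = rng.standard_normal((C, C // 2))

fused = fuse(channel_attention(vis, w1, w2), spatial_attention(txt), alpha=0.0)
print(fused.shape)  # (8, 16, 16)
```

In the paper's design the fused map would then feed a feature pyramid network and region proposal stage; here the sketch only shows how the two attention-weighted modalities are combined at a single scale.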


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61806107 and 61702135, Shandong Key Laboratory of Wisdom Mine Information Technology, and the Opening Project of State Key Laboratory of Digital Publishing Technology.

Author information

Corresponding author

Correspondence to Shi Cao.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Canhui, X., Yuteng, L., Cao, S. et al. HiM: hierarchical multimodal network for document layout analysis. Appl Intell 53, 24314–24326 (2023). https://doi.org/10.1007/s10489-023-04782-3
