Abstract
Visual Information Extraction (VIE) is the task of extracting key information from document images such as waybills and receipts. Existing methods typically combine multi-modal information, including textual, visual, and layout features, and achieve promising results on datasets in various domains. However, previous methods treat VIE as a token-level sequence labelling problem and do not explicitly model the relationships between bounding boxes, even though VIE depends heavily on context, especially the relationship between key-value pairs. To address this problem, we propose a dual-level graph attention model that combines coarse-grained and fine-grained information. At the fine-grained token level, we constrain the graph attention network so that each token focuses on its local neighbours within the same bounding box. At the coarse-grained bounding-box level, we encourage further information interaction between bounding boxes and pay more attention to potential key-value pairs. To the best of our knowledge, our method is the first attempt to jointly model the correlation between bounding boxes and tokens under a unified fine-tuning framework. Experimental results show that the proposed method significantly surpasses previous methods. Compared to the strong baseline LayoutLM, our method improves the F1-score by about 3% on both datasets. Our method is an important complement to existing VIE methods.
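The two levels described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (token_mask, masked_softmax, pool_boxes), the toy inputs, and the use of mean pooling to form box-level nodes are all assumptions made for illustration. The sketch shows token-level attention restricted to tokens that share a bounding box, followed by pooling of token features into one vector per box, which a box-level graph attention layer could then operate on.

```python
import math

def token_mask(box_ids):
    """mask[i][j] is True when tokens i and j lie in the same bounding box."""
    return [[a == b for b in box_ids] for a in box_ids]

def masked_softmax(scores, mask):
    """Row-wise softmax over positions where mask is True; others get 0."""
    out = []
    for row, keep in zip(scores, mask):
        # Subtract the row maximum over kept positions for numerical stability.
        m = max(s for s, k in zip(row, keep) if k)
        exps = [math.exp(s - m) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def pool_boxes(features, box_ids):
    """Mean-pool token feature vectors into one vector per bounding box
    (mean pooling is an assumption, chosen only for simplicity)."""
    groups = {}
    for vec, b in zip(features, box_ids):
        groups.setdefault(b, []).append(vec)
    return {b: [sum(col) / len(vecs) for col in zip(*vecs)]
            for b, vecs in groups.items()}

# Hypothetical toy input: four tokens spread over two bounding boxes.
box_ids = [0, 0, 1, 1]
scores = [[0.0] * 4 for _ in range(4)]  # uniform raw attention scores
attn = masked_softmax(scores, token_mask(box_ids))

features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
box_feats = pool_boxes(features, box_ids)
```

With uniform scores, each token splits its attention evenly among the tokens of its own box and assigns zero weight to tokens in other boxes. The pooled vectors in box_feats stand in for the coarse-grained nodes between which box-level attention would be computed.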
Acknowledgement
The research reported in this paper was supported in part by the Shanghai Science and Technology Young Talents Sailing Program Grant 21YF1413900; Shanghai Municipal Science and Technology Committee of Shanghai Outstanding Academic Leaders Plan 20XD1401700; National Key Research and Development Program of China 2021YFC3300602.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, J., Wang, H., Luo, X. (2022). Dual-VIE: Dual-Level Graph Attention Network for Visual Information Extraction. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13629. Springer, Cham. https://doi.org/10.1007/978-3-031-20862-1_31
DOI: https://doi.org/10.1007/978-3-031-20862-1_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20861-4
Online ISBN: 978-3-031-20862-1
eBook Packages: Computer Science (R0)