
Fusion of visual representations for multimodal information extraction from unstructured transactional documents

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The importance of automated document understanding for today’s businesses in terms of speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured documents remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents, as they serve to track the financial flow, and they are accordingly the most studied type. Processing unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information; the impact of their visual composition remains unexplored and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., on complex relation extraction from money transfer order documents). It introduces and experiments with five visual representation approaches (word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their fusion with five strategies (three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement by combining diverse visual information, from which unstructured transactional document understanding benefits differently depending on the context. While the visual representations have little effect when added individually to a pure textual baseline, their fusion yields a relative error reduction of up to 33%.
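
To make the five fusion strategies named above concrete, the following is a minimal PyTorch sketch under our own assumptions (equal-dimensional visual feature vectors, illustrative module names); it is not the authors' implementation.

```python
# Minimal sketch of the five fusion strategies: three basic vector
# operations (sum, element-wise product, concatenation), weighted fusion,
# and attention-based fusion. All names, shapes, and dimensions are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class WeightedFusion(nn.Module):
    """Learns one scalar weight per visual representation, then sums."""

    def __init__(self, num_sources: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_sources))

    def forward(self, feats):
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * f for i, f in enumerate(feats))


class AttentionFusion(nn.Module):
    """Scores each visual representation and forms a convex combination."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):
        stacked = torch.stack(feats, dim=1)            # (batch, sources, dim)
        weights = torch.softmax(self.score(stacked), dim=1)
        return (weights * stacked).sum(dim=1)          # (batch, dim)


# Two visual feature vectors of equal size, e.g., from grid and layout models.
a, b = torch.randn(4, 64), torch.randn(4, 64)
fused_sum = a + b                        # basic operation 1: element-wise sum
fused_mul = a * b                        # basic operation 2: Hadamard product
fused_cat = torch.cat([a, b], dim=-1)    # basic operation 3: concatenation
fused_w = WeightedFusion(2)([a, b])      # weighted fusion
fused_att = AttentionFusion(64)([a, b])  # attention-based fusion
```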


Availability of data and materials

The data, consisting of monetary transactions of real customers, are confidential.

Notes

  1. There also exist semi-structured money transfer orders, which are processed with table-detection algorithms; these are beyond the scope of this article.

  2. In the original study, Oral et al. [2] also use character embeddings alongside pretrained textual word embeddings and report that they improve NER performance only marginally (by 0.2 percentage points). To reduce this complexity, we dropped the character BiLSTM layer from the textual representations in order to better observe the effects of the visual representations.

  3. The term layout embedding is also used by Xu et al. [22]; their approach, which is closer to our grid embedding approach, should not be confused with the one we introduce here.

  4. We also tested an attention-based fusion approach using textual features as the attention context but could not obtain good results (a sketch of this variant is given after these notes).

  5. Although semi-structured documents containing well-formed forms and tables may appear in this domain, the authors state that such documents are not included in this dataset.
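
As a rough illustration of note 4, the text-conditioned attention variant might look like the sketch below; the module name, shapes, and dot-product scoring scheme are our own assumptions, not the paper's implementation.

```python
# Hypothetical sketch of attention-based fusion that uses textual features
# as the attention context (note 4). Names and shapes are assumptions.
import torch
import torch.nn as nn


class TextConditionedFusion(nn.Module):
    """Scores each visual representation against a textual context vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)

    def forward(self, text, feats):
        stacked = torch.stack(feats, dim=1)           # (batch, sources, dim)
        q = self.query(text).unsqueeze(1)             # (batch, 1, dim)
        scores = (q * stacked).sum(-1, keepdim=True)  # dot-product scores
        weights = torch.softmax(scores, dim=1)
        return (weights * stacked).sum(dim=1)         # fused (batch, dim)


# Usage: textual features attend over two visual representations.
text = torch.randn(4, 64)
visual = [torch.randn(4, 64), torch.randn(4, 64)]
fused = TextConditionedFusion(64)(text, visual)
```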

References

  1. Graliński, F., Stanisławek, T., Wróblewska, A., Lipiński, D., Kaliska, A., Rosalska, P., Topolski, B., Biecek, P.: Kleister: A novel task for information extraction involving long documents with complex layout. arXiv preprint arXiv:2003.02356 (2020)

  2. Oral, B., Emekligil, E., Arslan, S., Eryiğit, G.: Information extraction from text intensive and visually rich banking documents. Inform. Process. Manag. (2020). https://doi.org/10.1016/j.ipm.2020.102361

  3. Cristani, M., Bertolaso, A., Scannapieco, S., Tomazzoli, C.: Future paradigms of automated processing of business documents. Int. J. Inf. Manag. 40, 67–75 (2018). https://doi.org/10.1016/j.ijinfomgt.2018.01.010

  4. Chalkidis, I., Androutsopoulos, I., Michos, A.: Extracting contract elements. In: Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law. ICAIL ’17, pp. 19–28. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3086512.3086515

  5. Chalkidis, I., Androutsopoulos, I.: A deep learning approach to contract element extraction. Frontiers in Artificial Intelligence and Applications 302 (Legal Knowledge and Information Systems), 155–164 (2017). https://doi.org/10.3233/978-1-61499-838-9-155

  6. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453 (2013). https://doi.org/10.1109/ICDAR.2013.292

  7. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: International Conference on Document Analysis and Recognition (ICDAR) (2015)

  8. Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: CORD: A consolidated receipt dataset for post-OCR parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)

  9. Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.V.: ICDAR 2019 competition on scanned receipt OCR and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1516–1520 (2019). https://doi.org/10.1109/ICDAR.2019.00244

  10. Jaume, G., Kemal Ekenel, H., Thiran, J.-P.: FUNSD: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 2, pp. 1–6 (2019). https://doi.org/10.1109/ICDARW.2019.10029

  11. Palm, R.B., Winther, O., Laws, F.: CloudScan: A configuration-free invoice analysis system using recurrent neural networks. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 406–413 (2017). https://doi.org/10.1109/ICDAR.2017.74

  12. Sage, C., Aussem, A., Elghazel, H., Eglin, V., Espinas, J.: Recurrent neural network approach for table field extraction in business documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1308–1313 (2019). https://doi.org/10.1109/ICDAR.2019.00211

  13. Sage, C., Aussem, A., Eglin, V., Elghazel, H., Espinas, J.: End-to-end extraction of structured information from business documents with pointer-generator networks. In: Proceedings of the Fourth Workshop on Structured Prediction for NLP, pp. 43–52. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.spnlp-1.6

  14. Santosh, K., Belaid, A.: Document information extraction and its evaluation based on client’s relevance. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 35–39. IEEE (2013)

  15. Santosh, K.: g-DICE: graph mining-based document information content exploitation. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 337–355 (2015). https://doi.org/10.1007/s10032-015-0253-z

  16. Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul, J.B.: Chargrid: Towards understanding 2D documents. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4459–4469. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1476

  17. Denk, T.I., Reisswig, C.: BERTgrid: Contextualized embedding for 2d document representation and understanding. In: Workshop on Document Intelligence at NeurIPS 2019 (2019). https://openreview.net/forum?id=H1gsGaq9US

  18. Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336 (2019). https://doi.org/10.1109/ICDAR.2019.00060

  19. Zhao, X., Niu, E., Wu, Z., Wang, X.: CUTIE: Learning to understand documents with convolutional universal text information extractor. arXiv preprint arXiv:1903.12363 (2019)

  20. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 32–39. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-2005

  21. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-Training of Text and Layout for Document Image Understanding, pp. 1192–1200. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394486.3403172

  22. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2579–2591. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.201

  23. Zhang, P., Xu, Y., Cheng, Z., Pu, S., Lu, J., Qiao, L., Niu, Y., Wu, F.: TRIE: End-to-end text reading and information extraction for document understanding. In: Proceedings of the 28th ACM International Conference on Multimedia. MM ’20, pp. 1413–1422. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394171.3413900

  24. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2145–2158. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://www.aclweb.org/anthology/C18-1182

  25. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. (2020). https://doi.org/10.1109/TKDE.2020.2981314

  26. Weld, H., Huang, X., Long, S., Poon, J., Han, S.C.: A survey of joint intent detection and slot-filling models in natural language understanding. arXiv preprint arXiv:2101.08091 (2021)

  27. Subramani, N., Matton, A., Greaves, M., Lam, A.: A survey of deep learning approaches for OCR and document understanding (2021)

  28. Jiang, H., Bao, Q., Cheng, Q., Yang, D., Wang, L., Xiao, Y.: Complex relation extraction: Challenges and opportunities. arXiv preprint arXiv:2012.04821 (2020)

  29. Sahin, G.G., Emekligil, E., Arslan, S., Ağın, O., Eryiğit, G.: Relation extraction via one-shot dependency parsing on intersentential, higher-order, and nested relations. Turk. J. Electr. Eng. Comput. Sci. 26(2), 830–843 (2018)

  30. Oral, B., Emekligil, E., Arslan, S., Eryiğit, G.: Extracting complex relations from banking documents. In: Proceedings of the Second Workshop on Economics and Natural Language Processing, pp. 1–9. Association for Computational Linguistics, Hong Kong (2019). https://doi.org/10.18653/v1/D19-5101

  31. R, A., Kuanr, A., KR, S.: Developing banking intelligence in emerging markets: systematic review and agenda. Int. J. Inf. Manag. Data Insights 1(2), 100026 (2021). https://doi.org/10.1016/j.jjimei.2021.100026

  32. Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In: 2020 25th International Conference on Pattern Recognition (ICPR) (2020)

  33. Bach, N., Badaskar, S.: A review of relation extraction. Lit. Rev. Lang. Stat. II(2), 1–15 (2007)

  34. Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without labeled text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Springer, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_10

  35. McDonald, R., Pereira, F., Kulick, S., Winters, S., Jin, Y., White, P.: Simple algorithms for complex relation extraction with applications to biomedical IE. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 491–498. Association for Computational Linguistics, Ann Arbor, MI (2005). https://doi.org/10.3115/1219840.1219901

  36. Peng, N., Poon, H., Quirk, C., Toutanova, K., Yih, W.T.: Cross-sentence n-ary relation extraction with graph LSTMs. Trans. Assoc. Comput. Linguist. 5, 101–115 (2017)

  37. Jia, R., Wong, C., Poon, H.: Document-level n-ary relation extraction with multiscale representation learning. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3693–3704. Association for Computational Linguistics, Minneapolis, MN (2019). https://doi.org/10.18653/v1/N19-1370

  38. Song, L., Zhang, Y., Wang, Z., Gildea, D.: N-ary relation extraction using graph-state LSTM. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2226–2235. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1246

  39. Prasojo, R.E., Kacimi, M., Nutt, W.: StuffIE: Semantic tagging of unlabeled facets using fine-grained information extraction. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18, pp. 467–476. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3271812

  40. Takanobu, R., Zhang, T., Liu, J., Huang, M.: A hierarchical framework for relation extraction with reinforcement learning. Proc. AAAI Conf. Artif. Intell. 33(01), 7072–7079 (2019). https://doi.org/10.1609/aaai.v33i01.33017072

  41. Zeng, X., Zeng, D., He, S., Liu, K., Zhao, J.: Extracting relational facts by an end-to-end neural model with copy mechanism. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1047

  42. Sahu, S.K., Christopoulou, F., Miwa, M., Ananiadou, S.: Inter-sentence relation extraction with document-level graph convolutional neural network. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4309–4316. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1423

  43. Xiong, L., Hu, C., Xiong, C., Campos, D., Overwijk, A.: Open domain web keyphrase extraction beyond language modeling. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5175–5184. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1521

  44. Zhang, D., Cao, R., Wu, S.: Information fusion in visual question answering: a survey. Inform. Fusion 52, 268–280 (2019). https://doi.org/10.1016/j.inffus.2019.03.005

  45. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. CoRR arXiv:1802.05365 (2018)

  46. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR arXiv:1810.04805 (2018)

  47. Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172 (2014). https://doi.org/10.1109/ICPR.2014.546

  48. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)

  49. Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3559–3568 (2021). https://doi.org/10.1109/WACV48630.2021.00360

  50. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018). https://doi.org/10.1109/CVPR.2018.00745

Acknowledgements

The authors would like to thank Onur Deniz, Mehmet Yasin Akpınar, Erdem Emekligil, and Mustafa İşbilen for their valuable support.

Funding

This work was funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and by Yapı Kredi Technology through a TUBITAK 1505 (University-Industry Cooperation Support Program) project, Grant No. 5190073.

Author information

Contributions

Both authors contributed equally to methodology, conceptualization, formal analysis, investigation, visualization, and writing (original draft, review, and editing). Berke Oral helped in software and data curation; Gülşen Eryiğit contributed to supervision and funding acquisition.

Corresponding author

Correspondence to Gülşen Eryiğit.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Oral, B., Eryiğit, G. Fusion of visual representations for multimodal information extraction from unstructured transactional documents. IJDAR 25, 187–205 (2022). https://doi.org/10.1007/s10032-022-00399-3

