
Fusion of visual representations for multimodal information extraction from unstructured transactional documents

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

The importance of automated document understanding for today’s businesses in terms of speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured documents remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents, as they serve to track the financial flow, and they are accordingly the most studied type. Processing unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information; the impact of their visual composition remains unexplored and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., on complex relation extraction from money transfer order documents). It introduces and experiments with five visual representation approaches (word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their fusion with five strategies (three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement by combining diverse visual information, from which unstructured transactional document understanding benefits differently depending on the context. While the visual representations have little effect when added individually to a pure textual baseline, their fusion yields a relative error reduction of up to 33%.
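
To make the five fusion strategies named above concrete, the following is a minimal PyTorch sketch under our own assumptions (equal-dimensional visual feature vectors, illustrative module names); it is not the authors' implementation.

```python
# Minimal sketch of the five fusion strategies: three basic vector
# operations (sum, element-wise product, concatenation), weighted fusion,
# and attention-based fusion. All names, shapes, and dimensions are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class WeightedFusion(nn.Module):
    """Learns one scalar weight per visual representation, then sums."""

    def __init__(self, num_sources: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_sources))

    def forward(self, feats):
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * f for i, f in enumerate(feats))


class AttentionFusion(nn.Module):
    """Scores each visual representation and forms a convex combination."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):
        stacked = torch.stack(feats, dim=1)            # (batch, sources, dim)
        weights = torch.softmax(self.score(stacked), dim=1)
        return (weights * stacked).sum(dim=1)          # (batch, dim)


# Two visual feature vectors of equal size, e.g., from grid and layout models.
a, b = torch.randn(4, 64), torch.randn(4, 64)
fused_sum = a + b                        # basic operation 1: element-wise sum
fused_mul = a * b                        # basic operation 2: Hadamard product
fused_cat = torch.cat([a, b], dim=-1)    # basic operation 3: concatenation
fused_w = WeightedFusion(2)([a, b])      # weighted fusion
fused_att = AttentionFusion(64)([a, b])  # attention-based fusion
```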


Availability of data and materials

The data, consisting of monetary transactions of real customers, are confidential.

Notes

  1. There also exist semi-structured money transfer orders, which are processed with table-detection algorithms; these are beyond the scope of this article.

  2. In the original study, Oral et al. [2] also use character embeddings alongside pretrained textual word embeddings and report that they improve NER performance only marginally (by 0.2 percentage points). To reduce this complexity, we dropped the character BiLSTM layer from the textual representations in order to better observe the effects of the visual representations.

  3. The term layout embedding is also used by Xu et al. [22]; their approach, which is closer to our grid embedding approach, should not be confused with the one we introduce here.

  4. We also tested an attention-based fusion approach using textual features as the attention context but could not obtain good results (a sketch of this variant is given after these notes).

  5. Although semi-structured documents containing well-formed forms and tables may appear in this domain, the authors state that such documents are not included in this dataset.
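
As a rough illustration of note 4, the text-conditioned attention variant might look like the sketch below; the module name, shapes, and dot-product scoring scheme are our own assumptions, not the paper's implementation.

```python
# Hypothetical sketch of attention-based fusion that uses textual features
# as the attention context (note 4). Names and shapes are assumptions.
import torch
import torch.nn as nn


class TextConditionedFusion(nn.Module):
    """Scores each visual representation against a textual context vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)

    def forward(self, text, feats):
        stacked = torch.stack(feats, dim=1)           # (batch, sources, dim)
        q = self.query(text).unsqueeze(1)             # (batch, 1, dim)
        scores = (q * stacked).sum(-1, keepdim=True)  # dot-product scores
        weights = torch.softmax(scores, dim=1)
        return (weights * stacked).sum(dim=1)         # fused (batch, dim)


# Usage: textual features attend over two visual representations.
text = torch.randn(4, 64)
visual = [torch.randn(4, 64), torch.randn(4, 64)]
fused = TextConditionedFusion(64)(text, visual)
```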

References

  1. Graliński, F., Stanisławek, T., Wróblewska, A., Lipiński, D., Kaliska, A., Rosalska, P., Topolski, B., Biecek, P.: Kleister: A novel task for information extraction involving long documents with complex layout. arXiv preprint arXiv:2003.02356 (2020)

  2. Oral, B., Emekligil, E., Arslan, S., Eryiğit, G.: Information extraction from text intensive and visually rich banking documents. Inform. Process. Manag. (2020). https://doi.org/10.1016/j.ipm.2020.102361

  3. Cristani, M., Bertolaso, A., Scannapieco, S., Tomazzoli, C.: Future paradigms of automated processing of business documents. Int. J. Inf. Manag. 40, 67–75 (2018). https://doi.org/10.1016/j.ijinfomgt.2018.01.010

  4. Chalkidis, I., Androutsopoulos, I., Michos, A.: Extracting contract elements. In: Proceedings of the 16th Edition of the International Conference on Artificial Intelligence and Law. ICAIL ’17, pp. 19–28. Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3086512.3086515

  5. Chalkidis, I., Androutsopoulos, I.: A deep learning approach to contract element extraction. Frontiers in Artificial Intelligence and Applications 302 (Legal Knowledge and Information Systems), 155–164 (2017). https://doi.org/10.3233/978-1-61499-838-9-155

  6. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453 (2013). https://doi.org/10.1109/ICDAR.2013.292

  7. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: International Conference on Document Analysis and Recognition (ICDAR) (2015)

  8. Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: CORD: A consolidated receipt dataset for post-OCR parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)

  9. Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.V.: ICDAR 2019 competition on scanned receipt OCR and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1516–1520 (2019). https://doi.org/10.1109/ICDAR.2019.00244

  10. Jaume, G., Kemal Ekenel, H., Thiran, J.-P.: FUNSD: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 2, pp. 1–6 (2019). https://doi.org/10.1109/ICDARW.2019.10029

  11. Palm, R.B., Winther, O., Laws, F.: CloudScan: A configuration-free invoice analysis system using recurrent neural networks. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 406–413 (2017). https://doi.org/10.1109/ICDAR.2017.74

  12. Sage, C., Aussem, A., Elghazel, H., Eglin, V., Espinas, J.: Recurrent neural network approach for table field extraction in business documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1308–1313 (2019). https://doi.org/10.1109/ICDAR.2019.00211

  13. Sage, C., Aussem, A., Eglin, V., Elghazel, H., Espinas, J.: End-to-end extraction of structured information from business documents with pointer-generator networks. In: Proceedings of the Fourth Workshop on Structured Prediction for NLP, pp. 43–52. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.spnlp-1.6

  14. Santosh, K., Belaid, A.: Document information extraction and its evaluation based on client’s relevance. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 35–39. IEEE (2013)

  15. Santosh, K.: g-DICE: graph mining-based document information content exploitation. Int. J. Doc. Anal. Recogn. (IJDAR) 18(4), 337–355 (2015). https://doi.org/10.1007/s10032-015-0253-z

  16. Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul, J.B.: Chargrid: Towards understanding 2D documents. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4459–4469. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1476

  17. Denk, T.I., Reisswig, C.: BERTgrid: Contextualized embedding for 2d document representation and understanding. In: Workshop on Document Intelligence at NeurIPS 2019 (2019). https://openreview.net/forum?id=H1gsGaq9US

  18. Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336 (2019). https://doi.org/10.1109/ICDAR.2019.00060

  19. Zhao, X., Niu, E., Wu, Z., Wang, X.: CUTIE: Learning to understand documents with convolutional universal text information extractor. arXiv preprint arXiv:1903.12363 (2019)

  20. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pp. 32–39. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-2005

  21. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-Training of Text and Layout for Document Image Understanding, pp. 1192–1200. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394486.3403172

  22. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., Zhang, M., Zhou, L.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2579–2591. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.201

  23. Zhang, P., Xu, Y., Cheng, Z., Pu, S., Lu, J., Qiao, L., Niu, Y., Wu, F.: TRIE: End-to-end text reading and information extraction for document understanding. In: Proceedings of the 28th ACM International Conference on Multimedia. MM ’20, pp. 1413–1422. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394171.3413900

  24. Yadav, V., Bethard, S.: A survey on recent advances in named entity recognition from deep learning models. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2145–2158. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://www.aclweb.org/anthology/C18-1182

  25. Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. (2020). https://doi.org/10.1109/TKDE.2020.2981314

  26. Weld, H., Huang, X., Long, S., Poon, J., Han, S.C.: A survey of joint intent detection and slot-filling models in natural language understanding. arXiv preprint arXiv:2101.08091 (2021)

  27. Subramani, N., Matton, A., Greaves, M., Lam, A.: A survey of deep learning approaches for OCR and document understanding (2021)

  28. Jiang, H., Bao, Q., Cheng, Q., Yang, D., Wang, L., Xiao, Y.: Complex relation extraction: Challenges and opportunities. arXiv preprint arXiv:2012.04821 (2020)

  29. Sahin, G.G., Emekligil, E., Arslan, S., Ağın, O., Eryiğit, G.: Relation extraction via one-shot dependency parsing on intersentential, higher-order, and nested relations. Turk. J. Electr. Eng. Comput. Sci. 26(2), 830–843 (2018)

  30. Oral, B., Emekligil, E., Arslan, S., Eryiğit, G.: Extracting complex relations from banking documents. In: Proceedings of the Second Workshop on Economics and Natural Language Processing, pp. 1–9. Association for Computational Linguistics, Hong Kong (2019). https://doi.org/10.18653/v1/D19-5101

  31. R, A., Kuanr, A., KR, S.: Developing banking intelligence in emerging markets: systematic review and agenda. Int. J. Inf. Manag. Data Insights 1(2), 100026 (2021). https://doi.org/10.1016/j.jjimei.2021.100026

  32. Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In: 2020 25th International Conference on Pattern Recognition (ICPR) (2020)

  33. Bach, N., Badaskar, S.: A review of relation extraction. Lit. Rev. Lang. Stat. II(2), 1–15 (2007)

  34. Riedel, S., Yao, L., McCallum, A.: Modeling relations and their mentions without labeled text. In: Balcázar, J.L., Bonchi, F., Gionis, A., Sebag, M. (eds.) Machine Learning and Knowledge Discovery in Databases, pp. 148–163. Springer, Berlin, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15939-8_10

  35. McDonald, R., Pereira, F., Kulick, S., Winters, S., Jin, Y., White, P.: Simple algorithms for complex relation extraction with applications to biomedical IE. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pp. 491–498. Association for Computational Linguistics, Ann Arbor, MI (2005). https://doi.org/10.3115/1219840.1219901

  36. Peng, N., Poon, H., Quirk, C., Toutanova, K., Yih, W.T.: Cross-sentence n-ary relation extraction with graph LSTMs. Trans. Assoc. Comput. Linguist. 5, 101–115 (2017)

  37. Jia, R., Wong, C., Poon, H.: Document-level n-ary relation extraction with multiscale representation learning. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3693–3704. Association for Computational Linguistics, Minneapolis, MN (2019). https://doi.org/10.18653/v1/N19-1370

  38. Song, L., Zhang, Y., Wang, Z., Gildea, D.: N-ary relation extraction using graph-state LSTM. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2226–2235. Association for Computational Linguistics, Brussels, Belgium (2018). https://doi.org/10.18653/v1/D18-1246

  39. Prasojo, R.E., Kacimi, M., Nutt, W.: StuffIE: Semantic tagging of unlabeled facets using fine-grained information extraction. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18, pp. 467–476. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3269206.3271812

  40. Takanobu, R., Zhang, T., Liu, J., Huang, M.: A hierarchical framework for relation extraction with reinforcement learning. Proc. AAAI Conf. Artif. Intell. 33(01), 7072–7079 (2019). https://doi.org/10.1609/aaai.v33i01.33017072

  41. Zeng, X., Zeng, D., He, S., Liu, K., Zhao, J.: Extracting relational facts by an end-to-end neural model with copy mechanism. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 506–514. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1047

  42. Sahu, S.K., Christopoulou, F., Miwa, M., Ananiadou, S.: Inter-sentence relation extraction with document-level graph convolutional neural network. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4309–4316. Association for Computational Linguistics, Florence, Italy (2019). https://doi.org/10.18653/v1/P19-1423

  43. Xiong, L., Hu, C., Xiong, C., Campos, D., Overwijk, A.: Open domain web keyphrase extraction beyond language modeling. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5175–5184. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1521

  44. Zhang, D., Cao, R., Wu, S.: Information fusion in visual question answering: a survey. Inform. Fusion 52, 268–280 (2019). https://doi.org/10.1016/j.inffus.2019.03.005

  45. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. CoRR arXiv:1802.05365 (2018)

  46. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR arXiv:1810.04805 (2018)

  47. Kang, L., Kumar, J., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for document image classification. In: 2014 22nd International Conference on Pattern Recognition, pp. 3168–3172 (2014). https://doi.org/10.1109/ICPR.2014.546

  48. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)

  49. Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 3559–3568 (2021). https://doi.org/10.1109/WACV48630.2021.00360

  50. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018). https://doi.org/10.1109/CVPR.2018.00745

Acknowledgements

The authors would like to thank Onur Deniz, Mehmet Yasin Akpınar, Erdem Emekligil, and Mustafa İşbilen for their valuable support.

Funding

This work was funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and by Yapı Kredi Technology through a TUBITAK 1505 (University-Industry Cooperation Support Program) project, Grant No. 5190073.

Author information

Contributions

Both authors contributed equally to methodology, conceptualization, formal analysis, investigation, visualization, and writing (original draft, review, and editing). Berke Oral helped in software and data curation; Gülşen Eryiğit contributed to supervision and funding acquisition.

Corresponding author

Correspondence to Gülşen Eryiğit.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Oral, B., Eryiğit, G. Fusion of visual representations for multimodal information extraction from unstructured transactional documents. IJDAR 25, 187–205 (2022). https://doi.org/10.1007/s10032-022-00399-3

