DOI: 10.1145/3503161.3548321
Research article

CALM: Common-Sense Knowledge Augmentation for Document Image Understanding

Published: 10 October 2022

Abstract

The performance of document image understanding has been significantly improved in recent years by encoding multi-modal information. However, existing works rely heavily on the superficial appearance of the observed data, resulting in counter-intuitive model behavior in many critical cases. To overcome this issue, this paper proposes CALM, a common-sense knowledge augmented model for document image understanding tasks. CALM first produces purified representations of document contents to extract key information and learn common-sense-augmented input representations. Relevant common-sense knowledge is then retrieved from the external ConceptNet knowledge base, and a derived knowledge graph is built to jointly enhance the common-sense reasoning capability of CALM. To further highlight the importance of common-sense knowledge in document image understanding, we propose CS-DVQA, the first question-answering dataset focused on common-sense reasoning over document images, in which questions are answered by considering both document contents and common-sense knowledge. Through extensive evaluation, the proposed CALM approach outperforms state-of-the-art models on three document image understanding tasks: key information extraction (from 85.37 to 86.52), document image classification (from 96.08 to 96.17), and document visual question answering (from 86.72 to 88.03).
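To make the knowledge-retrieval step described in the abstract concrete, the sketch below illustrates one way to pull ConceptNet edges for key terms extracted from a document image and assemble them into a small derived knowledge graph. It is illustrative only, not the authors' implementation: the use of the public ConceptNet REST endpoint, the `requests`/`networkx` packages, and the example key terms are all assumptions.

```python
# Minimal sketch (not the paper's method): given key terms extracted from a
# document image, retrieve related common-sense facts from the public
# ConceptNet API and build a small "derived" knowledge graph that a
# downstream model could reason over.
import requests
import networkx as nx

CONCEPTNET_API = "http://api.conceptnet.io/c/en/{}"  # public REST endpoint

def fetch_edges(term, limit=20):
    """Query ConceptNet for edges involving `term` (English concepts)."""
    url = CONCEPTNET_API.format(term.lower().replace(" ", "_"))
    resp = requests.get(url, params={"limit": limit}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("edges", [])

def build_derived_graph(key_terms):
    """Assemble ConceptNet relations for all key terms into one directed graph."""
    graph = nx.DiGraph()
    for term in key_terms:
        for edge in fetch_edges(term):
            start = edge["start"].get("label", "")
            end = edge["end"].get("label", "")
            rel = edge["rel"].get("label", "")
            if start and end and rel:
                graph.add_edge(start, end, relation=rel,
                               weight=edge.get("weight", 1.0))
    return graph

if __name__ == "__main__":
    # Hypothetical key terms extracted from a receipt-like document image.
    terms = ["invoice", "total", "restaurant"]
    kg = build_derived_graph(terms)
    print(f"{kg.number_of_nodes()} nodes, {kg.number_of_edges()} edges")
```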


Cited By

  • MMVQA. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024, pp. 6243-6251. DOI: 10.24963/ijcai.2024/690. Online publication date: 3 Aug 2024.
  • RealCQA: Scientific Chart Question Answering as a Test-Bed for First-Order Logic. Document Analysis and Recognition - ICDAR 2023, pp. 66-83. DOI: 10.1007/978-3-031-41682-8_5. Online publication date: 21 Aug 2023.

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022

Author Tags

  1. CALM
  2. common-sense knowledge augmentation
  3. conceptnet
  4. document image understanding

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)
