
Towards Reasoning Ability in Scene Text Visual Question Answering

Published: 17 October 2021

Abstract

Work on scene text visual question answering (TextVQA) consistently emphasizes the importance of reasoning over questions and image contents. However, we find that current TextVQA models lack reasoning ability and tend to answer questions by exploiting dataset biases and language priors. Moreover, our observations indicate that recent accuracy improvements in TextVQA come mainly from stronger OCR engines, better pre-training strategies, and additional Transformer layers, rather than from newly proposed networks. In this work, with a focus on reasoning ability, we 1) conduct a module-wise contribution analysis to quantitatively investigate how existing works improve TextVQA accuracy; 2) design a gradient-based explainability method to explore why TextVQA models answer what they answer and to find evidence for their predictions; and 3) perform qualitative experiments to visually analyze models' reasoning ability and to explore the potential reasons behind its weakness.
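The gradient-based explainability method mentioned in point 2) traces a predicted answer's score back to the model's multimodal inputs. The sketch below is a minimal gradient × input attribution over a toy Transformer answer classifier in PyTorch; it illustrates the general family of gradient-based attribution techniques, not the authors' exact method, and every name in it (ToyTextVQA, feats, the dimensions) is a hypothetical stand-in.

```python
# Minimal, illustrative sketch of gradient-based attribution for a
# TextVQA-style model -- NOT the paper's exact method. It scores how much
# each input token feature (question words, OCR tokens, objects)
# contributes to the predicted answer via gradient x input.
import torch
import torch.nn as nn

class ToyTextVQA(nn.Module):
    """Hypothetical stand-in for a Transformer-based TextVQA model: fuses a
    sequence of multimodal token features, then classifies over a fixed
    answer vocabulary."""
    def __init__(self, dim=64, n_answers=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, feats):                        # feats: (batch, tokens, dim)
        fused = self.fuse(feats)
        return self.classifier(fused.mean(dim=1))    # (batch, n_answers)

model = ToyTextVQA().eval()
feats = torch.randn(1, 12, 64, requires_grad=True)   # 12 multimodal tokens

logits = model(feats)
answer = logits.argmax(dim=-1).item()
logits[0, answer].backward()          # backprop the winning answer's logit

# Gradient x input, summed over the feature dimension, yields one
# attribution score per token; large positive scores mark the tokens
# the model treated as evidence for its answer.
attribution = (feats.grad * feats).sum(dim=-1).squeeze(0).detach()
print(f"predicted answer id: {answer}")
print("per-token attribution:", attribution.tolist())
```

Tokens with large positive attribution scores are the ones the model leaned on as evidence; comparing them against the OCR tokens a human would cite is one way to probe whether an answer reflects reasoning or a learned prior.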



    Information

    Published In

    cover image ACM Conferences
    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. TextVQA
    2. explainability method
    3. quantitative and qualitative analysis
    4. reasoning ability

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Shanghai Municipal Science and Technology Major Project
    • Shanghai Science and Technology Innovation Action Plan

    Conference

    MM '21
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2023) VTQAGen: BART-based Generative Model for Visual Text Question Answering. In Proceedings of the 31st ACM International Conference on Multimedia, 9456-9461. https://doi.org/10.1145/3581783.3612844. Online publication date: 26-Oct-2023.
    • (2023) Filling in the Blank: Rationale-Augmented Prompt Tuning for TextVQA. In Proceedings of the 31st ACM International Conference on Multimedia, 1261-1272. https://doi.org/10.1145/3581783.3612520. Online publication date: 26-Oct-2023.
    • (2023) Towards Models that Can See and Read. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21661-21671. https://doi.org/10.1109/ICCV51070.2023.01985. Online publication date: 1-Oct-2023.
    • (2022) Inferential Visual Question Generation. In Proceedings of the 30th ACM International Conference on Multimedia, 4164-4174. https://doi.org/10.1145/3503161.3548055. Online publication date: 10-Oct-2022.
    • (2022) From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA. In Proceedings of the 30th ACM International Conference on Multimedia, 4564-4572. https://doi.org/10.1145/3503161.3547977. Online publication date: 10-Oct-2022.
    • (2022) LaTr: Layout-Aware Transformer for Scene-Text VQA. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16527-16537. https://doi.org/10.1109/CVPR52688.2022.01605. Online publication date: Jun-2022.
    • (2022) Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering. In Computer Vision – ACCV 2022, 658-674. https://doi.org/10.1007/978-3-031-26316-3_39. Online publication date: 4-Dec-2022.
