
Towards Reasoning Ability in Scene Text Visual Question Answering

Published: 17 October 2021

Abstract

Work on scene text visual question answering (TextVQA) consistently emphasizes the importance of reasoning over questions and image contents. However, we find that current TextVQA models lack reasoning ability and tend to answer questions by exploiting dataset biases and language priors. Moreover, our observations indicate that recent accuracy improvements in TextVQA come mainly from stronger OCR engines, better pre-training strategies, and additional Transformer layers, rather than from newly proposed networks. In this work, with a focus on reasoning ability, we 1) conduct a module-wise contribution analysis to quantitatively investigate how existing works improve TextVQA accuracy; 2) design a gradient-based explainability method to explore why TextVQA models answer what they answer and to find evidence for their predictions; and 3) perform qualitative experiments to visually analyze models' reasoning ability and to explore the potential reasons behind its weakness.
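The gradient-based explainability method mentioned in point 2) traces a predicted answer's score back to the model's multimodal inputs. The sketch below is a minimal gradient × input attribution over a toy Transformer answer classifier in PyTorch; it illustrates the general family of gradient-based attribution techniques, not the authors' exact method, and every name in it (ToyTextVQA, feats, the dimensions) is a hypothetical stand-in.

```python
# Minimal, illustrative sketch of gradient-based attribution for a
# TextVQA-style model -- NOT the paper's exact method. It scores how much
# each input token feature (question words, OCR tokens, objects)
# contributes to the predicted answer via gradient x input.
import torch
import torch.nn as nn

class ToyTextVQA(nn.Module):
    """Hypothetical stand-in for a Transformer-based TextVQA model: fuses a
    sequence of multimodal token features, then classifies over a fixed
    answer vocabulary."""
    def __init__(self, dim=64, n_answers=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, feats):                        # feats: (batch, tokens, dim)
        fused = self.fuse(feats)
        return self.classifier(fused.mean(dim=1))    # (batch, n_answers)

model = ToyTextVQA().eval()
feats = torch.randn(1, 12, 64, requires_grad=True)   # 12 multimodal tokens

logits = model(feats)
answer = logits.argmax(dim=-1).item()
logits[0, answer].backward()          # backprop the winning answer's logit

# Gradient x input, summed over the feature dimension, yields one
# attribution score per token; large positive scores mark the tokens
# the model treated as evidence for its answer.
attribution = (feats.grad * feats).sum(dim=-1).squeeze(0).detach()
print(f"predicted answer id: {answer}")
print("per-token attribution:", attribution.tolist())
```

Tokens with large positive attribution scores are the ones the model leaned on as evidence; comparing them against the OCR tokens a human would cite is one way to probe whether an answer reflects reasoning or a learned prior.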



    Information

    Published In

    cover image ACM Conferences
    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. TextVQA
    2. explainability method
    3. quantitative and qualitative analysis
    4. reasoning ability

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • Shanghai Municipal Science and Technology Major Project
    • Shanghai Science and Technology Innovation Action Plan

    Conference

    MM '21
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2023) VTQAGen: BART-based Generative Model for Visual Text Question Answering. In Proceedings of the 31st ACM International Conference on Multimedia, 9456-9461. https://doi.org/10.1145/3581783.3612844. Online publication date: 26-Oct-2023.
    • (2023) Filling in the Blank: Rationale-Augmented Prompt Tuning for TextVQA. In Proceedings of the 31st ACM International Conference on Multimedia, 1261-1272. https://doi.org/10.1145/3581783.3612520. Online publication date: 26-Oct-2023.
    • (2023) Towards Models that Can See and Read. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21661-21671. https://doi.org/10.1109/ICCV51070.2023.01985. Online publication date: 1-Oct-2023.
    • (2022) Inferential Visual Question Generation. In Proceedings of the 30th ACM International Conference on Multimedia, 4164-4174. https://doi.org/10.1145/3503161.3548055. Online publication date: 10-Oct-2022.
    • (2022) From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA. In Proceedings of the 30th ACM International Conference on Multimedia, 4564-4572. https://doi.org/10.1145/3503161.3547977. Online publication date: 10-Oct-2022.
    • (2022) LaTr: Layout-Aware Transformer for Scene-Text VQA. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16527-16537. https://doi.org/10.1109/CVPR52688.2022.01605. Online publication date: Jun-2022.
    • (2022) Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering. In Computer Vision – ACCV 2022, 658-674. https://doi.org/10.1007/978-3-031-26316-3_39. Online publication date: 4-Dec-2022.
