DOI: 10.1145/3581783.3612844

VTQAGen: BART-based Generative Model For Visual Text Question Answering

Published: 27 October 2023

Abstract

Visual Text Question Answering (VTQA) is a challenging task that requires answering questions about visual content by combining image understanding and language comprehension. The goal is to develop models that provide accurate, relevant answers by drawing on complementary information from images and text together with the semantic meaning of the question. Despite ongoing efforts, VTQA still poses several challenges, including multimedia alignment, multi-step cross-media reasoning, and handling open-ended questions. This paper introduces VTQAGen, a novel generative framework that leverages a Multi-modal Attention Layer to combine image-text pairs with question inputs and a BART-based model for reasoning and entity extraction over both modalities. The framework further incorporates a step-based ensemble method to improve performance and generalization. Concretely, VTQAGen adopts an encoder-decoder generative model based on BART: Faster R-CNN extracts visual regions of interest, the BART encoder is modified to handle multi-modal interaction, and the decoder follows the shift-predict approach with step-based logits fusion to improve stability and accuracy. In the experiments, VTQAGen demonstrates strong performance on the test set, securing second place in the ACM Multimedia Visual Text Question Answering Challenge.
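The abstract only sketches the encoder side of the architecture, so the following is a minimal illustrative sketch (not the authors' released code) of how Faster R-CNN region features and question tokens could be packed into one sequence for a BART-style encoder. It assumes PyTorch and Hugging Face Transformers; the 2048-dimensional region features, the linear projection, and all variable names are assumptions for illustration, and the paper's actual multi-modal attention layer may differ.

    import torch
    import torch.nn as nn
    from transformers import BartModel, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    bart = BartModel.from_pretrained("facebook/bart-base")

    # Placeholder for Faster R-CNN output: 36 pooled region features of size 2048.
    region_feats = torch.randn(1, 36, 2048)

    # Project visual features into BART's hidden size so both modalities share one space.
    visual_proj = nn.Linear(2048, bart.config.d_model)
    visual_embeds = visual_proj(region_feats)                    # (1, 36, d_model)

    # Embed the question (plus any accompanying text) with BART's own token embeddings.
    question = "What colour is the bus mentioned in the article?"
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = bart.get_input_embeddings()(text_ids)          # (1, T, d_model)

    # Concatenate the two modalities; the encoder's self-attention then lets every
    # position attend across image regions and text, standing in for the paper's
    # multi-modal attention layer.
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    with torch.no_grad():
        enc_out = bart.encoder(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
    print(enc_out.last_hidden_state.shape)                       # (1, 36 + T, d_model)

In the actual system, the projected features would come from Faster R-CNN detections rather than random tensors, and the concatenated sequence would be trained end to end with the decoder.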

Supplemental Material

MP4 File
This video first introduces the background of our work: VTQA is a challenging task that requires answering questions about visual content by combining image understanding and language comprehension. We then summarize related work and our motivation: despite ongoing efforts, VTQA still poses several challenges, including multimedia alignment, multi-step cross-media reasoning, and handling open-ended questions. To address these challenges, we describe our novel generative framework, VTQAGen, which leverages a Multi-modal Attention Layer to combine image-text pairs with question inputs and a BART-based model for reasoning and entity extraction over both images and text. In the experiments, VTQAGen demonstrates strong performance on the test set, securing second place in the ACM Multimedia Visual Text Question Answering Challenge.
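The step-based logits fusion mentioned above can be pictured as averaging the next-token logits of several fine-tuned checkpoints at every decoding step and feeding the chosen token back to all ensemble members. The snippet below is a minimal sketch of that idea under stated assumptions (PyTorch, Hugging Face Transformers, and plain text-only BART inputs in place of the multi-modal encoder); the checkpoint choices, the greedy search, and the simple mean are illustrative, not the paper's exact procedure.

    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    # Stand-ins for several differently fine-tuned VTQA checkpoints.
    members = [BartForConditionalGeneration.from_pretrained("facebook/bart-base")
               for _ in range(2)]

    def fused_greedy_decode(input_ids, max_new_tokens=20):
        # Shift-predict style greedy decoding: the token chosen at each step is
        # appended to the decoder input of every ensemble member for the next step.
        start_id = members[0].config.decoder_start_token_id
        decoder_ids = torch.full((input_ids.size(0), 1), start_id, dtype=torch.long)
        for _ in range(max_new_tokens):
            step_logits = []
            with torch.no_grad():
                for model in members:
                    out = model(input_ids=input_ids, decoder_input_ids=decoder_ids)
                    step_logits.append(out.logits[:, -1, :])   # next-token logits
            fused = torch.stack(step_logits).mean(dim=0)       # step-based logits fusion
            next_token = fused.argmax(dim=-1, keepdim=True)
            decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
            if (next_token == tokenizer.eos_token_id).all():
                break
        return decoder_ids

    question = tokenizer("Question: what is shown in the image?", return_tensors="pt")
    answer_ids = fused_greedy_decode(question.input_ids)
    print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))

Fusing logits per step, rather than averaging final scores of independently decoded sequences, keeps the ensemble members synchronized on a single answer prefix, which is what makes the per-step fusion stable.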


Cited By

  • (2024) Demonstrative Instruction Following in Multimodal LLMs via Integrating Low-Rank Adaptation with Ensemble Learning. In Proceedings of the 32nd ACM International Conference on Multimedia, 11435-11441. https://doi.org/10.1145/3664647.3688995. Online publication date: 28 October 2024.

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cross-media reasoning
      2. multi-modal attention
      3. visual text question answering

      Qualifiers

      • Research-article

      Funding Sources

      • Major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" project

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa, ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

