DOI: 10.1145/3581783.3612837

Finetuning Language Models for Multimodal Question Answering

Published: 27 October 2023

Abstract

To achieve multi-modal intelligence, AI must be able to process and respond to inputs from multi-modal sources. However, many current question answering models are restricted to specific answer types, such as yes/no and numbers, and require additional human assessment. The Visual Text Question Answering (VTQA) dataset was recently proposed to fill this gap. In this paper, we conduct an exhaustive analysis and exploration of this task. Specifically, we implement a T5-based multi-modal generative network that overcomes the limitations of a traditional label space and allows greater freedom in responses. Our approach achieves the best performance in both the English and Chinese tracks of the VTQA challenge.
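
As a rough illustration of the generative formulation above (not the authors' released implementation), the sketch below fine-tunes an off-the-shelf T5 checkpoint with Hugging Face Transformers to emit free-form answer strings instead of scores over a fixed label set. The prompt serialization, checkpoint name, and hyperparameters are illustrative assumptions, and the paper's multimodal fusion of image features is omitted: visual content is assumed to have already been reduced to text (e.g., captions or OCR) beforehand.

```python
# Minimal sketch. Assumptions: prompt format, "t5-base" checkpoint, hyperparameters.
# Visual inputs are assumed to be pre-serialized into text; the paper's actual
# multimodal fusion network is not reproduced here.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(question: str, context: str, answer: str) -> float:
    """One gradient step on a single (question, context, answer) triple."""
    # Serialize the question and the (textualized) multimodal context into the
    # encoder input; the decoder is supervised to generate the free-form answer.
    source = f"question: {question} context: {context}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
    labels = tokenizer(answer, return_tensors="pt", truncation=True, max_length=64).input_ids
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Toy usage: the answer is generated text, not an index into a fixed answer vocabulary.
step_loss = training_step(
    question="What color is the bus in the news photo?",
    context="A red double-decker bus is parked outside the central station.",
    answer="red",
)
print(f"loss: {step_loss:.4f}")
```

At inference time, `model.generate(...)` decodes the answer token by token, which is what lifts the fixed labeling-space restriction imposed by classification-style VQA heads.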


Cited By

  • (2025) Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey. ACM Computing Surveys 57(8), 1-36. https://doi.org/10.1145/3711680. Online publication date: 31-Jan-2025.
  • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 229-239. https://doi.org/10.1145/3626772.3657727. Online publication date: 10-Jul-2024.




Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. multi-modal fusion
  2. t5 finetuning
  3. visual question answering

Qualifiers

  • Research-article

Conference

MM '23
Sponsor:
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 130
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 05 Mar 2025


