DOI: 10.1145/3581783.3612844

VTQAGen: BART-based Generative Model For Visual Text Question Answering

Published: 27 October 2023

Abstract

Visual Text Question Answering (VTQA) is a challenging task that requires answering questions about visual content by combining image understanding and language comprehension. The goal is to develop models that provide accurate, relevant answers by drawing on complementary information from images and text together with the semantic meaning of the question. Despite ongoing efforts, VTQA still poses several challenges, including multimedia alignment, multi-step cross-media reasoning, and handling open-ended questions. This paper introduces VTQAGen, a novel generative framework that leverages a Multi-modal Attention Layer to combine image-text pairs with question inputs and a BART-based model for reasoning and entity extraction over both modalities. The framework further incorporates a step-based ensemble method to improve performance and generalization. Concretely, VTQAGen adopts an encoder-decoder generative model based on BART: Faster R-CNN extracts visual regions of interest, the BART encoder is modified to handle multi-modal interaction, and the decoder follows the shift-predict approach with step-based logits fusion to improve stability and accuracy. In the experiments, VTQAGen demonstrates strong performance on the test set, securing second place in the ACM Multimedia Visual Text Question Answering Challenge.
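The abstract only sketches the encoder side of the architecture, so the following is a minimal illustrative sketch (not the authors' released code) of how Faster R-CNN region features and question tokens could be packed into one sequence for a BART-style encoder. It assumes PyTorch and Hugging Face Transformers; the 2048-dimensional region features, the linear projection, and all variable names are assumptions for illustration, and the paper's actual multi-modal attention layer may differ.

    import torch
    import torch.nn as nn
    from transformers import BartModel, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    bart = BartModel.from_pretrained("facebook/bart-base")

    # Placeholder for Faster R-CNN output: 36 pooled region features of size 2048.
    region_feats = torch.randn(1, 36, 2048)

    # Project visual features into BART's hidden size so both modalities share one space.
    visual_proj = nn.Linear(2048, bart.config.d_model)
    visual_embeds = visual_proj(region_feats)                    # (1, 36, d_model)

    # Embed the question (plus any accompanying text) with BART's own token embeddings.
    question = "What colour is the bus mentioned in the article?"
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = bart.get_input_embeddings()(text_ids)          # (1, T, d_model)

    # Concatenate the two modalities; the encoder's self-attention then lets every
    # position attend across image regions and text, standing in for the paper's
    # multi-modal attention layer.
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

    with torch.no_grad():
        enc_out = bart.encoder(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
    print(enc_out.last_hidden_state.shape)                       # (1, 36 + T, d_model)

In the actual system, the projected features would come from Faster R-CNN detections rather than random tensors, and the concatenated sequence would be trained end to end with the decoder.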

Supplemental Material

MP4 File
This video first introduces the background of our work: VTQA is a challenging task that requires answering questions about visual content by combining image understanding and language comprehension. We then summarize related work and our motivation: despite ongoing efforts, VTQA still poses several challenges, including multimedia alignment, multi-step cross-media reasoning, and handling open-ended questions. To address these challenges, we describe our novel generative framework, VTQAGen, which leverages a Multi-modal Attention Layer to combine image-text pairs with question inputs and a BART-based model for reasoning and entity extraction over both images and text. In the experiments, VTQAGen demonstrates strong performance on the test set, securing second place in the ACM Multimedia Visual Text Question Answering Challenge.
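The step-based logits fusion mentioned above can be pictured as averaging the next-token logits of several fine-tuned checkpoints at every decoding step and feeding the chosen token back to all ensemble members. The snippet below is a minimal sketch of that idea under stated assumptions (PyTorch, Hugging Face Transformers, and plain text-only BART inputs in place of the multi-modal encoder); the checkpoint choices, the greedy search, and the simple mean are illustrative, not the paper's exact procedure.

    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    # Stand-ins for several differently fine-tuned VTQA checkpoints.
    members = [BartForConditionalGeneration.from_pretrained("facebook/bart-base")
               for _ in range(2)]

    def fused_greedy_decode(input_ids, max_new_tokens=20):
        # Shift-predict style greedy decoding: the token chosen at each step is
        # appended to the decoder input of every ensemble member for the next step.
        start_id = members[0].config.decoder_start_token_id
        decoder_ids = torch.full((input_ids.size(0), 1), start_id, dtype=torch.long)
        for _ in range(max_new_tokens):
            step_logits = []
            with torch.no_grad():
                for model in members:
                    out = model(input_ids=input_ids, decoder_input_ids=decoder_ids)
                    step_logits.append(out.logits[:, -1, :])   # next-token logits
            fused = torch.stack(step_logits).mean(dim=0)       # step-based logits fusion
            next_token = fused.argmax(dim=-1, keepdim=True)
            decoder_ids = torch.cat([decoder_ids, next_token], dim=-1)
            if (next_token == tokenizer.eos_token_id).all():
                break
        return decoder_ids

    question = tokenizer("Question: what is shown in the image?", return_tensors="pt")
    answer_ids = fused_greedy_decode(question.input_ids)
    print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))

Fusing logits per step, rather than averaging final scores of independently decoded sequences, keeps the ensemble members synchronized on a single answer prefix, which is what makes the per-step fusion stable.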


Cited By

  • (2024) Demonstrative Instruction Following in Multimodal LLMs via Integrating Low-Rank Adaptation with Ensemble Learning. In Proceedings of the 32nd ACM International Conference on Multimedia, 11435-11441. https://doi.org/10.1145/3664647.3688995. Online publication date: 28 October 2024.

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cross-media reasoning
      2. multi-modal attention
      3. visual text question answering

      Qualifiers

      • Research-article

      Funding Sources

      • Major Science and Technology Innovation 2030 "New Generation Artificial Intelligence" project

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa, ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

