DOI: 10.1145/3323873.3325044
ICMR conference proceedings · Short paper

Stacked Self-Attention Networks for Visual Question Answering

Published: 05 June 2019

Abstract

Given a photograph, the task of Visual Question Answering (VQA) requires joint image and language understanding to answer a question. The main challenges are effectively extracting visual representations of the image and efficiently embedding the question text. To address these challenges, we propose a VQA model that uses stacked self-attention for visual understanding together with a BERT-based question embedding model. In particular, the proposed stacked self-attention mechanism enables the model to focus not only on individual objects but also on the relations between objects. Furthermore, the BERT model is trained in an end-to-end manner to better embed the question sentences. Our model is validated on the well-known VQA v2.0 dataset and achieves state-of-the-art results.
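To make the described architecture concrete, below is a minimal PyTorch sketch of the two components named in the abstract: a stack of self-attention layers over image-region features, so that each region can attend to every other region and capture object-object relations, fused with a question vector that is assumed to be the embedding produced by a BERT encoder. All module names, dimensions, layer counts, the number of regions, and the answer-vocabulary size here are illustrative assumptions, not the authors' released implementation; a random tensor stands in for the BERT output so the sketch stays self-contained.

```python
# Illustrative sketch only: assumes 36 object-detector region features of
# dimension 2048 and a 768-d question embedding (e.g. from BERT); the layer
# counts, hidden size, and answer vocabulary (3129) are assumptions.
import torch
import torch.nn as nn


class StackedSelfAttention(nn.Module):
    """Stack of self-attention layers over image-region features,
    letting every region attend to every other region."""

    def __init__(self, dim=768, heads=8, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, x):                              # x: (B, num_regions, dim)
        for attn, norm in zip(self.layers, self.norms):
            attended, _ = attn(x, x, x)                # self-attention over regions
            x = norm(x + attended)                     # residual + layer norm
        return x


class VQASketch(nn.Module):
    def __init__(self, region_dim=2048, hidden=768, num_answers=3129):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden)      # project detector features
        self.visual = StackedSelfAttention(dim=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, regions, q_emb):
        # regions: (B, R, 2048) object-detector features
        # q_emb:   (B, 768) question vector, e.g. a BERT sentence embedding
        v = self.visual(self.proj(regions))                           # (B, R, 768)
        weights = torch.softmax((v * q_emb.unsqueeze(1)).sum(-1), 1)  # (B, R)
        v_pooled = (weights.unsqueeze(-1) * v).sum(1)                 # (B, 768)
        return self.classifier(v_pooled * q_emb)                      # answer logits


model = VQASketch()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 3129])
```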





    Published In

    ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
    June 2019
    427 pages
    ISBN:9781450367653
    DOI:10.1145/3323873

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. language understanding
    2. scene understanding
    3. visual question answering

    Qualifiers

    • Short-paper

    Conference

    ICMR '19

    Acceptance Rates

    Overall Acceptance Rate 88 of 241 submissions, 37%


    Cited By

    • (2025) Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering. Big Data Mining and Analytics 8(2), 458-478. DOI: 10.26599/BDMA.2024.9020079. Online publication date: Apr-2025
    • (2024) Enhancing Visual Question Answering through Bi-Modal Feature Fusion: Performance Analysis. Proceedings of the 2024 6th International Conference on Image Processing and Machine Vision, 115-122. DOI: 10.1145/3645259.3645278. Online publication date: 12-Jan-2024
    • (2024) Visual Question Answering: Convolutional Vision Transformers with Image-Guided Knowledge and Stacked Attention. 2024 Fourth International Conference on Multimedia Processing, Communication & Information Technology (MPCIT), 71-77. DOI: 10.1109/MPCIT62449.2024.10892649. Online publication date: 13-Dec-2024
    • (2024) AENet: attention enhancement network for industrial defect detection in complex and sensitive scenarios. The Journal of Supercomputing 80(9), 11845-11868. DOI: 10.1007/s11227-024-05898-0. Online publication date: 1-Jun-2024
    • (2024) Refining Medical Text Query Responses. Mathematical Modeling for Computer Applications, 241-262. DOI: 10.1002/9781394248438.ch15. Online publication date: 13-Sep-2024
    • (2023) A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 508-515. DOI: 10.1145/3591106.3592295. Online publication date: 12-Jun-2023
    • (2023) Multi-Granularity Interaction and Integration Network for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 33(12), 7684-7695. DOI: 10.1109/TCSVT.2023.3278492. Online publication date: Dec-2023
    • (2023) Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts. Simulation Modelling Practice and Theory 126, 102754. DOI: 10.1016/j.simpat.2023.102754. Online publication date: Jul-2023
    • (2023) Co-attention graph convolutional network for visual question answering. Multimedia Systems 29(5), 2527-2543. DOI: 10.1007/s00530-023-01125-7. Online publication date: 20-Jun-2023
    • (2022) Deep Modular Bilinear Attention Network for Visual Question Answering. Sensors 22(3), 1045. DOI: 10.3390/s22031045. Online publication date: 28-Jan-2022
