DOI: 10.1145/3323873.3325044
ICMR conference proceedings · Short paper

Stacked Self-Attention Networks for Visual Question Answering

Published: 05 June 2019

Abstract

Given a photograph, the task of Visual Question Answering (VQA) requires joint image and language understanding to answer a question. The main challenges are effectively extracting visual representations of the image and efficiently embedding the question text. To address these challenges, we propose a VQA model that uses stacked self-attention for visual understanding together with a BERT-based question embedding model. In particular, the proposed stacked self-attention mechanism enables the model to focus not only on individual objects but also on the relations between objects. Furthermore, the BERT model is trained in an end-to-end manner to better embed the question sentences. Our model is validated on the well-known VQA v2.0 dataset and achieves state-of-the-art results.
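To make the described architecture concrete, below is a minimal PyTorch sketch of the two components named in the abstract: a stack of self-attention layers over image-region features, so that each region can attend to every other region and capture object-object relations, fused with a question vector that is assumed to be the embedding produced by a BERT encoder. All module names, dimensions, layer counts, the number of regions, and the answer-vocabulary size here are illustrative assumptions, not the authors' released implementation; a random tensor stands in for the BERT output so the sketch stays self-contained.

```python
# Illustrative sketch only: assumes 36 object-detector region features of
# dimension 2048 and a 768-d question embedding (e.g. from BERT); the layer
# counts, hidden size, and answer vocabulary (3129) are assumptions.
import torch
import torch.nn as nn


class StackedSelfAttention(nn.Module):
    """Stack of self-attention layers over image-region features,
    letting every region attend to every other region."""

    def __init__(self, dim=768, heads=8, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, x):                              # x: (B, num_regions, dim)
        for attn, norm in zip(self.layers, self.norms):
            attended, _ = attn(x, x, x)                # self-attention over regions
            x = norm(x + attended)                     # residual + layer norm
        return x


class VQASketch(nn.Module):
    def __init__(self, region_dim=2048, hidden=768, num_answers=3129):
        super().__init__()
        self.proj = nn.Linear(region_dim, hidden)      # project detector features
        self.visual = StackedSelfAttention(dim=hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, regions, q_emb):
        # regions: (B, R, 2048) object-detector features
        # q_emb:   (B, 768) question vector, e.g. a BERT sentence embedding
        v = self.visual(self.proj(regions))                           # (B, R, 768)
        weights = torch.softmax((v * q_emb.unsqueeze(1)).sum(-1), 1)  # (B, R)
        v_pooled = (weights.unsqueeze(-1) * v).sum(1)                 # (B, 768)
        return self.classifier(v_pooled * q_emb)                      # answer logits


model = VQASketch()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 3129])
```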





    Published In

    ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
    June 2019
    427 pages
    ISBN:9781450367653
    DOI:10.1145/3323873

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. language understanding
    2. scene understanding
    3. visual question answering

    Qualifiers

    • Short-paper

    Conference

    ICMR '19

    Acceptance Rates

    Overall Acceptance Rate 88 of 241 submissions, 37%


    Cited By

    • (2025) Seeing and Reasoning: A Simple Deep Learning Approach to Visual Question Answering. Big Data Mining and Analytics 8(2), 458-478. DOI: 10.26599/BDMA.2024.9020079. Online publication date: Apr-2025
    • (2024) Enhancing Visual Question Answering through Bi-Modal Feature Fusion: Performance Analysis. Proceedings of the 2024 6th International Conference on Image Processing and Machine Vision, 115-122. DOI: 10.1145/3645259.3645278. Online publication date: 12-Jan-2024
    • (2024) Visual Question Answering: Convolutional Vision Transformers with Image-Guided Knowledge and Stacked Attention. 2024 Fourth International Conference on Multimedia Processing, Communication & Information Technology (MPCIT), 71-77. DOI: 10.1109/MPCIT62449.2024.10892649. Online publication date: 13-Dec-2024
    • (2024) AENet: attention enhancement network for industrial defect detection in complex and sensitive scenarios. The Journal of Supercomputing 80(9), 11845-11868. DOI: 10.1007/s11227-024-05898-0. Online publication date: 1-Jun-2024
    • (2024) Refining Medical Text Query Responses. Mathematical Modeling for Computer Applications, 241-262. DOI: 10.1002/9781394248438.ch15. Online publication date: 13-Sep-2024
    • (2023) A Robust Deep Learning Enhanced Monocular SLAM System for Dynamic Environments. Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, 508-515. DOI: 10.1145/3591106.3592295. Online publication date: 12-Jun-2023
    • (2023) Multi-Granularity Interaction and Integration Network for Video Question Answering. IEEE Transactions on Circuits and Systems for Video Technology 33(12), 7684-7695. DOI: 10.1109/TCSVT.2023.3278492. Online publication date: Dec-2023
    • (2023) Artificial intelligence foundation and pre-trained models: Fundamentals, applications, opportunities, and social impacts. Simulation Modelling Practice and Theory 126, 102754. DOI: 10.1016/j.simpat.2023.102754. Online publication date: Jul-2023
    • (2023) Co-attention graph convolutional network for visual question answering. Multimedia Systems 29(5), 2527-2543. DOI: 10.1007/s00530-023-01125-7. Online publication date: 20-Jun-2023
    • (2022) Deep Modular Bilinear Attention Network for Visual Question Answering. Sensors 22(3), 1045. DOI: 10.3390/s22031045. Online publication date: 28-Jan-2022
