Abstract
Better capturing the interactions between different modalities has recently become an active research topic in visual question answering (VQA). Inspired by human visual information processing, we propose a VQA method based on intra-modality feature interaction with a self-attention mechanism (IMFI-SA). We adopt object-level features obtained with bottom-up attention, rather than convolutional feature maps, to extract fine-grained information from images. The proposed IMFI-SA model also captures the intra-modality interactions within the question modality and within the image modality, respectively. Finally, we combine the object-level feature interactions, enhanced by top-down cross-attention, with the question feature interactions to predict the answer for a given question and image. Experimental results on the VQA2.0 dataset show that the proposed method outperforms existing methods in answer reasoning, especially on counting questions.
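The core component described in the abstract is intra-modality self-attention: each object-level region feature (or question word feature) attends to all other features of the same modality. As the paper's code is not reproduced here, the following is a minimal sketch of that idea in PyTorch; the module name, single-head design, dimensions, and residual connection are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of intra-modality self-attention over object-level
# features, in the spirit of IMFI-SA. All names and dimensions are
# illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalitySelfAttention(nn.Module):
    """Lets each region (or word) feature attend to every other feature
    of the same modality, producing interaction-enhanced features."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** 0.5  # scaled dot-product, as in Vaswani et al.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_items, dim), e.g. 36 bottom-up region features
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v + x  # residual keeps the original features

# Usage: enhance 36 object-level image features of dimension 2048
regions = torch.randn(8, 36, 2048)            # batch of 8 images
enhanced = IntraModalitySelfAttention(2048)(regions)
print(enhanced.shape)                         # torch.Size([8, 36, 2048])
```

The same module could in principle be applied to the question's word embeddings, giving the two intra-modality interaction branches the abstract describes before they are fused by cross-attention.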
Acknowledgments
Supported by the National Natural Science Foundation of China (61773272, 61272258, 61301299), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (19KJA230001), and the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Shao, H., Xu, Y., Ji, Y., Yang, J., Liu, C. (2019). Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Communications in Computer and Information Science, vol 1143. Springer, Cham. https://doi.org/10.1007/978-3-030-36802-9_24
DOI: https://doi.org/10.1007/978-3-030-36802-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36801-2
Online ISBN: 978-3-030-36802-9
eBook Packages: Computer Science (R0)