Abstract
Recently, visual question answering (VQA) based on the fusion of visual features from the image and textual features from the question has attracted considerable research interest. Attention mechanisms and dense iterative operations are adopted for fine-grained interplay and matching by aggregating the similarities of image-region and question-word pairs. However, the autocorrelation among image regions is ignored, which leads to deviations in overall semantic understanding and thereby reduces the accuracy of answer prediction. Moreover, we notice that some valuable but unattended edge information in the image is often completely forgotten after multiple bilateral co-attention operations. In this paper, a novel Compound-Attention Network with Original Feature Injection is proposed to leverage both bilateral information and autocorrelation in a holistic deep framework. A visual feature enhancement mechanism is designed to mine more complete visual semantics and avoid deviations in understanding. Then, an original feature injection module is proposed to retain the unattended edge information of the image. Extensive experiments conducted on the VQA 2.0 dataset demonstrate the effectiveness of the proposed method.
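The abstract names two mechanisms: a visual feature enhancement step that models the autocorrelation among image regions, and an original feature injection module that restores information lost across repeated co-attention. The paper does not include code, so the PyTorch sketch below is only one plausible reading of those two ideas; the class names, the use of multi-head self-attention, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch only -- every module name, dimension, and design
# choice here is an assumption for exposition, not the authors' code.
import torch
import torch.nn as nn


class VisualSelfAttention(nn.Module):
    """Autocorrelation over image regions: each region attends to all
    others, recovering the region-to-region relations that purely
    bilateral (image-question) co-attention ignores."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, num_regions, dim) region features
        enhanced, _ = self.attn(v, v, v)   # region-to-region attention
        return self.norm(v + enhanced)     # residual keeps original signal


class OriginalFeatureInjection(nn.Module):
    """Adds the pre-attention visual features back after the co-attention
    stack, so edge information that attention never selected is retained."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, attended: torch.Tensor,
                original: torch.Tensor) -> torch.Tensor:
        return self.norm(attended + self.proj(original))


if __name__ == "__main__":
    v = torch.randn(2, 36, 512)              # 36 region features per image
    enhanced = VisualSelfAttention(512)(v)   # autocorrelation pass
    out = OriginalFeatureInjection(512)(enhanced, v)
    print(out.shape)                         # torch.Size([2, 36, 512])
```

The residual connections are the key design choice in both modules under this reading: the autocorrelation pass refines region features without discarding them, and the injection step adds the untouched original features back after attention, which is what would let unattended edge information survive.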
Acknowledgements
This work was supported by grants from the Key Research and Development Plan of Shandong Province (No. 2019GGX101015), the Major Scientific and Technological Projects of CNPC under Grant ZD2019-183-001, the Fundamental Research Funds for the Central Universities (No. 20CX05018A), and the China National Study Abroad Fund.
Cite this article
Wu, C., Lu, J., Li, H., et al.: Compound-attention network with original feature injection for visual question and answering. SIViP 15, 1853–1861 (2021). https://doi.org/10.1007/s11760-021-01932-3