Abstract
In recent years, research on attention mechanisms has made significant progress in computer vision. In visual question answering on remote sensing images, an attention mechanism lets the model focus on important image regions and improves answer accuracy. Our research focuses on the role of synergistic attention mechanisms in the interaction between question representations and visual representations. Building on Modular Collaborative Attention (MCA) and exploiting the complementary characteristics of global and local features, we use a hybrid connection strategy that perceives global features without weakening the attention distribution over local features. We evaluate the impact of attention mechanisms on four types of visual question answering questions: (i) scene classification, (ii) object comparison, (iii) quantitative statistics, and (iv) relational judgment. By fusing the global and local features of different modalities, the model captures more information across modalities. Model performance is evaluated on the RSVQA-LR dataset. Experimental results show that the proposed method improves global accuracy by 9.81% over RSVQA.
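As an illustration of the evaluation described above, the following minimal sketch computes accuracy per question type and a global accuracy over all questions. It is not the authors' evaluation code: the function name `accuracy_by_type`, the record layout, and the toy predictions are hypothetical, and global accuracy is assumed here to mean the fraction of all questions answered correctly.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return (per-type accuracy dict, global accuracy).

    `records` is an iterable of (question_type, predicted, truth) tuples.
    This layout is an assumption made for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, truth in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == truth)
    # Accuracy within each question type.
    per_type = {qtype: correct[qtype] / total[qtype] for qtype in total}
    # Global accuracy: correct answers over all questions, regardless of type.
    overall = sum(correct.values()) / sum(total.values())
    return per_type, overall


# Hypothetical toy predictions, not results from the paper.
records = [
    ("scene classification", "urban", "urban"),
    ("scene classification", "rural", "urban"),
    ("object comparison", "yes", "yes"),
    ("quantitative statistics", "3", "3"),
    ("quantitative statistics", "0", "2"),
    ("relational judgment", "no", "no"),
]

per_type, overall = accuracy_by_type(records)
for qtype, acc in per_type.items():
    print(f"{qtype}: {acc:.2%}")
print(f"global: {overall:.2%}")
```

Reporting both views is useful because a global figure can be dominated by the most frequent question type, while the per-type figures show where an attention mechanism helps or hurts.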
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61772399 and Grant 62101517, in part by the Key Research and Development Program in Shaanxi Province of China under Grant 2019ZDLGY09-05, and in part by the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project).
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Zhang, S., Wei, Q., Li, Y., Chen, Y., Jiao, L. (2022). Visual Question Answering of Remote Sensing Image Based on Attention Mechanism. In: Shi, Z., Jin, Y., Zhang, X. (eds) Intelligence Science IV. ICIS 2022. IFIP Advances in Information and Communication Technology, vol 659. Springer, Cham. https://doi.org/10.1007/978-3-031-14903-0_25
DOI: https://doi.org/10.1007/978-3-031-14903-0_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14902-3
Online ISBN: 978-3-031-14903-0
eBook Packages: Computer Science (R0)