Abstract
Visual question answering is a multimodal task in which a given image is combined with a corresponding natural-language question to produce an answer. Traditional visual question answering models use region-based, top-down image feature representations. This approach severs the regional features' contextual connection to the global image, so the global semantic information in the visual features is underutilized. To solve this problem, the relationships among regions, and between regions and the global image, must be strengthened to obtain more accurate visual feature representations that correlate better with the corresponding question text. This paper therefore proposes a multi-level visual feature enhancement method (MLVE). It consists mainly of a separated visual feature representation (SVFR) module and a joint visual feature representation (JVFR) module. A graph attention network is the core component of both modules, strengthening the relationships among regions and between regions and the global image. The two modules learn different levels of visual semantic relationships and thereby provide richer visual feature representations. The effectiveness of this scheme is verified on the VQA2.0 dataset.
This work is supported in part by the National Key Research and Development Plan under Grant No. 2019YFB1404700.
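Only the abstract is available here, so the authors' exact architecture is not reproducible from this page. As an illustration only, the following is a minimal sketch of the general idea the abstract describes: region features plus a global-image node are placed on a fully connected graph and refined with a single graph attention layer, so attention captures both region-region and region-global relationships. All names, shapes, and the choice of a mean-pooled global feature are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over a fully connected node graph.

    Hypothetical sketch: the nodes are K region vectors plus one
    global-image vector, so one attention pass models both
    region-region and region-global relationships.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared node projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # pairwise attention scorer

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, in_dim); the graph is fully connected,
        # so every node attends to every other node.
        h = self.W(nodes)                                 # (N, out_dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)              # (N, N, out_dim)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)                  # attention weights per node
        return F.elu(alpha @ h)                           # (N, out_dim)

# Toy usage: 36 region features (e.g. from Faster R-CNN, 2048-d)
# plus one global node standing in for a global CNN feature.
regions = torch.randn(36, 2048)
global_feat = regions.mean(dim=0, keepdim=True)          # assumed global summary
nodes = torch.cat([regions, global_feat], dim=0)         # (37, 2048)
gat = GraphAttentionLayer(2048, 512)
enhanced = gat(nodes)                                    # (37, 512) relation-enhanced features
```

In this sketch a single shared graph does both jobs at once; the paper's SVFR and JVFR modules are described as learning different levels of such relationships separately, which this toy example does not attempt to reproduce.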