Abstract
In Visual Question Answering (VQA), the attention mechanism plays a critical role: it identifies the different objects present in an image and tells the machine where to focus by highlighting the salient visual content. However, current VQA systems use region-based or bounding-box-based image features to learn the attention distribution, and these features are not expressive enough to answer questions about the foreground objects or background regions of an image. In this paper, we propose a VQA model whose image features are effective enough to answer questions related to both the foreground object and the background region. We also use a graph neural network to encode the relationships between the image regions and the objects in an image, and we generate image captions from this visual-relationship-based image representation. The proposed model thus employs two attention modules that exploit each other's knowledge to produce a more influential joint attention, which, together with the caption-based image representation, extracts features capable of answering questions about both the foreground object and the background region. Finally, the performance of the proposed architecture is further improved by a hybrid simulated annealing–Manta ray foraging optimization (SA-MRFO) algorithm, which selects the optimal weight parameters for the proposed model. The performance of the proposed model is evaluated on two benchmark datasets: VQA 2.0 and VQA-CP v2.
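To give a feel for the weight-selection step, the sketch below shows a plain simulated-annealing loop minimizing a loss over a scalar weight. This is an illustrative toy only, not the paper's hybrid SA-MRFO algorithm: the loss function, step size, and cooling schedule are all assumptions, and the Manta ray foraging component is omitted.

```python
import math
import random

def simulated_annealing(loss, w0, t0=1.0, cooling=0.95, steps=200, seed=0):
    """Minimize `loss` over a scalar weight via simulated annealing.

    Toy illustration: real weight selection (as in SA-MRFO) would operate
    on model validation loss over a vector of weights.
    """
    rng = random.Random(seed)
    w, best_w = w0, w0
    current = best = loss(w0)
    t = t0
    for _ in range(steps):
        cand = w + rng.gauss(0, 0.1)  # perturb the current weight
        c_loss = loss(cand)
        # Always accept improvements; accept worse moves with prob exp(-d/T),
        # which lets the search escape local minima while T is still high.
        if c_loss < current or rng.random() < math.exp(-(c_loss - current) / t):
            w, current = cand, c_loss
            if current < best:
                best_w, best = w, current
        t *= cooling  # cool the temperature so late moves become greedy
    return best_w, best

# Toy quadratic loss with its minimum at w = 0.7 (a stand-in for
# validation loss as a function of one model weight).
w_opt, l_opt = simulated_annealing(lambda w: (w - 0.7) ** 2, w0=0.0)
```

The acceptance rule is the standard Metropolis criterion; a hybrid scheme like SA-MRFO would interleave such annealing steps with population-based foraging updates.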
Sharma, H., Jalal, A.S. An Improved Attention and Hybrid Optimization Technique for Visual Question Answering. Neural Process Lett 54, 709–730 (2022). https://doi.org/10.1007/s11063-021-10655-y