An Improved Attention and Hybrid Optimization Technique for Visual Question Answering

Abstract

In Visual Question Answering (VQA), the attention mechanism plays a critical role: it identifies the different objects present in an image and tells the machine where to focus. However, current VQA systems learn the attention distribution from region-based or bounding-box-based image features, which are not expressive enough to answer questions about both the foreground objects and the background regions of an image. In this paper, we propose a VQA model that uses image features capable of answering questions about foreground objects as well as background regions. We employ a graph neural network to encode the relationships between image regions and objects, and we generate image captions from this visual-relationship-based image representation. The proposed model thus combines two attention modules, each exploiting the other's knowledge to produce a more informative joint attention, with the caption-based image representation to extract features that can answer questions about foreground objects and background regions. Finally, the performance of the proposed architecture is further improved by a hybrid simulated annealing-Manta Ray Foraging Optimization (SA-MRFO) algorithm, which selects the optimal weight parameters for the model. The proposed model is evaluated on two benchmark datasets: VQA 2.0 and VQA-CP v2.
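
To make the relation-encoding step concrete, the sketch below shows one plausible way to attend over pairwise region relations. It is a minimal illustrative example, not the authors' implementation: the layer sizes, the attention form, and the `RegionRelationEncoder` name are assumptions, with random tensors standing in for detector region features.

```python
# Illustrative sketch (not the paper's architecture): a single graph-attention
# style layer that refines region features using learned pairwise relation weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionRelationEncoder(nn.Module):
    """Treat detected regions as graph nodes; attend over all region pairs."""
    def __init__(self, dim=2048, hidden=512):
        super().__init__()
        self.q = nn.Linear(dim, hidden)   # query projection
        self.k = nn.Linear(dim, hidden)   # key projection
        self.v = nn.Linear(dim, dim)      # value/message projection

    def forward(self, regions):           # regions: (num_regions, dim)
        q, k = self.q(regions), self.k(regions)
        # Pairwise relation scores between every region pair (the graph edges).
        scores = q @ k.t() / k.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        # Each region aggregates messages from the regions it is related to.
        return regions + attn @ self.v(regions)

# Toy usage with random features standing in for Faster R-CNN region vectors.
feats = torch.randn(36, 2048)
print(RegionRelationEncoder()(feats).shape)  # torch.Size([36, 2048])
```

A layer of this shape lets every region aggregate evidence from all others, which is the property the abstract relies on for answering both foreground and background questions.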

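The hybrid SA-MRFO weight-selection step can likewise be sketched. Again, this is a hedged illustration rather than the paper's implementation: the `fitness` callable, population size, foraging coefficients, and geometric cooling schedule are placeholder assumptions; in practice the fitness would score the VQA model under a candidate weight setting.

```python
# Hypothetical sketch of a hybrid simulated annealing / Manta Ray Foraging
# Optimization (SA-MRFO) loop that searches for a weight vector minimizing
# a black-box fitness function. All hyperparameters are illustrative.
import numpy as np

def sa_mrfo(fitness, dim, n_agents=20, n_iters=100, lo=-1.0, hi=1.0,
            t0=1.0, cooling=0.95, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lo, hi, (n_agents, dim))          # manta ray population
    fit = np.array([fitness(x) for x in pop])
    best, best_fit = pop[fit.argmin()].copy(), fit.min()
    temp = t0
    for t in range(1, n_iters + 1):
        for i in range(n_agents):
            r = rng.random()
            if rng.random() < 0.5:
                # Cyclone foraging: spiral around the best (or a random point early on).
                beta = 2 * np.exp(r * (n_iters - t + 1) / n_iters) * np.sin(2 * np.pi * r)
                ref = best if t / n_iters > rng.random() else rng.uniform(lo, hi, dim)
                prev = pop[i - 1] if i > 0 else ref
                cand = ref + rng.random(dim) * (prev - pop[i]) + beta * (ref - pop[i])
            else:
                # Chain foraging: move toward the best and the preceding manta ray.
                alpha = 2 * rng.random(dim) * np.sqrt(np.abs(np.log(rng.random() + 1e-12)))
                prev = pop[i - 1] if i > 0 else best
                cand = pop[i] + rng.random(dim) * (prev - pop[i]) + alpha * (best - pop[i])
            cand = np.clip(cand, lo, hi)
            cf = fitness(cand)
            # Simulated-annealing acceptance: occasionally keep a worse move,
            # with probability decaying as the temperature cools.
            if cf < fit[i] or rng.random() < np.exp((fit[i] - cf) / max(temp, 1e-12)):
                pop[i], fit[i] = cand, cf
                if cf < best_fit:
                    best, best_fit = cand.copy(), cf
        # Somersault foraging: flip each agent around the best solution so far.
        for i in range(n_agents):
            cand = np.clip(pop[i] + 2.0 * (rng.random(dim) * best - rng.random(dim) * pop[i]), lo, hi)
            cf = fitness(cand)
            if cf < fit[i]:
                pop[i], fit[i] = cand, cf
                if cf < best_fit:
                    best, best_fit = cand.copy(), cf
        temp *= cooling  # geometric cooling schedule
    return best, best_fit

# Toy usage: minimize the sphere function as a stand-in for validation loss.
w, f = sa_mrfo(lambda x: float(np.sum(x ** 2)), dim=8)
print(f"best fitness: {f:.4f}")
```

The annealing acceptance rule lets occasionally worse foraging moves survive while the temperature is high, which is the usual motivation for hybridizing simulated annealing with a swarm-based search.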

Author information

Correspondence to Himanshu Sharma.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, H., Jalal, A.S. An Improved Attention and Hybrid Optimization Technique for Visual Question Answering. Neural Process Lett 54, 709–730 (2022). https://doi.org/10.1007/s11063-021-10655-y
