ImageFuse: A Multi-view Image Featurization Framework for Visual Question Answering

  • Conference paper
Intelligent Systems Design and Applications (ISDA 2020)

Abstract

Visual Question Answering (VQA) is the task of producing a correct answer to a question asked about an image. This paper proposes ImageFuse, a novel image featurization framework for VQA. Instead of directly adopting a common representation from one of the popular ImageNet CNN models via transfer learning, it combines feature fusion networks to form a fine-grained image representation. Two parallel fusion networks, one based on Canonical Correlation Analysis (CCA) and one on Autoencoders (AE), capture the linear and non-linear relationships that exist among multiple views of the image. Extensive experiments on the DAQUAR VQA dataset show a significant improvement for the proposed framework over VQA systems based on a single image representation.
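
As a rough illustration of the linear (CCA) branch described above, the following minimal sketch fuses two feature views by projecting each onto its top canonical directions and concatenating the projections. It is not the authors' implementation: the function names `inv_sqrt` and `cca_fuse`, the regularization constant, the concatenation step, and the random stand-in features (used here in place of real CNN features) are all assumptions made for illustration, and the autoencoder branch is not shown.

```python
import numpy as np

def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca_fuse(view_a, view_b, k, reg=1e-3):
    """Hypothetical helper: project two views onto k canonical directions
    and concatenate the projections into one fused representation."""
    # Center each view so covariances are computed around the mean.
    a = view_a - view_a.mean(axis=0)
    b = view_b - view_b.mean(axis=0)
    n = a.shape[0]
    # Regularized within-view covariances and the cross-view covariance.
    cov_aa = a.T @ a / (n - 1) + reg * np.eye(a.shape[1])
    cov_bb = b.T @ b / (n - 1) + reg * np.eye(b.shape[1])
    cov_ab = a.T @ b / (n - 1)
    # Whiten both views; the SVD of the whitened cross-covariance gives
    # the canonical directions and the canonical correlations.
    white_a, white_b = inv_sqrt(cov_aa), inv_sqrt(cov_bb)
    u, corrs, vt = np.linalg.svd(white_a @ cov_ab @ white_b)
    proj_a = a @ white_a @ u[:, :k]
    proj_b = b @ white_b @ vt.T[:, :k]
    return np.hstack([proj_a, proj_b]), corrs[:k]

# Synthetic stand-ins for two views of the same 200 images: both are
# noisy linear mixtures of a shared 2-dimensional latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
view_a = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
view_b = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

fused, corrs = cca_fuse(view_a, view_b, k=2)
print(fused.shape)  # (200, 4): k projections from each view
```

Concatenating the two projections is only one possible fusion rule; summing them is another common choice, and the abstract does not say which, if either, ImageFuse uses.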

Notes

  1. μ: Membership measure.

  2. Ai, Ti: ith predicted answer and ith ground truth answer.

  3. WUP(a, b): Similarity based on the depth of two words ‘a’ and ‘b’ in the WordNet taxonomy.


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Manmadhan, S., Kovoor, B.C. (2021). ImageFuse: A Multi-view Image Featurization Framework for Visual Question Answering. In: Abraham, A., Piuri, V., Gandhi, N., Siarry, P., Kaklauskas, A., Madureira, A. (eds) Intelligent Systems Design and Applications. ISDA 2020. Advances in Intelligent Systems and Computing, vol 1351. Springer, Cham. https://doi.org/10.1007/978-3-030-71187-0_14
