
Exploiting Query Knowledge Embedding and Trilinear Joint Embedding for Visual Question Answering

  • Conference paper

Part of the conference proceedings: Advanced Intelligent Computing Technology and Applications (ICIC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14089)


Abstract

Visual Question Answering (VQA) aims to answer natural language questions about a given image. Researchers generally believe that incorporating external knowledge can improve performance on the VQA task. However, existing methods face limitations in acquiring and utilizing such knowledge, preventing them from effectively enhancing a model's question-answering capability. In this paper, we propose a novel VQA approach based on question-query knowledge embedding. We design question-query rules to obtain critical external knowledge and then embed this knowledge by integrating it with the question as the input features of the text modality. Traditional multimodal feature fusion techniques rely solely on local features, which may result in the loss of global information. To address this issue, we introduce a feature fusion method based on trilinear joint embedding. Using an attention mechanism, we generate a feature matrix composed of question, knowledge, and image components; this matrix is then trilinearly joint embedded to form a novel global feature vector. Because the trilinear joint embedding process produces computationally challenging high-dimensional tensors, we employ tensor decomposition to break them down into a sum of several low-rank tensors. Finally, we feed the global feature vector into a classifier to obtain the answer in a multi-class classification fashion. Experimental results on the VQAv2, OKVQA, and VizWiz public datasets demonstrate that our approach achieves accuracy improvements of 1.78%, 3.95%, and 1.16%, respectively. Our code is available at https://github.com/yxNoth/KB-VLT.
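To make the fusion step concrete, the sketch below shows one way a trilinear joint embedding with low-rank tensor decomposition could be realized: the question, knowledge, and image vectors are each projected into a shared low-rank space, multiplied elementwise (a CP-style sum of rank-1 trilinear interactions), and the fused global vector is passed to an answer classifier. This is a minimal illustrative sketch, not the authors' implementation; the class names, dimensions, rank, and answer-vocabulary size are assumptions made for the example.

```python
# Minimal PyTorch sketch of low-rank trilinear joint embedding, assuming the
# question, knowledge, and image features have already been attended and
# pooled to fixed-size vectors. All names and sizes here are illustrative.
import torch
import torch.nn as nn


class LowRankTrilinearFusion(nn.Module):
    """Fuse question, knowledge, and image vectors into one global feature.

    The full trilinear interaction tensor is too large to store, so it is
    approximated by a sum of R rank-1 terms (CP-style decomposition): each
    modality is projected to R dimensions, the three projections are
    multiplied elementwise, and the result is mapped to the output space.
    """

    def __init__(self, dq, dk, dv, rank, out_dim):
        super().__init__()
        self.proj_q = nn.Linear(dq, rank)      # question projection
        self.proj_k = nn.Linear(dk, rank)      # external-knowledge projection
        self.proj_v = nn.Linear(dv, rank)      # image projection
        self.proj_out = nn.Linear(rank, out_dim)

    def forward(self, q, k, v):
        # Elementwise (Hadamard) product of the three low-rank projections
        # corresponds to summing R rank-1 trilinear interactions.
        z = self.proj_q(q) * self.proj_k(k) * self.proj_v(v)
        return self.proj_out(z)


if __name__ == "__main__":
    fusion = LowRankTrilinearFusion(dq=768, dk=768, dv=2048, rank=256, out_dim=1024)
    classifier = nn.Linear(1024, 3129)   # e.g. a typical VQAv2 answer vocabulary size
    q = torch.randn(4, 768)               # pooled question features
    k = torch.randn(4, 768)               # pooled external-knowledge features
    v = torch.randn(4, 2048)              # pooled image features
    logits = classifier(torch.relu(fusion(q, k, v)))
    print(logits.shape)                    # torch.Size([4, 3129])
```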



Acknowledgements

This work is supported by the Natural Science Foundation of Sichuan Province (2022NSFSC0503) and the Sichuan Science and Technology Program (2022ZHCG0007).

Author information

Corresponding author

Correspondence to Zheng Chen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chen, Z., Wen, Y. (2023). Exploiting Query Knowledge Embedding and Trilinear Joint Embedding for Visual Question Answering. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol. 14089. Springer, Singapore. https://doi.org/10.1007/978-981-99-4752-2_64

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-4752-2_64

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4751-5

  • Online ISBN: 978-981-99-4752-2

  • eBook Packages: Computer Science, Computer Science (R0)
