
Exploiting Query Knowledge Embedding and Trilinear Joint Embedding for Visual Question Answering

  • Conference paper

Part of the conference proceedings: Advanced Intelligent Computing Technology and Applications (ICIC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14089)


Abstract

Visual Question Answering (VQA) aims to answer natural language questions about a given image. Researchers generally believe that incorporating external knowledge can improve performance on the VQA task. However, existing methods face limitations in acquiring and utilizing such knowledge, preventing them from effectively enhancing a model's question-answering capability. In this paper, we propose a novel VQA approach based on question-query knowledge embedding. We design question-query rules to obtain critical external knowledge and then embed this knowledge by integrating it with the question as the input features of the text modality. Traditional multimodal feature fusion techniques rely solely on local features, which may result in the loss of global information. To address this issue, we introduce a feature fusion method based on trilinear joint embedding. Using an attention mechanism, we generate a feature matrix composed of question, knowledge, and image components; this matrix is then trilinearly joint embedded to form a novel global feature vector. Because the trilinear joint embedding process produces computationally challenging high-dimensional tensors, we employ tensor decomposition to break them down into a sum of several low-rank tensors. Finally, we feed the global feature vector into a classifier to obtain the answer in a multi-class classification fashion. Experimental results on the VQAv2, OKVQA, and VizWiz public datasets demonstrate that our approach achieves accuracy improvements of 1.78%, 3.95%, and 1.16%, respectively. Our code is available at https://github.com/yxNoth/KB-VLT.
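To make the fusion step concrete, the sketch below shows one way a trilinear joint embedding with low-rank tensor decomposition could be realized: the question, knowledge, and image vectors are each projected into a shared low-rank space, multiplied elementwise (a CP-style sum of rank-1 trilinear interactions), and the fused global vector is passed to an answer classifier. This is a minimal illustrative sketch, not the authors' implementation; the class names, dimensions, rank, and answer-vocabulary size are assumptions made for the example.

```python
# Minimal PyTorch sketch of low-rank trilinear joint embedding, assuming the
# question, knowledge, and image features have already been attended and
# pooled to fixed-size vectors. All names and sizes here are illustrative.
import torch
import torch.nn as nn


class LowRankTrilinearFusion(nn.Module):
    """Fuse question, knowledge, and image vectors into one global feature.

    The full trilinear interaction tensor is too large to store, so it is
    approximated by a sum of R rank-1 terms (CP-style decomposition): each
    modality is projected to R dimensions, the three projections are
    multiplied elementwise, and the result is mapped to the output space.
    """

    def __init__(self, dq, dk, dv, rank, out_dim):
        super().__init__()
        self.proj_q = nn.Linear(dq, rank)      # question projection
        self.proj_k = nn.Linear(dk, rank)      # external-knowledge projection
        self.proj_v = nn.Linear(dv, rank)      # image projection
        self.proj_out = nn.Linear(rank, out_dim)

    def forward(self, q, k, v):
        # Elementwise (Hadamard) product of the three low-rank projections
        # corresponds to summing R rank-1 trilinear interactions.
        z = self.proj_q(q) * self.proj_k(k) * self.proj_v(v)
        return self.proj_out(z)


if __name__ == "__main__":
    fusion = LowRankTrilinearFusion(dq=768, dk=768, dv=2048, rank=256, out_dim=1024)
    classifier = nn.Linear(1024, 3129)   # e.g. a typical VQAv2 answer vocabulary size
    q = torch.randn(4, 768)               # pooled question features
    k = torch.randn(4, 768)               # pooled external-knowledge features
    v = torch.randn(4, 2048)              # pooled image features
    logits = classifier(torch.relu(fusion(q, k, v)))
    print(logits.shape)                    # torch.Size([4, 3129])
```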



Acknowledgements

This work is supported by the Natural Science Foundation of Sichuan Province (2022NSFSC0503) and the Sichuan Science and Technology Program (2022ZHCG0007).

Author information

Corresponding author

Correspondence to Zheng Chen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chen, Z., Wen, Y. (2023). Exploiting Query Knowledge Embedding and Trilinear Joint Embedding for Visual Question Answering. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol. 14089. Springer, Singapore. https://doi.org/10.1007/978-981-99-4752-2_64

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-4752-2_64

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4751-5

  • Online ISBN: 978-981-99-4752-2

  • eBook Packages: Computer Science, Computer Science (R0)
