Abstract
Knowledge-based visual question answering must not only answer questions about image content but also incorporate external knowledge to reason in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related and semantics-rich vision-language connections. Most existing solutions model only simple intra-modality relations, or represent cross-modality relations with a single vector, which makes it difficult to capture the complex connections between visual features and question features. We therefore propose a cross-modality multiple relations learning model that enriches cross-modality representations and constructs advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that capture the rich cross-modality interactions: the various cross-modality relations link the textual question to the related visual objects, and the resulting multi-modality triplets efficiently align visual objects with their corresponding textual answers. Second, to encourage the multiple relations to align with distinct semantic relations, we formulate a novel global-local loss. The global loss draws visual objects and their corresponding textual answers closer to each other through the cross-modality relations in the vision-language space, while the local loss preserves semantic diversity among the multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms state-of-the-art methods.
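To make the global-local objective concrete, here is a minimal toy sketch of one plausible reading of the abstract: a TransE-style global term pulls each (visual head + relation) translation toward the answer embedding, and a local term penalizes pairwise similarity among the relation vectors to keep them semantically diverse. The function name, the averaging scheme, and the weight `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import math


def _add(u, v):
    return [a + b for a, b in zip(u, v)]


def _sub(u, v):
    return [a - b for a, b in zip(u, v)]


def _norm(u):
    return math.sqrt(sum(a * a for a in u))


def _cos(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return sum(a * b for a, b in zip(u, v)) / (_norm(u) * _norm(v) + 1e-8)


def global_local_loss(head, relations, tail, beta=0.1):
    """Toy global-local loss over one multi-modality triplet.

    head      -- embedding of the visual object (list of floats)
    relations -- list of cross-modality relation vectors
    tail      -- embedding of the textual answer
    beta      -- weight of the local (diversity) term; illustrative choice

    Global term: average distance between the translated head (head + r)
    and the tail, so each relation aligns the visual object with the answer.
    Local term: average pairwise cosine similarity among relation vectors,
    penalized so the multiple relations stay diverse.
    """
    global_loss = sum(_norm(_sub(_add(head, r), tail)) for r in relations) / len(relations)
    pairs = [(i, j) for i in range(len(relations)) for j in range(i + 1, len(relations))]
    local_loss = sum(_cos(relations[i], relations[j]) for i, j in pairs) / max(len(pairs), 1)
    return global_loss + beta * local_loss
```

With this sketch, two identical relation vectors incur a higher local penalty than two orthogonal ones of the same length, matching the stated goal of preserving semantic diversity among the multiple relations.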
Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering