
Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering

Published: 23 October 2023

Abstract

Knowledge-based visual question answering must not only answer questions about images but also incorporate external knowledge to support reasoning in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related, semantics-rich vision-language connections. Most existing solutions model only simple intra-modality relations or represent a cross-modality relation with a single vector, which makes it difficult to model the complex connections between visual features and question features. We therefore propose a cross-modality multiple relations learning model that enriches cross-modality representations and constructs advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that capture the rich cross-modality connections; these relations link the textual question to the related visual objects, and the resulting multi-modality triplets align the visual objects with their corresponding textual answers. Second, to encourage the multiple relations to align with different semantic relations, we formulate a novel global-local loss: the global loss draws visual objects and their corresponding textual answers closer to each other through cross-modality relations in the vision-language space, while the local loss preserves semantic diversity among the multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms state-of-the-art methods.
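To make the global-local objective more concrete, the following is a minimal PyTorch-style sketch. It is an illustrative assumption rather than the paper's exact formulation: it assumes a TransE-style translation in which a visual-object (head) embedding plus a relation embedding should land near the answer (tail) embedding, with in-batch negatives for the global margin term and a pairwise cosine-similarity penalty for the local diversity term. The function name, margin form, and weighting hyperparameter are hypothetical.

```python
import torch
import torch.nn.functional as F


def global_local_loss(head, relations, tail, margin=1.0, lambda_local=0.1):
    """Hypothetical global-local objective for multi-modality triplets.

    head:      (B, D) visual-object embeddings
    relations: (B, K, D) K cross-modality relation embeddings per sample
    tail:      (B, D) textual-answer embeddings
    """
    B, K, D = relations.shape

    # Global term: translate each visual object by every relation and pull
    # the result toward its answer embedding (TransE-style alignment),
    # while pushing it away from a mismatched answer from the batch.
    translated = head.unsqueeze(1) + relations                  # (B, K, D)
    pos_dist = (translated - tail.unsqueeze(1)).norm(dim=-1)    # (B, K)
    neg_tail = tail.roll(shifts=1, dims=0)                      # in-batch negatives
    neg_dist = (translated - neg_tail.unsqueeze(1)).norm(dim=-1)
    global_loss = F.relu(margin + pos_dist - neg_dist).mean()

    # Local term: keep the K relations semantically diverse by penalizing
    # pairwise cosine similarity between them (zero when K == 1).
    local_loss = torch.tensor(0.0, device=relations.device)
    if K > 1:
        rel_norm = F.normalize(relations, dim=-1)
        sim = torch.matmul(rel_norm, rel_norm.transpose(1, 2))  # (B, K, K)
        off_diag = sim - torch.eye(K, device=sim.device)
        local_loss = off_diag.abs().sum(dim=(1, 2)).mean() / (K * (K - 1))

    return global_loss + lambda_local * local_loss
```

In this sketch, lambda_local trades off triplet alignment against relation diversity; the actual model may weight or formulate the two terms differently.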



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 3 (March 2024), 665 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3613614
  Editor: Abdulmotaleb El Saddik


            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 October 2023
            • Online AM: 2 September 2023
            • Accepted: 20 August 2023
            • Revised: 18 June 2023
            • Received: 25 September 2022
Published in TOMM Volume 20, Issue 3

