survey

A Survey of Multi-modal Knowledge Graphs: Technologies and Trends

Published: 28 June 2024

Abstract

In recent years, Knowledge Graphs (KGs) have played a crucial role in the development of advanced knowledge-intensive applications, such as recommender systems and semantic search. However, the human sensory system is inherently multi-modal, as objects around us are often represented by a combination of multiple signals, such as visual and textual. Consequently, Multi-modal Knowledge Graphs (MMKGs), which combine structured knowledge representation with multiple modalities, represent a powerful extension of KGs. Although MMKGs can handle tasks (e.g., visual query answering) and queries that standard KGs cannot process, and can effectively tackle some standard problems (e.g., entity alignment), a widely accepted definition of an MMKG is still lacking. In this survey, we provide a rigorous definition of MMKGs along with a classification scheme based on how existing approaches address four fundamental challenges that are crucial to improving an MMKG: representation, fusion, alignment, and translation. Our classification scheme is flexible: it allows new approaches to be incorporated easily, and it allows two approaches to be compared in terms of how they address any one of these four challenges. As the first comprehensive survey of MMKGs, this article aims to inspire researchers in the field of Artificial Intelligence and to provide them with a reference.


    Published In

    ACM Computing Surveys  Volume 56, Issue 11
    November 2024
    977 pages
    EISSN:1557-7341
    DOI:10.1145/3613686

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2024
    Online AM: 10 April 2024
    Accepted: 02 April 2024
    Revised: 28 February 2024
    Received: 15 September 2022
    Published in CSUR Volume 56, Issue 11

    Author Tags

    1. Multi-modal knowledge graphs
    2. four fundamental challenges
    3. pre-training in MMKGs

    Qualifiers

    • Survey

    Funding Sources

    • Research and Demonstration Application of Key Technologies for Personalized Learning Driven by Educational Big Data
    • National Key R&D Program of China
    • National Natural Science Foundation of China

    Article Metrics

    • Downloads (last 12 months): 2,503
    • Downloads (last 6 weeks): 463
    Reflects downloads up to 18 Jan 2025.

    Cited By

    • (2025) Temporal multi-modal knowledge graph generation for link prediction. Neural Networks 185 (107108). DOI: 10.1016/j.neunet.2024.107108. Online publication date: May-2025
    • (2025) Beyond expression: Comprehensive visualization of knowledge triplet facts. Information Processing & Management 62:3 (104062). DOI: 10.1016/j.ipm.2025.104062. Online publication date: May-2025
    • (2025) Knowledge graph representation learning: A comprehensive and experimental overview. Computer Science Review 56 (100716). DOI: 10.1016/j.cosrev.2024.100716. Online publication date: May-2025
    • (2024) Deep knowledge tracing method based on enhancing knowledge graph embedding. Proceedings of the 2024 7th International Conference on Computer Information Science and Artificial Intelligence (576-580). DOI: 10.1145/3703187.3703284. Online publication date: 13-Sep-2024
    • (2024) Product Quality Traceability of Cold Chain Logistics Based on Multimodal Knowledge Graph. 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (1-6). DOI: 10.1109/ICNC-FSKD64080.2024.10702298. Online publication date: 27-Jul-2024
    • (2024) Text-Guided Hierarchical Visual Prefix Network for Multimodal Relation Extraction. 2024 4th International Conference on Electronic Information Engineering and Computer Science (EIECS) (1051-1055). DOI: 10.1109/EIECS63941.2024.10800016. Online publication date: 27-Sep-2024
    • (2024) TriMod Fusion for Multimodal Named Entity Recognition in Social Media. 2024 34th International Conference on Collaborative Advances in Software and COmputiNg (CASCON) (1-9). DOI: 10.1109/CASCON62161.2024.10837944. Online publication date: 11-Nov-2024
