
MMpedia: A Large-Scale Multi-modal Knowledge Graph

  • Conference paper in The Semantic Web – ISWC 2023 (ISWC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14266)


Abstract

Knowledge graphs serve as crucial resources for various applications. However, most existing knowledge graphs present symbolic knowledge only in the form of natural language and lack other modalities of information, e.g., images. Previous multi-modal knowledge graphs have struggled with scalability and image quality. This paper therefore proposes a highly scalable, high-quality multi-modal knowledge graph built with a novel pipeline method. Specifically, we first retrieve images from a search engine and build a new Recurrent Gate Multi-modal model to filter out non-visual entities. We then use the textual and type information of the remaining entities to remove their noisy images. Through this method, we construct a large-scale multi-modal knowledge graph named MMpedia, containing 2,661,941 entity nodes and 19,489,074 images. To the best of our knowledge, MMpedia has the largest collection of images among existing multi-modal knowledge graphs. Furthermore, we employ human evaluation and downstream tasks to verify the usefulness of the images in MMpedia. Experimental results show that both a state-of-the-art method and a multi-modal large language model (e.g., VisualChatGPT) achieve about a 4% improvement on Hit@1 in the entity prediction task when incorporating our collected images. We also find that multi-modal large language models struggle to ground entities to images. The dataset (https://zenodo.org/record/7816711) and the source code of this paper are available at https://github.com/Delicate2000/MMpedia.
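To make the noisy-image filtering idea in the abstract concrete, the following is a minimal, hypothetical Python sketch that scores candidate images against an entity's textual description with an off-the-shelf CLIP model and keeps only those above a similarity threshold. It is an illustration of the general technique only, not the authors' Recurrent Gate Multi-modal model or released pipeline; the checkpoint name, threshold value, and helper function are assumptions.

    # Hypothetical sketch: filter candidate images by CLIP text-image similarity.
    # Not the authors' code; checkpoint, threshold and function name are assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def filter_noisy_images(entity_description, image_paths, threshold=0.25):
        """Keep images whose CLIP similarity to the entity text passes the threshold."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[entity_description], images=images,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Cosine similarity between each image embedding and the single text embedding.
        img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        scores = (img @ txt.T).squeeze(-1)  # shape: (num_images,)
        return [p for p, s in zip(image_paths, scores.tolist()) if s >= threshold]

    # Example usage (hypothetical paths):
    # kept = filter_noisy_images("Berlin is the capital of Germany.", ["img1.jpg", "img2.jpg"])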


Notes

  1. http://commons.wikimedia.org
  2. https://dbpedia.org/sparql
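Footnote 2 points to the public DBpedia SPARQL endpoint, which is one way to obtain the textual (abstract) and type information the pipeline uses for entities. As an illustrative sketch only (not the authors' code), the snippet below uses the SPARQLWrapper library to fetch an entity's rdf:type values and English abstract; the example entity dbr:Berlin is arbitrary.

    # Hypothetical sketch: query the DBpedia endpoint for an entity's types and abstract.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT ?type ?abstract WHERE {
            dbr:Berlin rdf:type ?type ;
                       dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        } LIMIT 10
    """)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["type"]["value"], "|", row["abstract"]["value"][:60])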


Acknowledgements

This work was supported by the Shanghai Municipal Special Fund for Promoting High-quality Development of Industries (2021-GZL-RGZN-01018) and the Shanghai Sailing Program (23YF1409400).

Author information

Corresponding author: Tong Ruan.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, Y. et al. (2023). MMpedia: A Large-Scale Multi-modal Knowledge Graph. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14266. Springer, Cham. https://doi.org/10.1007/978-3-031-47243-5_2


  • DOI: https://doi.org/10.1007/978-3-031-47243-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47242-8

  • Online ISBN: 978-3-031-47243-5

  • eBook Packages: Computer Science, Computer Science (R0)
