
MMpedia: A Large-Scale Multi-modal Knowledge Graph

  • Conference paper in The Semantic Web – ISWC 2023 (ISWC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14266)


Abstract

Knowledge graphs serve as crucial resources for various applications. However, most existing knowledge graphs present symbolic knowledge only in the form of natural language and lack other modalities of information, e.g., images. Previous multi-modal knowledge graphs have struggled with scalability and image quality. This paper therefore proposes a highly scalable, high-quality multi-modal knowledge graph built with a novel pipeline method. Specifically, we first retrieve images from a search engine and build a new Recurrent Gate Multi-modal model to filter out non-visual entities. We then use the textual and type information of the remaining entities to remove their noisy images. Through this method, we construct a large-scale multi-modal knowledge graph named MMpedia, containing 2,661,941 entity nodes and 19,489,074 images. To the best of our knowledge, MMpedia has the largest collection of images among existing multi-modal knowledge graphs. Furthermore, we employ human evaluation and downstream tasks to verify the usefulness of the images in MMpedia. Experimental results show that both a state-of-the-art method and a multi-modal large language model (e.g., VisualChatGPT) achieve about a 4% improvement on Hit@1 in the entity prediction task when incorporating our collected images. We also find that multi-modal large language models struggle to ground entities to images. The dataset (https://zenodo.org/record/7816711) and the source code of this paper are available at https://github.com/Delicate2000/MMpedia.
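To make the noisy-image filtering idea in the abstract concrete, the following is a minimal, hypothetical Python sketch that scores candidate images against an entity's textual description with an off-the-shelf CLIP model and keeps only those above a similarity threshold. It is an illustration of the general technique only, not the authors' Recurrent Gate Multi-modal model or released pipeline; the checkpoint name, threshold value, and helper function are assumptions.

    # Hypothetical sketch: filter candidate images by CLIP text-image similarity.
    # Not the authors' code; checkpoint, threshold and function name are assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def filter_noisy_images(entity_description, image_paths, threshold=0.25):
        """Keep images whose CLIP similarity to the entity text passes the threshold."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[entity_description], images=images,
                           return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Cosine similarity between each image embedding and the single text embedding.
        img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
        txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
        scores = (img @ txt.T).squeeze(-1)  # shape: (num_images,)
        return [p for p, s in zip(image_paths, scores.tolist()) if s >= threshold]

    # Example usage (hypothetical paths):
    # kept = filter_noisy_images("Berlin is the capital of Germany.", ["img1.jpg", "img2.jpg"])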


Notes

  1. http://commons.wikimedia.org
  2. https://dbpedia.org/sparql
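Footnote 2 points to the public DBpedia SPARQL endpoint, which is one way to obtain the textual (abstract) and type information the pipeline uses for entities. As an illustrative sketch only (not the authors' code), the snippet below uses the SPARQLWrapper library to fetch an entity's rdf:type values and English abstract; the example entity dbr:Berlin is arbitrary.

    # Hypothetical sketch: query the DBpedia endpoint for an entity's types and abstract.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        SELECT ?type ?abstract WHERE {
            dbr:Berlin rdf:type ?type ;
                       dbo:abstract ?abstract .
            FILTER (lang(?abstract) = "en")
        } LIMIT 10
    """)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["type"]["value"], "|", row["abstract"]["value"][:60])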


Acknowledgements

This work was supported by the Shanghai Municipal Special Fund for Promoting High-quality Development of Industries (2021-GZL-RGZN-01018) and the Shanghai Sailing Program (23YF1409400).

Author information

Corresponding author: Tong Ruan.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, Y. et al. (2023). MMpedia: A Large-Scale Multi-modal Knowledge Graph. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14266. Springer, Cham. https://doi.org/10.1007/978-3-031-47243-5_2


  • DOI: https://doi.org/10.1007/978-3-031-47243-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47242-8

  • Online ISBN: 978-3-031-47243-5

  • eBook Packages: Computer Science, Computer Science (R0)
