DOI: 10.1145/3477495.3531867
research-article

Multimodal Entity Linking with Gated Hierarchical Fusion and Contrastive Training

Published: 07 July 2022

Abstract

Previous entity linking methods for knowledge graphs (KGs) mostly link textual mentions to their corresponding entities. However, they handle multimodal data poorly, particularly when the text is too short to provide enough context. We therefore draw on information from other modalities and propose a novel multimodal entity linking method with gated hierarchical multimodal fusion and contrastive training (GHMFC). First, to discover fine-grained inter-modal correlations, GHMFC extracts hierarchical textual and visual co-attention features through a multimodal co-attention mechanism consisting of textual-guided visual attention and visual-guided textual attention. The former produces weighted visual features under the guidance of textual information; the latter produces weighted textual features under the guidance of visual information. Next, gated fusion evaluates the importance of the hierarchical features of each modality and integrates them into the final multimodal representations of mentions. Contrastive training with two types of contrastive losses is then used to learn more generic multimodal features and reduce noise. Finally, the linked entities are selected by computing the cosine similarity between the representations of mentions and those of entities in the KG. To evaluate the proposed method, this paper releases two new open multimodal entity linking datasets: WikiMEL and Richpedia-MEL. Experimental results demonstrate that GHMFC learns meaningful multimodal representations and significantly outperforms most of the baseline methods.
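
The fusion and linking steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-dimension gate parameterization, the function names (`gated_fusion`, `link_entity`), the toy feature vectors, and the entity identifiers `Q1` and `Q2` are all assumptions made for illustration. The co-attention and contrastive training stages are omitted because the abstract does not specify them in enough detail.

```python
import math


def sigmoid(x):
    """Logistic function used to squash the gate into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_feat, visual_feat, gate_weights, gate_bias):
    """Blend two same-length feature vectors with a per-dimension gate.

    Assumed (hypothetical) parameterization: for dimension i,
    g_i = sigmoid(w_text_i * t_i + w_visual_i * v_i + b_i) and the fused
    value is g_i * t_i + (1 - g_i) * v_i.
    """
    fused = []
    for t, v, (w_t, w_v), b in zip(text_feat, visual_feat, gate_weights, gate_bias):
        g = sigmoid(w_t * t + w_v * v + b)
        fused.append(g * t + (1.0 - g) * v)
    return fused


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_entity(mention_vec, entity_vecs):
    """Return the id of the candidate entity most similar to the mention."""
    return max(
        entity_vecs,
        key=lambda eid: cosine_similarity(mention_vec, entity_vecs[eid]),
    )


# Toy inputs (hypothetical): with zero gate weights and bias the gate is 0.5,
# so the fused vector is the plain average of the two modalities.
text_feat = [1.0, 0.0]
visual_feat = [0.0, 1.0]
fused = gated_fusion(text_feat, visual_feat, [(0.0, 0.0), (0.0, 0.0)], [0.0, 0.0])

# Hypothetical candidate entities with made-up representations.
candidates = {"Q1": [1.0, 1.0], "Q2": [1.0, -1.0]}
best = link_entity(fused, candidates)
print(best)  # prints "Q1"
```

In a trained model the gate parameters would be learned, so the gate could favor the text for some mentions and the image for others; the fixed values here only keep the example easy to follow by hand.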

Supplementary Material

MP4 File (GHMFC_presentation.mp4)
Presentation video of our paper "Multimodal Entity Linking with Gated Hierarchical Fusion and Contrastive Training"

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN:9781450387323
    DOI:10.1145/3477495

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. contrastive training
    2. entity linking
    3. knowledge graph
    4. multimodal fusion

    Qualifiers

    • Research-article

    Funding Sources

    • The 13th Five-Year All-Army Common Information System Equipment Pre-Research Project

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Cited By

    • (2025) Research on application of knowledge graph in industrial control system security situation awareness and decision-making: A survey. Neurocomputing 613 (128721). DOI: 10.1016/j.neucom.2024.128721. Online publication date: Jan-2025
    • (2024) Advancing Arctic Sea Ice Remote Sensing with AI and Deep Learning: Opportunities and Challenges. Remote Sensing 16:20 (3764). DOI: 10.3390/rs16203764. Online publication date: 10-Oct-2024
    • (2024) A dual-way enhanced framework from text matching point of view for multimodal entity linking. Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence (19008-19016). DOI: 10.1609/aaai.v38i17.29867. Online publication date: 20-Feb-2024
    • (2024) Multimodal Recommender Systems: A Survey. ACM Computing Surveys 57:2 (1-17). DOI: 10.1145/3695461. Online publication date: 10-Oct-2024
    • (2024) Bridging Gaps in Content and Knowledge for Multimodal Entity Linking. Proceedings of the 32nd ACM International Conference on Multimedia (9311-9320). DOI: 10.1145/3664647.3681661. Online publication date: 28-Oct-2024
    • (2024) UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (1909-1919). DOI: 10.1145/3627673.3679793. Online publication date: 21-Oct-2024
    • (2024) CYCLE: Cross-Year Contrastive Learning in Entity-Linking. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (3197-3206). DOI: 10.1145/3627673.3679702. Online publication date: 21-Oct-2024
    • (2024) Video Multimodal Entity Linking via Multi-Perspective Enhanced Subgraph Contrastive Network. International Journal of Software Engineering and Knowledge Engineering 34:11 (1757-1781). DOI: 10.1142/S0218194024500360. Online publication date: 30-Aug-2024
    • (2024) TRAFMEL: Multimodal Entity Linking Based on Transformer Reranking and Multimodal Co-Attention Fusion. International Journal of Software Engineering and Knowledge Engineering 34:06 (973-997). DOI: 10.1142/S021819402450013X. Online publication date: 16-May-2024
    • (2024) Automatic generation of system model diagrams driven by multi-source heterogeneous data. Journal of Engineering Design 35:11 (1442-1486). DOI: 10.1080/09544828.2024.2360853. Online publication date: 6-Jul-2024
