DOI: 10.1145/3488560.3498475
research-article

MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition

Published: 15 February 2022

Abstract

In this paper, we study multimodal named entity recognition (MNER) in social media posts. Existing works mainly use a cross-modal attention mechanism to combine text representations with image representations. However, they still suffer from two weaknesses: (1) current methods rest on the strong assumption that each text and its accompanying image are matched, so that the image can help identify named entities in the text. This assumption does not always hold in real scenarios, and relying on it may reduce the recognition performance of the MNER model; (2) current methods fail to construct a consistent representation that bridges the semantic gap between the two modalities, which prevents the model from establishing a good connection between text and image. To address these issues, we propose a general matching and alignment framework (MAF) for multimodal named entity recognition in social media posts. Specifically, to solve the first issue, we propose a novel cross-modal matching (CM) module that calculates the similarity score between text and image and uses the score to determine the proportion of visual information that should be retained. To solve the second issue, we propose a novel cross-modal alignment (CA) module to make the representations of the two modalities more consistent. We conduct extensive experiments, ablation studies, and case studies to demonstrate the effectiveness and efficiency of our method. The source code of this paper can be found at https://github.com/xubodhu/MAF.
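
The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation (that lives in the repository linked above). It assumes that the matching score is a sigmoid of a scaled cosine similarity and that alignment uses an InfoNCE-style contrastive loss, which the author tag "contrastive learning" suggests but the abstract does not spell out. The names `gate_visual`, `info_nce`, `scale`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_visual(text_vec, image_vec, visual_feats, scale=5.0):
    """Matching idea (assumed form): map text-image similarity to a score
    in (0, 1) and keep that proportion of every visual feature."""
    score = sigmoid(scale * cosine(text_vec, image_vec))
    gated = [[score * x for x in row] for row in visual_feats]
    return score, gated


def info_nce(text_vecs, image_vecs, temperature=0.1):
    """Alignment idea (assumed form): text-to-image InfoNCE loss.
    Pair i is the positive; every other image in the batch is a negative."""
    n = len(text_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_vecs[i], image_vecs[j]) / temperature
                  for j in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    text = [1.0, 0.0]
    region_feats = [[2.0, 4.0]]
    matched, _ = gate_visual(text, [1.0, 0.0], region_feats)
    mismatched, _ = gate_visual(text, [-1.0, 0.0], region_feats)
    print(matched > mismatched)  # a matched image keeps more visual signal
```

Under these assumptions, a post whose image is unrelated to its text contributes almost no visual signal to the tagger, while the contrastive term rewards paired text and image vectors for lying closer together than unpaired ones.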

Supplementary Material

MP4 File (WSDM-fp530.mp4)
Presentation video

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
February 2022
1690 pages
ISBN:9781450391320
DOI:10.1145/3488560
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 February 2022

Author Tags

  1. contrastive learning
  2. multimodal named entity recognition

Qualifiers

  • Research-article

Conference

WSDM '22

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 213
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 13 Feb 2025

Cited By

  • (2025) CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition. Neurocomputing 614 (128792). DOI: 10.1016/j.neucom.2024.128792. Online publication date: Jan-2025
  • (2025) A multimodal approach for few-shot biomedical named entity recognition in low-resource languages. Journal of Biomedical Informatics 161 (104754). DOI: 10.1016/j.jbi.2024.104754. Online publication date: Jan-2025
  • (2025) A knowledge-enhanced network for joint multimodal entity-relation extraction. Information Processing & Management 62:3 (104033). DOI: 10.1016/j.ipm.2024.104033. Online publication date: May-2025
  • (2025) A Multi-expert Collaborative Framework for Multimodal Named Entity Recognition. MultiMedia Modeling (30-43). DOI: 10.1007/978-981-96-2054-8_3. Online publication date: 3-Jan-2025
  • (2024) RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction. PeerJ Computer Science 10 (e1856). DOI: 10.7717/peerj-cs.1856. Online publication date: 9-Feb-2024
  • (2024) A Semantic Enhancement Framework for Multimodal Sarcasm Detection. Mathematics 12:2 (317). DOI: 10.3390/math12020317. Online publication date: 18-Jan-2024
  • (2024) A Survey of Text-Matching Techniques. Information 15:6 (332). DOI: 10.3390/info15060332. Online publication date: 5-Jun-2024
  • (2024) Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Applied Sciences 14:6 (2333). DOI: 10.3390/app14062333. Online publication date: 10-Mar-2024
  • (2024) Dual Contrastive Learning for Cross-Domain Named Entity Recognition. ACM Transactions on Information Systems 42:6 (1-33). DOI: 10.1145/3678879. Online publication date: 18-Oct-2024
  • (2024) Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition. Proceedings of the 32nd ACM International Conference on Multimedia (7336-7345). DOI: 10.1145/3664647.3681598. Online publication date: 28-Oct-2024
