MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition

Authors:

Hongya Wang

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

Pages 1215 - 1223

https://doi.org/10.1145/3488560.3498475

Published: 15 February 2022

Abstract

In this paper, we study multimodal named entity recognition (MNER) in social media posts. Existing work mainly uses a cross-modal attention mechanism to combine text representations with image representations. However, these methods suffer from two weaknesses: (1) they rest on the strong assumption that each text and its accompanying image are matched, so that the image can help identify named entities in the text. This assumption does not always hold in real scenarios, and relying on it may degrade the recognition performance of the MNER model; (2) they fail to construct a consistent representation that bridges the semantic gap between the two modalities, which prevents the model from establishing a good connection between text and image. To address these issues, we propose a general matching and alignment framework (MAF) for multimodal named entity recognition in social media posts. Specifically, to address the first issue, we propose a novel cross-modal matching (CM) module that calculates the similarity score between text and image and uses the score to determine the proportion of visual information that should be retained. To address the second issue, we propose a novel cross-modal alignment (CA) module that makes the representations of the two modalities more consistent. We conduct extensive experiments, ablation studies, and case studies to demonstrate the effectiveness and efficiency of our method. The source code of this paper can be found at https://github.com/xubodhu/MAF.
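
The two ideas in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (see the repository linked above for that). It assumes fixed-size text and image vectors, uses a rescaled cosine similarity as a stand-in for the learned CM score, and uses an InfoNCE-style contrastive loss as one plausible way to make the two modalities "more consistent", which the abstract does not specify. All function names (`matching_score`, `gate_visual_features`, `contrastive_alignment_loss`) and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_score(text_vec, image_vec):
    """Assumed stand-in for the CM score: cosine similarity rescaled to [0, 1]."""
    return 0.5 * (cosine(text_vec, image_vec) + 1.0)


def gate_visual_features(region_feats, score):
    """Retain a proportion `score` of every visual feature vector."""
    return [[score * x for x in region] for region in region_feats]


def contrastive_alignment_loss(text_vecs, image_vecs, temperature=0.1):
    """InfoNCE-style loss (an assumption): text i should be closest to image i."""
    n = len(text_vecs)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_vecs[i], image_vecs[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate images.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    text = [1.0, 0.0]
    related_image = [[2.0, 0.0], [1.0, 0.0]]    # two region vectors
    unrelated_image = [[0.0, 3.0], [0.0, 1.0]]

    for name, regions in [("related", related_image), ("unrelated", unrelated_image)]:
        pooled = [sum(col) / len(regions) for col in zip(*regions)]
        s = matching_score(text, pooled)
        print(name, "score:", s, "gated:", gate_visual_features(regions, s))
```

In this toy setup a low matching score shrinks the visual features before they are fused with the text, so an unrelated image contributes less; the contrastive loss is lower when each text vector is paired with its own image than when the pairs are shuffled.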

Supplementary Material

Presentation video: WSDM-fp530.mp4 (MP4, 37.43 MB)

Index Terms

MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining

February 2022

1690 pages

ISBN:9781450391320

DOI:10.1145/3488560

General Chairs:
K. Selcuk Candan, Arizona State University, USA
Huan Liu, Arizona State University, USA

Program Chairs:
Leman Akoglu, Carnegie Mellon University, USA
Xin Luna Dong, Meta Platforms, Inc. (former Facebook), USA
Jiliang Tang, Michigan State University, USA

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Fundamental Research Funds for the Central Universities
Informatization Development Special Project of Shanghai Municipal Commission of Economy and Information Technology
National Natural Science Foundation of China
Shanghai Sailing Program

Conference

WSDM '22

WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining

February 21 - 25, 2022

AZ, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
