Abstract
Multimodal named entity recognition (MNER) aims to exploit both the image and text modalities to identify named entities in free text and classify them into predefined types, such as Person, Location, and Organization. However, most existing MNER methods rely on simple concatenation and attention mechanisms and fail to fully exploit the modal information to capture intra-modal and inter-modal interactions; such naive fusion may bias the predicted entities. In this paper, we propose a novel Multi-level Attention Fusion Network (MAFN) to address this problem. Specifically, we introduce a multi-level attention mechanism that learns intra-modal and inter-modal interactions to obtain a multimodal representation for each word. Furthermore, we introduce a visual filter gate that dynamically controls the contribution of visual features by filtering out words that cannot be aligned with any visual block. Experimental results on two publicly available Twitter datasets demonstrate that our method outperforms state-of-the-art baseline methods.
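The paper's exact architecture is not reproduced here; the following is a minimal sketch of the two ideas named in the abstract, word-to-image cross-modal attention plus a sigmoid visual filter gate, using toy shapes and hypothetical gate parameters (`w_gate`, `b_gate`) that stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each word (query) attends
    over all visual blocks (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_words, n_blocks)
    return softmax(scores, axis=-1) @ values    # (n_words, d)

def gated_fusion(word_feats, visual_feats, w_gate, b_gate):
    """Word-to-image attention followed by a per-word visual gate.

    The sigmoid gate is meant to suppress visual context for words
    that do not align with any visual block.
    """
    visual_ctx = cross_attention(word_feats, visual_feats, visual_feats)
    gate_in = np.concatenate([word_feats, visual_ctx], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))  # (n_words, 1)
    # Residual fusion: visual context is added only as far as the gate allows.
    return word_feats + gate * visual_ctx
```

In the actual model the text and image features would come from pretrained encoders (e.g. BERT for words, ResNet blocks for image regions); this sketch only illustrates the fusion step.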
Data Availability
We use two public datasets, Twitter 2015 and Twitter 2017. The data can be downloaded from: https://github.com/jefferyYu/UMT/tree/master/data.
Funding
This work is supported by a grant from the Social and Science Foundation of Liaoning Province (No. L20BTQ008).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, X., Zhang, Y., Wang, Z. et al. MAFN: multi-level attention fusion network for multimodal named entity recognition. Multimed Tools Appl 83, 45047–45058 (2024). https://doi.org/10.1007/s11042-023-17376-5