Abstract
Multimodal named entity recognition (MNER) aims to exploit both the image and text modalities to identify named entities in free text and classify them into predefined types, such as Person, Location, and Organization. However, most existing MNER methods rely on simple concatenation and attention mechanisms and fail to fully exploit the modal information to capture intra-modal and inter-modal interactions; such naive fusion may bias the predicted entities. In this paper, we propose a novel Multi-level Attention Fusion Network (MAFN) to address this problem. Specifically, we introduce a multi-level attention mechanism that learns intra-modal and inter-modal interactions to obtain a multimodal representation for each word. Furthermore, we introduce a visual filter gate that dynamically controls the contribution of visual features by filtering out words that cannot be aligned with any visual block. Experimental results on two publicly available Twitter datasets demonstrate that our method outperforms state-of-the-art baseline methods.
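The paper's exact architecture is not reproduced here; the following is a minimal sketch of the two ideas named in the abstract, word-to-image cross-modal attention plus a sigmoid visual filter gate, using toy shapes and hypothetical gate parameters (`w_gate`, `b_gate`) that stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each word (query) attends
    over all visual blocks (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_words, n_blocks)
    return softmax(scores, axis=-1) @ values    # (n_words, d)

def gated_fusion(word_feats, visual_feats, w_gate, b_gate):
    """Word-to-image attention followed by a per-word visual gate.

    The sigmoid gate is meant to suppress visual context for words
    that do not align with any visual block.
    """
    visual_ctx = cross_attention(word_feats, visual_feats, visual_feats)
    gate_in = np.concatenate([word_feats, visual_ctx], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))  # (n_words, 1)
    # Residual fusion: visual context is added only as far as the gate allows.
    return word_feats + gate * visual_ctx
```

In the actual model the text and image features would come from pretrained encoders (e.g. BERT for words, ResNet blocks for image regions); this sketch only illustrates the fusion step.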
Data Availability
We use two public datasets, Twitter 2015 and Twitter 2017. The data can be downloaded from: https://github.com/jefferyYu/UMT/tree/master/data.
Funding
This work is supported by a grant from the Social and Science Foundation of Liaoning Province (No. L20BTQ008).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, X., Zhang, Y., Wang, Z. et al. MAFN: multi-level attention fusion network for multimodal named entity recognition. Multimed Tools Appl 83, 45047–45058 (2024). https://doi.org/10.1007/s11042-023-17376-5