
MAFN: multi-level attention fusion network for multimodal named entity recognition

Multimedia Tools and Applications

Abstract

Multimodal named entity recognition (MNER) aims to exploit both image and text modalities to identify named entities in free text and classify them into predefined types, such as Person, Location, and Organization. However, most existing MNER methods adopt simple concatenation and attention mechanisms and fail to fully exploit the modality information to capture intra-modal and inter-modal interactions. Such naive fusion may introduce bias into the predicted named entities. In this paper, we propose a novel Multi-level Attention Fusion Network (MAFN) to address this problem. Specifically, we introduce a multi-level attention mechanism that learns intra-modal and inter-modal interactions to obtain a multimodal representation for each word. Furthermore, we introduce a visual filter gate that dynamically controls the contribution of visual features, suppressing them for words that cannot be aligned with any visual block. Experimental results on two publicly available Twitter datasets demonstrate that our method outperforms other state-of-the-art baseline methods.
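The two mechanisms the abstract names — cross-modal attention from words to visual blocks, and a gate that damps the visual signal for words with no aligned block — can be sketched as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation; the gate parameters `Wg` and `bg` are hypothetical learned weights introduced here for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, image):
    """Each word attends over visual blocks.

    text:  (T, d) word features; image: (R, d) visual block features.
    Returns (T, d) image-aware context, one vector per word.
    """
    d = text.shape[-1]
    scores = text @ image.T / np.sqrt(d)  # (T, R) word-to-block affinities
    attn = softmax(scores, axis=-1)       # each row sums to 1
    return attn @ image                   # (T, d)

def visual_filter_gate(text, visual_ctx, Wg, bg):
    """Per-word sigmoid gate scaling the visual contribution.

    For a word whose features do not align with any visual block, the
    gate can drive its visual contribution toward zero.
    """
    z = np.concatenate([text, visual_ctx], axis=-1) @ Wg + bg  # (T, d)
    g = 1.0 / (1.0 + np.exp(-z))                               # in (0, 1)
    return text + g * visual_ctx
```

In the full model each word's representation would come from a pretrained text encoder and the visual blocks from a CNN over image regions; the sketch only shows how the attention and gating compose.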


Data Availability

We use two public datasets, Twitter 2015 and Twitter 2017. The data are downloaded from: https://github.com/jefferyYu/UMT/tree/master/data.


Funding

This work is supported by a grant from the Social Science Foundation of Liaoning Province (No. L20BTQ008).

Author information


Corresponding author

Correspondence to Yijia Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, X., Zhang, Y., Wang, Z. et al. MAFN: multi-level attention fusion network for multimodal named entity recognition. Multimed Tools Appl 83, 45047–45058 (2024). https://doi.org/10.1007/s11042-023-17376-5
