
A Survey on Multimodal Named Entity Recognition

  • Conference paper

Advanced Intelligent Computing Technology and Applications (ICIC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14089)

Abstract

Multimodal Named Entity Recognition (MNER) is the task of identifying entities of specific semantic types in natural language text while using accompanying image information to improve the accuracy and robustness of entity recognition. Named entity recognition is a fundamental problem in natural language processing, and its multimodal variant is an important application of multimodal learning. This article reviews existing multimodal named entity recognition techniques for social media. We first introduce the datasets commonly used for MNER. Then, we classify existing MNER techniques into five categories: pre-trained models, single-modal representation, multimodal representation, multimodal fusion, and main models. Next, we examine the most representative methods applied in MNER. Finally, we present the challenges faced by MNER and discuss future directions for this field.
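To make the task formulation concrete, the following is a minimal, dependency-free Python sketch of how an MNER instance is commonly represented: a tokenized social-media post paired with pooled image features, with entities marked as BIO-tagged spans. The example post, tag set, and the `MNERExample` container are illustrative assumptions added here, not the implementation of any surveyed method.

```python
# Sketch of the MNER task formulation: a post is a token sequence plus an
# image, and the goal is to label entity spans (e.g. PER, LOC, ORG, MISC)
# using both modalities. All names and data below are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MNERExample:
    tokens: List[str]                        # tokenized post text
    image_features: List[float]              # pooled visual features (e.g. from a CNN/ViT)
    bio_tags: List[str] = field(default_factory=list)  # gold or predicted BIO tags


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Convert BIO tags into (entity_text, entity_type) spans."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans


if __name__ == "__main__":
    example = MNERExample(
        tokens=["Messi", "arrives", "in", "Miami"],
        image_features=[0.12, 0.87, 0.05],   # stand-in for real visual features
        bio_tags=["B-PER", "O", "O", "B-LOC"],
    )
    print(bio_to_spans(example.tokens, example.bio_tags))
    # -> [('Messi', 'PER'), ('Miami', 'LOC')]
```

In the surveyed systems, the image features would come from a visual encoder and be fused with token representations before tagging; the sketch only fixes the input/output contract of the task.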


Acknowledgement

This research was funded by the National Natural Science Foundation of China, grant number 62272163; the Songshan Laboratory Pre-research Project, grant number YYJC012022023; the Henan Province Science Foundation, grant numbers 232300420150 and 222300420230; the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness, grant number HNTS2022005; the Henan Province Science and Technology Department Foundation, grant number 222102210027; the Science and Technology Plan Projects of the State Administration for Market Regulation, grant number 2021MK067; and the Undergraduate Universities Smart Teaching Special Research Project of Henan Province, grant number 489–29.

Author information

Corresponding author

Correspondence to Jiangtao Ma.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qian, S., Jin, W., Chen, Y., Ma, J., Qiao, Y., Lu, J. (2023). A Survey on Multimodal Named Entity Recognition. In: Huang, DS., Premaratne, P., Jin, B., Qu, B., Jo, KH., Hussain, A. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol 14089. Springer, Singapore. https://doi.org/10.1007/978-981-99-4752-2_50

  • DOI: https://doi.org/10.1007/978-981-99-4752-2_50

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-4751-5

  • Online ISBN: 978-981-99-4752-2

  • eBook Packages: Computer Science, Computer Science (R0)
