Abstract
Multimodal Named Entity Recognition (MNER) is the task of identifying entities with specific semantic types in natural language text while exploiting accompanying image information to improve the accuracy and robustness of entity recognition. Named entity recognition is a fundamental problem in natural language processing, and its multimodal extension is an important application of multimodal learning. This article reviews existing MNER techniques for social media. We first introduce the datasets commonly used for MNER. We then classify existing MNER techniques into five categories: pre-trained models, single-modal representation, multimodal representation, multimodal fusion, and main models. Next, we examine the most representative methods applied to MNER. Finally, we present the challenges MNER faces and discuss future directions for the field.
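To make the fusion idea concrete, the following PyTorch sketch illustrates the cross-modal attention pattern shared by many surveyed systems: token-level text features (e.g., from BERT) attend over regional image features (e.g., from a ResNet feature grid), and the fused representation drives per-token entity classification. This is a minimal illustrative sketch, not any single paper's model; the class name, feature dimensions, region count, and BIO label count are assumptions.

```python
# Minimal sketch of text-queries-image cross-modal fusion for MNER.
# All dimensions and module names are illustrative assumptions, not
# the architecture of any specific surveyed paper.
import torch
import torch.nn as nn

class CrossModalFusionNER(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden=256, num_labels=9):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # e.g. BERT token states
        self.image_proj = nn.Linear(image_dim, hidden)  # e.g. ResNet region features
        # Text tokens act as queries over image regions (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)  # BIO-style tag scores

    def forward(self, text_feats, image_feats):
        t = self.text_proj(text_feats)    # (batch, seq_len, hidden)
        v = self.image_proj(image_feats)  # (batch, regions, hidden)
        attended, _ = self.cross_attn(query=t, key=v, value=v)
        fused = torch.cat([t, attended], dim=-1)  # concat text + image-aware view
        return self.classifier(fused)     # (batch, seq_len, num_labels)

# Stand-ins for encoder outputs: one 12-token post, 49 image regions (7x7 grid).
logits = CrossModalFusionNER()(torch.randn(1, 12, 768), torch.randn(1, 49, 2048))
print(logits.shape)  # torch.Size([1, 12, 9])
```

In practice the per-token logits would feed a CRF or softmax decoder over BIO tags; the sketch stops at the fusion step the taxonomy above refers to.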
Acknowledgement
This research was funded by the National Natural Science Foundation of China, grant number 62272163; the Songshan Laboratory Pre-research Project, grant number YYJC012022023; the Henan Province Science Foundation, grant numbers 232300420150 and 222300420230; the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness, grant number HNTS2022005; the Henan Province Science and Technology Department Foundation, grant number 222102210027; the Science and Technology Plan Projects of the State Administration for Market Regulation, grant number 2021MK067; and the Undergraduate Universities Smart Teaching Special Research Project of Henan Province, grant number 489–29.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qian, S., Jin, W., Chen, Y., Ma, J., Qiao, Y., Lu, J. (2023). A Survey on Multimodal Named Entity Recognition. In: Huang, D.S., Premaratne, P., Jin, B., Qu, B., Jo, K.H., Hussain, A. (eds.) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol. 14089. Springer, Singapore. https://doi.org/10.1007/978-981-99-4752-2_50
DOI: https://doi.org/10.1007/978-981-99-4752-2_50
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4751-5
Online ISBN: 978-981-99-4752-2