Abstract
Multimodal Named Entity Recognition (MNER) is the task of identifying entities with specific semantic types in natural language text while exploiting accompanying image information to improve the accuracy and robustness of entity recognition. Named entity recognition is a fundamental problem in natural language processing, and its multimodal extension is an important application of multimodal learning. This article reviews existing MNER techniques for social media. We first introduce the datasets commonly used for MNER. We then classify existing MNER techniques into five categories: pre-trained models, single-modal representation, multimodal representation, multimodal fusion, and main models. Next, we examine the most representative methods applied to MNER. Finally, we present the challenges MNER faces and discuss future directions for the field.
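To make the fusion idea concrete, the following PyTorch sketch illustrates the cross-modal attention pattern shared by many surveyed systems: token-level text features (e.g., from BERT) attend over regional image features (e.g., from a ResNet feature grid), and the fused representation drives per-token entity classification. This is a minimal illustrative sketch, not any single paper's model; the class name, feature dimensions, region count, and BIO label count are assumptions.

```python
# Minimal sketch of text-queries-image cross-modal fusion for MNER.
# All dimensions and module names are illustrative assumptions, not
# the architecture of any specific surveyed paper.
import torch
import torch.nn as nn

class CrossModalFusionNER(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden=256, num_labels=9):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # e.g. BERT token states
        self.image_proj = nn.Linear(image_dim, hidden)  # e.g. ResNet region features
        # Text tokens act as queries over image regions (cross-modal attention).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)  # BIO-style tag scores

    def forward(self, text_feats, image_feats):
        t = self.text_proj(text_feats)    # (batch, seq_len, hidden)
        v = self.image_proj(image_feats)  # (batch, regions, hidden)
        attended, _ = self.cross_attn(query=t, key=v, value=v)
        fused = torch.cat([t, attended], dim=-1)  # concat text + image-aware view
        return self.classifier(fused)     # (batch, seq_len, num_labels)

# Stand-ins for encoder outputs: one 12-token post, 49 image regions (7x7 grid).
logits = CrossModalFusionNER()(torch.randn(1, 12, 768), torch.randn(1, 49, 2048))
print(logits.shape)  # torch.Size([1, 12, 9])
```

In practice the per-token logits would feed a CRF or softmax decoder over BIO tags; the sketch stops at the fusion step the taxonomy above refers to.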
Acknowledgement
This research was funded by the National Natural Science Foundation of China, grant number 62272163; the Songshan Laboratory Pre-research Project, grant number YYJC012022023; the Henan Province Science Foundation, grant numbers 232300420150 and 222300420230; the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness, grant number HNTS2022005; the Henan Province Science and Technology Department Foundation, grant number 222102210027; the Science and Technology Plan Projects of the State Administration for Market Regulation, grant number 2021MK067; and the Undergraduate Universities Smart Teaching Special Research Project of Henan Province, grant number 489–29.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Qian, S., Jin, W., Chen, Y., Ma, J., Qiao, Y., Lu, J. (2023). A Survey on Multimodal Named Entity Recognition. In: Huang, D.S., Premaratne, P., Jin, B., Qu, B., Jo, K.H., Hussain, A. (eds.) Advanced Intelligent Computing Technology and Applications. ICIC 2023. Lecture Notes in Computer Science, vol. 14089. Springer, Singapore. https://doi.org/10.1007/978-981-99-4752-2_50
DOI: https://doi.org/10.1007/978-981-99-4752-2_50
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4751-5
Online ISBN: 978-981-99-4752-2