skip to main content
10.1145/3503161.3548427acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
research-article

Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition

Published: 10 October 2022 Publication History

Abstract

Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that only obtain implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and simulate text-image/inner-text interactions respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017, and yields competitive results on Twitter2015.

Supplementary Material

MP4 File (MM22-fp3216.mp4)
There is a presentation video of the paper "Query Prior Matters: A MRC Framework for Multimodal Named Entity Recognition". In this video we will present the contributions, framework, algorithm details, experimental results and analysis of our paper, and conduct further qualitative analysis with two specific cases. We hope audiences can understand our work as soon as possible and our work can be discovered by more people, via this video.

References

[1]
Omer Arshad, Ignazio Gallo, Shah Nawaz, and Alessandro Calefati. 2019. Aiding intra-text representations with visual context for multimodal named entity recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR). 337--342.
[2]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[3]
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL). 1870--1879.
[4]
Dawei Chen, Zhixu Li, Binbin Gu, and Zhigang Chen. 2021. Multimodal Named Entity Recognition with Image Attributes and Image Knowledge. In Database Systems for Advanced Applications - 26th International Conference (DASFAA). 186--201.
[5]
Shaowei Chen, Yu Wang, Jie Liu, and Yuelin Wang. 2021. Bidirectional machine reading comprehension for aspect sentiment triplet extraction. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI). 12666--12674.
[6]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 4171--4186.
[7]
Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. 2020. Camouflaged object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 2777--2787.
[8]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 770--778.
[9]
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
[10]
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). 260--270.
[11]
Fei Li, Zheng Wang, Siu Cheung Hui, Lejian Liao, Dandan Song, and Jing Xu. 2021. Effective named entity recognition with boundary-aware bidirectional neural networks. In Proceedings of the Web Conference 2021 (WWW). 1695--1703.
[12]
Fei Li, Zheng Wang, Siu Cheung Hui, Lejian Liao, Dandan Song, Jing Xu, Guoxiu He, and Meihuizi Jia. 2021. Modularized Interaction Network for Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL/IJCNLP). 200--209.
[13]
Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). 5849--5859.
[14]
Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL). 1340--1350.
[15]
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 2117--2125.
[16]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[17]
Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL). 1990--1999.
[18]
Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bidirectional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).
[19]
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730 (2018).
[20]
Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 852--860.
[21]
Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Atsushi Otsuka, Itsumi Saito, Hisako Asano, and Junji Tomita. 2019. Answering while summarizing: Multi-task learning for multi-hop QA with evidence extraction. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL). 2335--2345.
[22]
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational (NAACL). 2227--2237.
[23]
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-tophrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision (ICCV). 2641--2649.
[24]
Libo Qin, Tailu Liu, Wanxiang Che, Bingbing Kang, Sendong Zhao, and Ting Liu. 2021. A co-interactive transformer for joint slot filling and intent detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 8193--8197.
[25]
Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically fused graph network for multi-hop reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 6140--6150.
[26]
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 779--788.
[27]
Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[28]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In AAdvances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015 (NIPS). 91--99.
[29]
Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP). 1524--1534.
[30]
Erik F Sang and Jorn Veenstra. 1999. Representing text chunks. In 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL). 173--179.
[31]
Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9073--9080.
[32]
Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. 2013. Selective search for object recognition. International journal of computer vision 104, 2 (2013), 154--171.
[33]
Bo Xu, Shizhou Huang, Chaofeng Sha, and HongyaWang. 2022. MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM). 1215--1223.
[34]
Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, (COLING). 3879--3889.
[35]
Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4683--4693.
[36]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2369--2380.
[37]
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In Transactions of the Association for Computational Linguistics (TACL). 67--78.
[38]
Jianfei Yu, Jing Jiang, Li Yang, and Rui Xia. 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). 3342--3352.
[39]
Xuehui Yu, Pengfei Chen, Di Wu, Najmul Hassan, Guorong Li, Junchi Yan, Humphrey Shi, Qixiang Ye, and Zhenjun Han. 2022. Object Localization under Single Coarse Point Supervision. arXiv preprint arXiv:2203.09338 (2022).
[40]
Dong Zhang, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. 2021. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 14347--14355.
[41]
Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 5674--5681.
[42]
C Lawrence Zitnick and Piotr Dollár. 2014. Edge boxes: Locating object proposals from edges. In European conference on computer vision (ECCV). 391--405.

Cited By

View all
  • (2025)FeaMix: Feature Mix With Memory Batch Based on Self-Consistency Learning for Code Generation and Code TranslationIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2024.33955319:1(192-201)Online publication date: Feb-2025
  • (2024)RSRNeT: a novel multi-modal network framework for named entity recognition and relation extractionPeerJ Computer Science10.7717/peerj-cs.185610(e1856)Online publication date: 9-Feb-2024
  • (2024)Machine Learning Algorithms for Fostering Innovative Education for University StudentsElectronics10.3390/electronics1308150613:8(1506)Online publication date: 16-Apr-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. machine reading comprehension
  2. multimodal named entity recognition
  3. transfer learning
  4. visual grounding

Qualifiers

  • Research-article

Funding Sources

  • National Key R\&D Program of China

Conference

MM '22
Sponsor:

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)134
  • Downloads (Last 6 weeks)6
Reflects downloads up to 25 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)FeaMix: Feature Mix With Memory Batch Based on Self-Consistency Learning for Code Generation and Code TranslationIEEE Transactions on Emerging Topics in Computational Intelligence10.1109/TETCI.2024.33955319:1(192-201)Online publication date: Feb-2025
  • (2024)RSRNeT: a novel multi-modal network framework for named entity recognition and relation extractionPeerJ Computer Science10.7717/peerj-cs.185610(e1856)Online publication date: 9-Feb-2024
  • (2024)Machine Learning Algorithms for Fostering Innovative Education for University StudentsElectronics10.3390/electronics1308150613:8(1506)Online publication date: 16-Apr-2024
  • (2024)Dual Contrastive Learning for Cross-Domain Named Entity RecognitionACM Transactions on Information Systems10.1145/367887942:6(1-33)Online publication date: 18-Oct-2024
  • (2024)Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity RecognitionProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681598(7336-7345)Online publication date: 28-Oct-2024
  • (2024)Shapley Value-based Contrastive Alignment for Multimodal Information ExtractionProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681367(5270-5279)Online publication date: 28-Oct-2024
  • (2024)VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NERProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658097(469-477)Online publication date: 30-May-2024
  • (2024)Contrastive Pre-training with Multi-level Alignment for Grounded Multimodal Named Entity RecognitionProceedings of the 2024 International Conference on Multimedia Retrieval10.1145/3652583.3658011(795-803)Online publication date: 30-May-2024
  • (2024)ALDF: An Adaptive Logical Decision Framework for Multimodal Named Entity RecognitionProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679706(436-445)Online publication date: 21-Oct-2024
  • (2024)Multimodal Relation Extraction via a Mixture of Hierarchical Visual Context LearnersProceedings of the ACM Web Conference 202410.1145/3589334.3645603(4283-4294)Online publication date: 13-May-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media