DOI: 10.1145/3394171.3413650
research-article

Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts

Published: 12 October 2020

Abstract

Visual contexts often help to recognize named entities more precisely in short texts such as tweets or Snapchat captions. For example, one can identify "Charlie" as the name of a dog from the user's post. Previous work on multimodal named entity recognition ignores the correspondence between visual objects and entities. We treat visual objects as fine-grained image representations: for a sentence with multiple entity types, the objects in the associated image can be used to capture information about the different entities. In this paper, we propose a neural network that combines object-level image information and character-level text information to predict entities. Vision and language are bridged by leveraging object labels as embeddings, and a dense co-attention mechanism is introduced for fine-grained interactions. Experimental results on a Twitter dataset demonstrate that our method outperforms state-of-the-art methods.
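
To make the fusion idea above concrete, the short PyTorch sketch below illustrates one way the pieces could fit together: object labels predicted by a detector are embedded in the same space as words, a dense co-attention step lets every text token attend to every object label, and the fused per-token representation is scored against entity tags. This is a minimal sketch under assumed names and dimensions, not the authors' implementation; in particular it omits the character-level text features and the decoding details of the actual model.

import torch
import torch.nn as nn

class DenseCoAttentionFusion(nn.Module):
    """Illustrative fusion of token features with embedded object labels via co-attention."""
    def __init__(self, vocab_size, num_object_labels, num_tags, dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)         # text token embeddings
        self.obj_emb = nn.Embedding(num_object_labels, dim)   # detector label embeddings ("dog", "person", ...)
        self.text_enc = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.scale = dim ** 0.5
        self.tagger = nn.Linear(2 * dim, num_tags)            # fused features -> per-token tag scores

    def forward(self, token_ids, object_label_ids):
        # token_ids: (B, T) word indices; object_label_ids: (B, K) detected label indices
        h, _ = self.text_enc(self.word_emb(token_ids))        # (B, T, dim) contextual token features
        v = self.obj_emb(object_label_ids)                    # (B, K, dim) embedded object labels
        # Dense co-attention: every token scores every object label.
        att = torch.softmax(h @ v.transpose(1, 2) / self.scale, dim=-1)  # (B, T, K)
        obj_context = att @ v                                 # (B, T, dim) object evidence per token
        fused = torch.cat([h, obj_context], dim=-1)           # (B, T, 2*dim)
        return self.tagger(fused)                             # (B, T, num_tags)

# Toy usage: 2 sentences of 5 tokens each, with 3 detected object labels per image.
model = DenseCoAttentionFusion(vocab_size=1000, num_object_labels=80, num_tags=9)
tokens = torch.randint(0, 1000, (2, 5))
objects = torch.randint(0, 80, (2, 3))
print(model(tokens, objects).shape)  # torch.Size([2, 5, 9])

Embedding detector labels rather than raw pixels gives the textual and visual inputs a shared representation space, which matches the abstract's "object labels as embeddings" bridging step; per the abstract, the actual model additionally uses character-level text information rather than the toy word-level tagger shown here.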

Supplementary Material

MP4 File (3394171.3413650.mp4)
This video contains the presentation for the paper "Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts". In this paper, we propose a novel object-aware neural model that combines visual and textual representations to predict named entities in social media posts.

Information

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fine-grained image representations
  2. modality gap
  3. multimodal named entity recognition

Qualifiers

  • Research-article

Funding Sources

  • Fundamental Research Funds for the Central Universities, SCUT
  • Science and Technology Programs of Guangzhou
  • the Science and Technology Planning Project of Guangdong Province

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (Last 12 months): 155
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 20 Jan 2025

Cited By

  • (2025) A Fine-Grained Network for Joint Multimodal Entity-Relation Extraction. IEEE Transactions on Knowledge and Data Engineering 37(1), 1-14. DOI: 10.1109/TKDE.2024.3485107. Online publication date: Jan-2025.
  • (2025) CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition. Neurocomputing 614, 128792. DOI: 10.1016/j.neucom.2024.128792. Online publication date: Jan-2025.
  • (2025) A multimodal approach for few-shot biomedical named entity recognition in low-resource languages. Journal of Biomedical Informatics 161, 104754. DOI: 10.1016/j.jbi.2024.104754. Online publication date: Jan-2025.
  • (2025) A knowledge-enhanced network for joint multimodal entity-relation extraction. Information Processing & Management 62(3), 104033. DOI: 10.1016/j.ipm.2024.104033. Online publication date: May-2025.
  • (2025) The more quality information the better: Hierarchical generation of multi-evidence alignment and fusion model for multimodal entity and relation extraction. Information Processing & Management 62(1), 103875. DOI: 10.1016/j.ipm.2024.103875. Online publication date: Jan-2025.
  • (2024) RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction. PeerJ Computer Science 10, e1856. DOI: 10.7717/peerj-cs.1856. Online publication date: 9-Feb-2024.
  • (2024) Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Applied Sciences 14(6), 2333. DOI: 10.3390/app14062333. Online publication date: 10-Mar-2024.
  • (2024) Joint multimodal aspect sentiment analysis with aspect enhancement and syntactic adaptive learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 6678-6686. DOI: 10.24963/ijcai.2024/738. Online publication date: 3-Aug-2024.
  • (2024) Dual Contrastive Learning for Cross-Domain Named Entity Recognition. ACM Transactions on Information Systems 42(6), 1-33. DOI: 10.1145/3678879. Online publication date: 18-Oct-2024.
  • (2024) Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 7336-7345. DOI: 10.1145/3664647.3681598. Online publication date: 28-Oct-2024.
