DOI: 10.1145/3394171.3413650
research-article

Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts

Published: 12 October 2020

Abstract

Visual contexts often help to recognize named entities more precisely in short texts such as tweets or Snapchat captions. For example, one can identify "Charlie" as the name of a dog from the user's post. Previous work on multimodal named entity recognition ignores the correspondence between visual objects and entities. We treat visual objects as fine-grained image representations: for a sentence with multiple entity types, the objects in the associated image can be used to capture information about the different entities. In this paper, we propose a neural network that combines object-level image information and character-level text information to predict entities. Vision and language are bridged by leveraging object labels as embeddings, and a dense co-attention mechanism is introduced for fine-grained interactions. Experimental results on a Twitter dataset demonstrate that our method outperforms state-of-the-art methods.
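
To make the fusion idea above concrete, the short PyTorch sketch below illustrates one way the pieces could fit together: object labels predicted by a detector are embedded in the same space as words, a dense co-attention step lets every text token attend to every object label, and the fused per-token representation is scored against entity tags. This is a minimal sketch under assumed names and dimensions, not the authors' implementation; in particular it omits the character-level text features and the decoding details of the actual model.

import torch
import torch.nn as nn

class DenseCoAttentionFusion(nn.Module):
    """Illustrative fusion of token features with embedded object labels via co-attention."""
    def __init__(self, vocab_size, num_object_labels, num_tags, dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)         # text token embeddings
        self.obj_emb = nn.Embedding(num_object_labels, dim)   # detector label embeddings ("dog", "person", ...)
        self.text_enc = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.scale = dim ** 0.5
        self.tagger = nn.Linear(2 * dim, num_tags)            # fused features -> per-token tag scores

    def forward(self, token_ids, object_label_ids):
        # token_ids: (B, T) word indices; object_label_ids: (B, K) detected label indices
        h, _ = self.text_enc(self.word_emb(token_ids))        # (B, T, dim) contextual token features
        v = self.obj_emb(object_label_ids)                    # (B, K, dim) embedded object labels
        # Dense co-attention: every token scores every object label.
        att = torch.softmax(h @ v.transpose(1, 2) / self.scale, dim=-1)  # (B, T, K)
        obj_context = att @ v                                 # (B, T, dim) object evidence per token
        fused = torch.cat([h, obj_context], dim=-1)           # (B, T, 2*dim)
        return self.tagger(fused)                             # (B, T, num_tags)

# Toy usage: 2 sentences of 5 tokens each, with 3 detected object labels per image.
model = DenseCoAttentionFusion(vocab_size=1000, num_object_labels=80, num_tags=9)
tokens = torch.randint(0, 1000, (2, 5))
objects = torch.randint(0, 80, (2, 3))
print(model(tokens, objects).shape)  # torch.Size([2, 5, 9])

Embedding detector labels rather than raw pixels gives the textual and visual inputs a shared representation space, which matches the abstract's "object labels as embeddings" bridging step; per the abstract, the actual model additionally uses character-level text information rather than the toy word-level tagger shown here.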

Supplementary Material

MP4 File (3394171.3413650.mp4)
This video contains the presentation for the paper "Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts". In this paper, we propose a novel object-aware neural model that combines visual and textual representations to predict named entities in social media posts.

Information

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fine-grained image representations
  2. modality gap
  3. multimodal named entity recognition

Qualifiers

  • Research-article

Funding Sources

  • Fundamental Research Funds for the Central Universities, SCUT
  • Science and Technology Programs of Guangzhou
  • the Science and Technology Planning Project of Guangdong Province

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (Last 12 months): 155
  • Downloads (Last 6 weeks): 15
Reflects downloads up to 20 Jan 2025

Cited By

  • (2025) A Fine-Grained Network for Joint Multimodal Entity-Relation Extraction. IEEE Transactions on Knowledge and Data Engineering 37(1), 1-14. DOI: 10.1109/TKDE.2024.3485107. Online publication date: Jan-2025.
  • (2025) CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition. Neurocomputing 614, 128792. DOI: 10.1016/j.neucom.2024.128792. Online publication date: Jan-2025.
  • (2025) A multimodal approach for few-shot biomedical named entity recognition in low-resource languages. Journal of Biomedical Informatics 161, 104754. DOI: 10.1016/j.jbi.2024.104754. Online publication date: Jan-2025.
  • (2025) A knowledge-enhanced network for joint multimodal entity-relation extraction. Information Processing & Management 62(3), 104033. DOI: 10.1016/j.ipm.2024.104033. Online publication date: May-2025.
  • (2025) The more quality information the better: Hierarchical generation of multi-evidence alignment and fusion model for multimodal entity and relation extraction. Information Processing & Management 62(1), 103875. DOI: 10.1016/j.ipm.2024.103875. Online publication date: Jan-2025.
  • (2024) RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction. PeerJ Computer Science 10, e1856. DOI: 10.7717/peerj-cs.1856. Online publication date: 9-Feb-2024.
  • (2024) Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Applied Sciences 14(6), 2333. DOI: 10.3390/app14062333. Online publication date: 10-Mar-2024.
  • (2024) Joint multimodal aspect sentiment analysis with aspect enhancement and syntactic adaptive learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 6678-6686. DOI: 10.24963/ijcai.2024/738. Online publication date: 3-Aug-2024.
  • (2024) Dual Contrastive Learning for Cross-Domain Named Entity Recognition. ACM Transactions on Information Systems 42(6), 1-33. DOI: 10.1145/3678879. Online publication date: 18-Oct-2024.
  • (2024) Generative Multimodal Data Augmentation for Low-Resource Multimodal Named Entity Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 7336-7345. DOI: 10.1145/3664647.3681598. Online publication date: 28-Oct-2024.
