DOI: 10.1145/3469877.3495642
Research Article

Focusing Attention across Multiple Images for Multimodal Event Detection

Published: 10 January 2022

Abstract

Multimodal social event detection has attracted tremendous research attention in recent years, because it provides a comprehensive and complementary understanding of social events and is important for public security and administration. Most existing works focus on the fusion of multimodal information, especially the fusion of a single image with text. Such single image-text pair processing breaks the correlations between images of the same post and may hurt the accuracy of event detection. In this work, we propose to focus attention across multiple images for multimodal event detection, which is also more suitable for tweets with short text and multiple images. Towards this end, we design a novel Multi-Image Focusing Network (MIFN) to connect textual content with the visual aspects of multiple images. MIFN consists of a feature extractor, a multi-focal network, and an event classifier. The multi-focal network implements focal attention across all the images and fuses the most relevant regions with the text into a multimodal representation. The event classifier then predicts the social event class from the multimodal representation. To evaluate the effectiveness of the proposed approach, we conduct extensive experiments on a commonly used disaster dataset. The experimental results demonstrate that MIFN outperforms all baselines on both the humanitarian event detection task and its hurricane-disaster variant. Ablation studies also show that MIFN filters out irrelevant regions across images, which improves the accuracy of multimodal event detection.
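The abstract describes the MIFN pipeline (feature extractor, multi-focal network, event classifier) only at a high level. The following is a minimal PyTorch sketch of what "focal attention across all the images" could look like: region features gathered from every image of a post are scored against the text, the least relevant regions are masked out, and the surviving regions are fused with the text for event classification. All module names, dimensions, and the top-k filtering rule here are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFocalAttention(nn.Module):
    """Hypothetical sketch of focal attention over regions pooled from
    all images of a post, reconstructed from the abstract alone."""

    def __init__(self, text_dim=768, region_dim=2048, hidden_dim=512,
                 num_classes=8, top_k=16):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)      # project text query
        self.region_proj = nn.Linear(region_dim, hidden_dim)  # project region keys
        self.top_k = top_k                                    # assumed filtering rule
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feat, region_feats):
        # text_feat:    (B, text_dim)      pooled text representation of the post
        # region_feats: (B, N, region_dim) regions gathered from *all* images
        q = self.text_proj(text_feat)                         # (B, H)
        k = self.region_proj(region_feats)                    # (B, N, H)
        scores = torch.einsum('bh,bnh->bn', q, k) / k.size(-1) ** 0.5

        # Focal step (assumed form): keep only the top-k regions most related
        # to the text and mask out the rest, so irrelevant regions across
        # images are filtered before fusion.
        keep = min(self.top_k, scores.size(-1))
        topk_idx = scores.topk(keep, dim=-1).indices
        mask = torch.full_like(scores, float('-inf')).scatter(-1, topk_idx, 0.0)
        attn = F.softmax(scores + mask, dim=-1)               # (B, N)

        fused = torch.einsum('bn,bnh->bh', attn, k)           # attended visual context
        multimodal = torch.cat([q, fused], dim=-1)            # fuse text with vision
        return self.classifier(multimodal)                    # event class logits

# Toy usage: a post with 3 images x 36 regions each -> 108 candidate regions.
model = MultiFocalAttention()
logits = model(torch.randn(2, 768), torch.randn(2, 108, 2048))
```

Pooling regions from all images of a post into one candidate set, rather than attending to each image-text pair separately, is what preserves the cross-image correlations the abstract argues single-pair fusion discards.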


Cited By

  • (2023) MMpedia: A Large-Scale Multi-modal Knowledge Graph. The Semantic Web – ISWC 2023, pp. 18–37. DOI: 10.1007/978-3-031-47243-5_2. Online publication date: 6 November 2023.
  • (2023) Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery. Advanced Data Mining and Applications, pp. 644–659. DOI: 10.1007/978-3-031-46664-9_43. Online publication date: 27 August 2023.


Published In

MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
December 2021, 508 pages
ISBN: 9781450386074
DOI: 10.1145/3469877

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Focal Attention
        2. Multi-focal Network
        3. Multimodal Social Event Detection


        Funding Sources

        • Science and technology development fund of CETC

Conference

MMAsia '21: ACM Multimedia Asia
December 1–3, 2021, Gold Coast, Australia

Acceptance Rates

Overall acceptance rate: 59 of 204 submissions, 29%

