ABSTRACT
The flourishing of social media platforms calls for techniques that can understand media content at large scale. However, state-of-the-art video event understanding approaches remain limited in their ability to cope with data sparsity, semantically unrepresentative event names, and a lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than from their event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model removes concepts that are incorrectly grounded; and (3) defining MEC tasks and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this is the first work to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
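The abstract's second component, rank aggregation of grounded concepts with statistical outlier rejection, can be illustrated with a minimal sketch. This is not the paper's implementation: the averaging-based aggregation, the iterative standard-deviation criterion, and all function names and thresholds below are illustrative assumptions.

```python
import numpy as np

def aggregate_ranks(rank_lists):
    """Aggregate per-source concept ranks by averaging (a simple
    Borda-style scheme, assumed here for illustration).
    rank_lists: list of rank vectors, one per ranking source."""
    ranks = np.array(rank_lists, dtype=float)  # shape: (sources, concepts)
    return ranks.mean(axis=0)

def mean_shift_outlier_rejection(scores, threshold=1.5, max_iter=10):
    """Iteratively drop concepts whose grounding score deviates from
    the mean of the currently kept set by more than `threshold`
    standard deviations; the mean shifts as outliers are removed.
    Returns a boolean keep-mask over the concepts."""
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(len(scores), dtype=bool)
    for _ in range(max_iter):
        mu, sigma = scores[keep].mean(), scores[keep].std()
        if sigma == 0:
            break
        new_keep = keep & (np.abs(scores - mu) <= threshold * sigma)
        if new_keep.sum() == keep.sum():  # converged: no further rejections
            break
        keep = new_keep
    return keep
```

For example, a concept whose grounding score sits far below the other concepts' scores (e.g. 0.1 against 0.85-0.9) is rejected after the first pass, and the mean recentres on the remaining, consistently grounded concepts.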
Index Terms: Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning