ABSTRACT
The flourishing of social media platforms calls for techniques that can understand media content at large scale. However, state-of-the-art video event understanding approaches remain limited in their ability to cope with data sparsity, semantically unrepresentative event names, and a lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than from their event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model removes concepts that are incorrectly grounded; and (3) defining MEC tasks and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this is the first work to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
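The abstract's second component, rank aggregation of grounded concepts with statistical outlier rejection, can be illustrated with a minimal sketch. This is not the paper's implementation: the averaging-based aggregation, the iterative standard-deviation criterion, and all function names and thresholds below are illustrative assumptions.

```python
import numpy as np

def aggregate_ranks(rank_lists):
    """Aggregate per-source concept ranks by averaging (a simple
    Borda-style scheme, assumed here for illustration).
    rank_lists: list of rank vectors, one per ranking source."""
    ranks = np.array(rank_lists, dtype=float)  # shape: (sources, concepts)
    return ranks.mean(axis=0)

def mean_shift_outlier_rejection(scores, threshold=1.5, max_iter=10):
    """Iteratively drop concepts whose grounding score deviates from
    the mean of the currently kept set by more than `threshold`
    standard deviations; the mean shifts as outliers are removed.
    Returns a boolean keep-mask over the concepts."""
    scores = np.asarray(scores, dtype=float)
    keep = np.ones(len(scores), dtype=bool)
    for _ in range(max_iter):
        mu, sigma = scores[keep].mean(), scores[keep].std()
        if sigma == 0:
            break
        new_keep = keep & (np.abs(scores - mu) <= threshold * sigma)
        if new_keep.sum() == keep.sum():  # converged: no further rejections
            break
        keep = new_keep
    return keep
```

For example, a concept whose grounding score sits far below the other concepts' scores (e.g. 0.1 against 0.85-0.9) is rejected after the first pass, and the mean recentres on the remaining, consistently grounded concepts.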
Index Terms: Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning