DOI: 10.1145/3394486.3403072 · KDD Conference Proceedings · Research Article

Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning

Published: 20 August 2020

ABSTRACT

The flourishing of social media platforms requires techniques for understanding media content at a large scale. However, state-of-the-art video event understanding approaches remain very limited in their ability to deal with data sparsity, semantically unrepresentative event names, and a lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions, rather than from event names; (2) aggregating the ranks of grounded concepts for MED tasks, where a statistical mean-shift outlier rejection model is proposed to remove incorrectly grounded concepts; and (3) defining MEC tasks and augmenting the MEC training set with the videos detected by MED in a zero-shot setting. To the best of our knowledge, this is the first work to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
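The rank aggregation and outlier rejection described in step (2) can be sketched as follows. This is a simplified illustration under assumptions of our own (a Gaussian-kernel mean-shift on scalar concept-relevance scores, and Borda-style mean-rank aggregation), not the paper's exact model; the function names and parameters are hypothetical.

```python
import numpy as np

def reject_outlier_concepts(scores, n_iter=10, tol=1e-4):
    """Mean-shift-style filter: shift an estimate toward the densest
    cluster of concept-relevance scores, then drop concepts lying far
    from the converged mode (assumed 2-bandwidth cutoff)."""
    scores = np.asarray(scores, dtype=float)
    mode = scores.mean()
    bandwidth = scores.std() + 1e-8
    for _ in range(n_iter):
        # Gaussian kernel weights centered on the current mode estimate
        weights = np.exp(-((scores - mode) ** 2) / (2 * bandwidth ** 2))
        new_mode = (weights * scores).sum() / weights.sum()
        if abs(new_mode - mode) < tol:
            break
        mode = new_mode
    # Keep only concepts close to the dominant mode
    return np.abs(scores - mode) <= 2 * bandwidth

def aggregate_ranks(rank_lists):
    """Borda-style aggregation: average each concept's rank across
    ranked lists, then sort concepts by mean rank (best first)."""
    ranks = np.asarray(rank_lists, dtype=float)
    return ranks.mean(axis=0).argsort()

# Example: the fourth concept's score is far from the dense cluster,
# so the mean-shift filter flags it as an incorrectly grounded concept.
keep = reject_outlier_concepts([0.8, 0.82, 0.79, 0.05])
order = aggregate_ranks([[0, 1, 2], [1, 2, 0]])
```

In this sketch, outlier rejection runs before aggregation, so grounding errors cannot drag down the fused ranking; the real model in the paper formulates this statistically rather than with a fixed cutoff.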


Published in: KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2020, 3664 pages. ISBN: 9781450379984. DOI: 10.1145/3394486. Copyright © 2020 ACM.


Publisher: Association for Computing Machinery, New York, NY, United States.


Overall acceptance rate: 1,133 of 8,635 submissions (13%).
