Abstract
Event detection has long been a fundamental problem in the computer vision community. Various datasets for recognizing human events and activities, such as UCF101 and HMDB51, have been proposed to help develop better models and methods. In these datasets, either predefined scripts are provided or the images are largely actor-centered with little background noise. These properties differ sharply from those of surveillance event detection, so solutions that are effective on these datasets do not transfer well to surveillance video. Event detection in complex surveillance video is a much more difficult task with several challenges: heavy occlusion between pedestrians, low image resolution, and uncontrolled scene conditions. The TRECVID-SED evaluation, which aims at detecting events in a highly crowded airport, is well known for its difficulty. To deal with event detection in realistic scenes such as TRECVID-SED, we introduce a comprehensive solution framework based on pedestrian detection, deep key-pose detection, and trajectory analysis. Specifically, instead of detecting the whole body of a person, we detect the head-shoulder region of each pedestrian, which addresses the heavy occlusion of pedestrians in complex scenes. We also propose a trajectory-based event detection method to better focus on the key actors of events. For events with discriminative poses, we model event detection as key-pose detection, taking advantage of Faster R-CNN. The presented framework achieves the best result in the TRECVID-SED 2016 evaluation.
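As a rough illustration of the trajectory-based idea, the sketch below flags fast-moving pedestrian trajectories as candidate events. It is not the authors' method: the track format of (frame, x, y) head-shoulder centers, the function names `mean_speed` and `fast_tracks`, the 25 fps frame rate, and the speed threshold are all assumptions made for this example.

```python
from math import hypot


def mean_speed(track, fps=25.0):
    """Average speed in pixels per second of a track.

    `track` is a list of (frame, x, y) points, assumed here to be the
    centers of head-shoulder detections linked over time (hypothetical format).
    """
    if len(track) < 2:
        return 0.0
    # Sum the distances between consecutive points of the track.
    dist = sum(
        hypot(x1 - x0, y1 - y0)
        for (_, x0, y0), (_, x1, y1) in zip(track, track[1:])
    )
    frames = track[-1][0] - track[0][0]
    return dist * fps / frames if frames > 0 else 0.0


def fast_tracks(tracks, speed_threshold, fps=25.0):
    """Return the ids of tracks whose mean speed exceeds the threshold."""
    return [
        tid for tid, tr in tracks.items()
        if mean_speed(tr, fps) > speed_threshold
    ]


# Two toy tracks over one second of video (assumed 25 fps).
tracks = {
    "walker": [(0, 100.0, 200.0), (25, 130.0, 200.0)],
    "runner": [(0, 100.0, 300.0), (25, 300.0, 300.0)],
}
print(fast_tracks(tracks, speed_threshold=100.0))  # prints ['runner']
```

A fixed pixel-speed threshold ignores perspective, so in a real surveillance scene the threshold would likely need to depend on image position or camera calibration.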
Additional information
This work is supported by the Key Laboratory of Forensic Marks, Ministry of Public Security, Beijing, China, and the Chinese National Natural Science Foundation (61532018, 61372169, 61471049).
About this article
Cite this article
Zhu, Y., Zhou, K., Wang, M. et al. A comprehensive solution for detecting events in complex surveillance videos. Multimed Tools Appl 78, 817–838 (2019). https://doi.org/10.1007/s11042-018-6163-6
DOI: https://doi.org/10.1007/s11042-018-6163-6