Event patches: Mining effective parts for event detection and understanding
Introduction
Event detection, which targets complex events such as “Birthday party” among numerous long video sequences, has recently attracted growing interest from both academia and industry [1], [2], [3], [4], [5]. It remains a challenging video analysis task due to the tremendous intra-class variation of events. Landing a fish, for example, can be done in different scenes with different fishing tools.
Recently, deep neural networks (DNNs), especially convolutional neural networks (CNNs) [6], [7], have demonstrated remarkable power in learning feature representations, leading to record-breaking improvements on almost all computer vision tasks, e.g. image classification [7], [8], [9], [10], object detection [11], [12], [13], [14], saliency detection [15], [16], [17], [18], visual tracking [19], [20], semantic segmentation [21], [22], [23], and super-resolution [24], [25]. Among these tasks, some treat the CNN as a black-box feature extractor and use the deep features directly, while others conduct in-depth studies of the properties of CNN features pre-trained offline on massive image data for the ImageNet classification task. Karpathy et al. [26] detected objects in every image with Region Convolutional Neural Networks (RCNN) and utilized a 4096-dimensional activation of the fully connected layer as the image representation; their visual-semantic alignment model produces state-of-the-art results. In [19], the properties of CNN features are studied in depth, which helps the design of effective CNN-based visual trackers.
However, previous approaches [27], [28], [29], [30], [31], [32] in video and event analysis suffered from huge computational costs in their multiple feature extraction and classification processes. Instead, Xu et al. [33] proposed a discriminative CNN video representation for event detection: they represented each video with a global VLAD [34], [35] encoding of CNN descriptors, using a set of latent concept descriptors as frame descriptors, and achieved a new state-of-the-art performance in event detection. Despite these successes, Xu’s method samples all video frames at the same rate to build global video representations, without considering that motions in a video may vary greatly in speed and semantic importance. Research on human activity prediction [36], an active topic that predicts human activities from only partially observed videos, supports this view. Because videos for event detection differ in many aspects, such as length, size, situation, and viewpoint, the information carried by video parts is patchy [37]: only some parts are effective for detection, while others may be redundant or even noisy. A multirate sampling solution [38] was later proposed to account for variation in content motion speed. However, it remains unclear whether a given video part is important or redundant, i.e., whether or not it contributes to defining the event. The authors of [39] proposed a temporal segment network for action recognition, which divides each video into three temporal segments, pools snippets from them, and models the long-range temporal structure of action videos. Their work inspires us to exploit the semantic and temporal information of video segments for better event understanding.
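The global VLAD encoding of CNN descriptors that the above line of work builds on can be illustrated with a minimal NumPy sketch. This assumes pre-computed frame descriptors and k-means centers; the function name `vlad_encode` and the signed square-root plus L2 normalization follow common VLAD practice, not necessarily the exact variant used in [33].

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """VLAD: for each cluster center, accumulate the residuals of the
    descriptors hard-assigned to it, then flatten, apply signed
    square-root (power) normalization, and L2-normalize."""
    n_centers, dim = centers.shape
    # hard-assign each descriptor to its nearest center
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assignments = np.argmin(dists, axis=1)
    vlad = np.zeros((n_centers, dim))
    for k in range(n_centers):
        assigned = descriptors[assignments == k]
        if len(assigned):
            vlad[k] = (assigned - centers[k]).sum(axis=0)
    vlad = vlad.ravel()                              # (n_centers * dim,)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))     # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting vector has dimension K × D (e.g. 256 × 256 with the settings of [33]), which is why PCA is typically applied to the descriptors beforehand.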
In this work, we strive to mine the effective parts that contribute most to defining events and utilize them for event detection. We name these parts Event Patches. Our approach aims to mine appropriate video representations and use them for better event detection. Concretely, the contributions of this work are threefold:
- We propose the idea that only parts of a video are effective for event detection and strive to mine these definitive Event Patches. Unlike previous event detection methods that sample video frames uniformly for global video representations, without considering that some video parts might be redundant or noisy, we argue that the event patches mined by our method should play a more important role in the task.
- We propose to use deep features and VLAD encoding for a concatenated video segment representation, and then introduce a Sparse Coding based Video Patches Mining (SVPM) method for event patch mining and event detection. For each event, its event patches often relate to one or several common visual concepts. Despite the high intra-class variation of event videos, our method successfully obtains reasonable “epitomes” of events.
- We validate the mined Event Patches on event detection tasks. Compared with the previous CNN video representation method, our approach achieves an impressive improvement.
The remainder of this paper is organized as follows. Section 2 reviews related work. We then introduce the proposed video segment representation and our sparse-coding-based event patch mining method in Section 3. Section 4 presents the datasets and experimental results. Finally, Section 5 concludes the paper.
Section snippets
Related work
Event detection. Based on the difficulty, event detection can be categorized into simple event detection and multimedia event detection (MED).
Simple event detection includes the detection of news events [40], sports events [41], and unusual surveillance events or those with repetitive patterns [42], [43]. Compared with MED videos, these events are usually well defined and describable by short video sequences.
Multimedia event detection was first introduced in the TRECVID competition by NIST for
Method
We follow the work of Xu et al. [33] in using CNN representations of videos for event detection, which has been demonstrated to be the best single feature for the task. In their work, detection performance is best when the PCA dimension is 256 and the number of VLAD centers K is set to 256. To enable comparison with their method, we follow their protocol of sampling every fifth frame and use the same settings for PCA and K. Also, as their work used global representations for
Experiments
To evaluate the effectiveness of the proposed method, we conduct experiments on the TRECVID MED 2011 dataset.
Conclusions
In this paper, we mined the effective parts that contribute most to defining events and utilized them for event detection. To this end, we introduced the idea of Event Patches: only parts of a video are effective for event detection. We first divided videos into segments and proposed a concatenated video segment representation (CVSeg) based on deep features and VLAD encoding. Event detection performances of our CVSeg method and the previous CNN video representation method are
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61472103, 61772158, 61701273 and 61702136) and the Project Funded by China Postdoctoral Science Foundation (No. 2017M610897).
References (52)
- et al., Multimedia social event detection in microblog, MultiMedia Modeling, 2015.
- et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., 2015.
- et al., A large-scale benchmark dataset for event recognition in surveillance video, IEEE Conference on Computer Vision and Pattern Recognition, 2011.
- et al., Machine recognition of human activities: a survey, IEEE Trans. Circuits Syst. Video Technol., 2008.
- et al., Real-time multimedia social event detection in microblog, IEEE Trans. Cybern., 2017.
- et al., Gradient-based learning applied to document recognition, Proc. IEEE, 1998.
- et al., Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012.
- et al., Very deep convolutional networks for large-scale image recognition, CoRR, 2014.
- et al., Continuous probability distribution prediction of image emotions via multitask shared sparse regression, IEEE Trans. Multimed., 2017.